Aeham Abushwashi created CONNECTORS-1118:
--------------------------------------------
Summary: Documents processed by the shared drive connector incur
an unnecessary synchronisation hit
Key: CONNECTORS-1118
URL: https://issues.apache.org/jira/browse/CONNECTORS-1118
Project: ManifoldCF
Issue Type: Improvement
Components: Framework core
Affects Versions: ManifoldCF 1.7.2
Reporter: Aeham Abushwashi
Each document processed by the shared drive connector is passed through
SharedDriveConnector#checkInclude to verify whether the document is eligible
for ingestion. The calls made here to
WorkerThread$ProcessActivity#checkMimeTypeIndexable and
WorkerThread$ProcessActivity#checkLengthIndexable are unnecessarily costly as
they each create a fresh instance of IncrementalIngester$PipelineConnections on
every call. The constructor of IncrementalIngester$PipelineConnections can be
very expensive due to the loading of output connection objects, which in turn
requires some locking (via ZK - in a distributed environment).
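
One way to avoid the repeated construction is to build the expensive object once per activity and reuse it across checks. The following is a minimal sketch of that idea only, not the actual ManifoldCF code: Connections, Activity and the check logic are hypothetical stand-ins for IncrementalIngester$PipelineConnections and WorkerThread$ProcessActivity, and it assumes each activity instance is used by a single worker thread.

```java
class LazyConnectionsSketch {
    // Hypothetical stand-in for IncrementalIngester$PipelineConnections.
    // In the real class the constructor loads output connections and takes locks.
    static final class Connections {
        static int constructed = 0; // counts constructions, for illustration only

        Connections() {
            constructed++;
        }

        boolean mimeTypeIndexable(String mimeType) {
            return !mimeType.isEmpty(); // placeholder rule, an assumption
        }

        boolean lengthIndexable(long length) {
            return length >= 0 && length < 1_000_000L; // placeholder rule, an assumption
        }
    }

    // Hypothetical stand-in for WorkerThread$ProcessActivity.
    static final class Activity {
        private Connections cached; // built on first use, then reused

        private Connections connections() {
            if (cached == null) {
                cached = new Connections();
            }
            return cached;
        }

        boolean checkMimeTypeIndexable(String mimeType) {
            return connections().mimeTypeIndexable(mimeType);
        }

        boolean checkLengthIndexable(long length) {
            return connections().lengthIndexable(length);
        }
    }

    public static void main(String[] args) {
        Activity activity = new Activity();
        activity.checkMimeTypeIndexable("text/plain");
        activity.checkLengthIndexable(1024L);
        System.out.println("constructions: " + Connections.constructed);
    }
}
```

With this shape, any number of checks against the same activity pays the construction cost once. Whether a cached instance can go stale (for example if output connections change mid-job) would need to be considered in a real fix.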
The other area of inefficiency is in
WorkerThread$ProcessActivity#processDocumentReferences. This method creates new
instances of PriorityCalculator using the less-efficient 3-arg constructor.
This can be addressed using the same pattern implemented for CONNECTORS-1094.
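
The details of CONNECTORS-1094 are not restated here. The sketch below assumes the pattern is to resolve the expensive dependency once and hand it to a cheaper constructor, instead of letting each new instance look it up again. All names (Connection, ConnectionLoader, Calculator, forReferences) are hypothetical and do not reflect the real PriorityCalculator signatures.

```java
import java.util.ArrayList;
import java.util.List;

class CalculatorReuseSketch {
    // Hypothetical stand-in for a loaded connection object.
    static final class Connection {
        final String name;

        Connection(String name) {
            this.name = name;
        }
    }

    // Hypothetical loader; 'loads' counts how often the costly lookup runs.
    static final class ConnectionLoader {
        int loads = 0;

        Connection load(String name) {
            loads++;
            return new Connection(name);
        }
    }

    // Hypothetical stand-in for PriorityCalculator.
    static final class Calculator {
        final Connection connection;
        final String documentId;

        // Costly form: performs a lookup on every construction.
        Calculator(ConnectionLoader loader, String connectionName, String documentId) {
            this(loader.load(connectionName), documentId);
        }

        // Cheaper form: the caller supplies an already-loaded connection.
        Calculator(Connection connection, String documentId) {
            this.connection = connection;
            this.documentId = documentId;
        }
    }

    // Loads the connection once, then shares it across all references.
    static List<Calculator> forReferences(ConnectionLoader loader,
                                          String connectionName,
                                          List<String> documentIds) {
        Connection shared = loader.load(connectionName);
        List<Calculator> calculators = new ArrayList<>();
        for (String id : documentIds) {
            calculators.add(new Calculator(shared, id));
        }
        return calculators;
    }
}
```

In this sketch the lookup count stays at one regardless of how many document references are processed, whereas the costly constructor would perform one lookup per reference.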
To highlight the impact of the above calls, I profiled an active worker thread
for 40 minutes. During that window, it spent ~23 minutes in
SharedDriveConnector#checkInclude and its callees, plus 9 minutes creating
instances of PriorityCalculator.
I've seen the above issues when using the shared drive connector, but I think
other connectors could be impacted too, depending on how they're implemented.