Aeham Abushwashi created CONNECTORS-1118:
--------------------------------------------

             Summary: Documents processed by the shared drive connector incur an unnecessary synchronisation hit
                 Key: CONNECTORS-1118
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1118
             Project: ManifoldCF
          Issue Type: Improvement
          Components: Framework core
    Affects Versions: ManifoldCF 1.7.2
            Reporter: Aeham Abushwashi


Each document processed by the shared drive connector is passed through 
SharedDriveConnector#checkInclude to verify whether the document is eligible 
for ingestion. The calls made here to 
WorkerThread$ProcessActivity#checkMimeTypeIndexable and 
WorkerThread$ProcessActivity#checkLengthIndexable are unnecessarily costly as 
they each create a fresh instance of IncrementalIngester$PipelineConnections on 
every call. The constructor of IncrementalIngester$PipelineConnections can be 
very expensive because it loads output connection objects, which in turn 
requires some locking (via ZooKeeper in a distributed environment).
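
One way to avoid the repeated construction is to build the expensive object 
once and reuse it across both checks. The following is only a minimal sketch 
of that idea, not the ManifoldCF code: LazyHolder, PipelineConnectionsStub and 
ActivitySketch are hypothetical stand-ins, and the sketch assumes the holder 
is only ever used from a single worker thread.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Builds a value on first use and reuses it afterwards (not thread-safe). */
final class LazyHolder<T> {
    private final Supplier<T> factory;
    private T value;
    private boolean loaded;

    LazyHolder(Supplier<T> factory) {
        this.factory = factory;
    }

    T get() {
        if (!loaded) {
            value = factory.get();
            loaded = true;
        }
        return value;
    }
}

/** Hypothetical stand-in for the expensive pipeline connections object. */
final class PipelineConnectionsStub {
    /** Counts constructions so the saving can be observed. */
    static final AtomicInteger CONSTRUCTED = new AtomicInteger();

    private final long maxLength;

    PipelineConnectionsStub(long maxLength) {
        // In the real system this is where output connections would be
        // loaded under a lock; here we only count the construction.
        CONSTRUCTED.incrementAndGet();
        this.maxLength = maxLength;
    }

    boolean acceptsMimeType(String mimeType) {
        return mimeType != null && mimeType.startsWith("text/");
    }

    boolean acceptsLength(long length) {
        return length <= maxLength;
    }
}

/** Hypothetical per-thread activity object that shares one connections instance. */
final class ActivitySketch {
    private final LazyHolder<PipelineConnectionsStub> connections =
        new LazyHolder<>(() -> new PipelineConnectionsStub(1_000_000L));

    boolean checkMimeTypeIndexable(String mimeType) {
        return connections.get().acceptsMimeType(mimeType);
    }

    boolean checkLengthIndexable(long length) {
        return connections.get().acceptsLength(length);
    }
}

final class CachedConnectionsDemo {
    public static void main(String[] args) {
        ActivitySketch activity = new ActivitySketch();
        activity.checkMimeTypeIndexable("text/plain");
        activity.checkLengthIndexable(2048L);
        // Both checks share the single instance created on first use.
        System.out.println("constructions: " + PipelineConnectionsStub.CONSTRUCTED.get());
    }
}
```

Whether the cached object should live for one document, one job or the whole 
worker thread depends on when the underlying output connections can change, 
which this sketch does not model.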

The other area of inefficiency is in 
WorkerThread$ProcessActivity#processDocumentReferences. This method creates new 
instances of PriorityCalculator using the less-efficient 3-arg constructor. 
This can be addressed using the same pattern implemented for CONNECTORS-1094.

To highlight the impact of the above calls, I profiled an active worker thread 
for 40 minutes. During that window, it spent ~23 minutes in 
SharedDriveConnector#checkInclude and its callees, and a further 9 minutes 
creating instances of PriorityCalculator.

I've seen the above issues when using the shared drive connector, but I think 
other connectors could be impacted too, depending on how they're implemented.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
