[ 
https://issues.apache.org/jira/browse/CONNECTORS-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239599#comment-14239599
 ] 

Karl Wright commented on CONNECTORS-1118:
-----------------------------------------

The best solution would be to promote the PipelineConnections and 
PipelineConnectionsWithVersions classes to be first-class API-level objects, 
preferably with corresponding interfaces, e.g. IPipelineConnections and 
IPipelineConnectionsWithVersions.  All the methods in IIncrementalIngester 
would be changed to take IPipelineConnections inputs instead of 
IPipelineSpecification objects.  Then it would be possible to cache the objects 
for at least the duration of a single document's processing.
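
A rough sketch of the caching idea, not a proposed patch.  The two interfaces 
below are empty stand-ins for illustration only (the real 
IPipelineSpecification has its own members, and IPipelineConnections does not 
exist yet), and PerDocumentPipelineCache and its loader function are 
hypothetical names.  It assumes one cache instance is created per document and 
used by a single worker thread, so it does no synchronization of its own.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in; the real IPipelineSpecification looks different.
interface IPipelineSpecification {}

// Hypothetical: the already-loaded connections for one pipeline specification.
interface IPipelineConnections {
    IPipelineSpecification getSpecification();
}

// Holds loaded connections for the lifetime of one document's processing,
// so the expensive (lock-taking) load happens at most once per specification.
final class PerDocumentPipelineCache {
    // Assumed to wrap whatever expensive loading the constructor does today.
    private final Function<IPipelineSpecification, IPipelineConnections> loader;
    private final Map<IPipelineSpecification, IPipelineConnections> cache = new HashMap<>();

    PerDocumentPipelineCache(Function<IPipelineSpecification, IPipelineConnections> loader) {
        this.loader = loader;
    }

    // Returns the cached connections, invoking the loader only on first use.
    IPipelineConnections get(IPipelineSpecification spec) {
        return cache.computeIfAbsent(spec, loader);
    }
}
```

With something like this in place, checkMimeTypeIndexable and 
checkLengthIndexable could both be handed the same IPipelineConnections 
instance for a given document rather than each constructing a new one.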

This is not a trivial change and will require some time to implement.

It's also worth noting that the *reason* for the locking in this case is 
cache management.  The objects that are being loaded are in fact cached objects 
constructed from their database images -- locking is needed only to ensure 
cache consistency.  If ZooKeeper is so slow that it is dragging down even our 
caching implementation, we should seriously consider chucking it in favor of 
another solution.

> Documents processed by the shared drive connector incur an unnecessary 
> synchronisation hit
> ------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1118
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1118
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Framework core
>    Affects Versions: ManifoldCF 1.7.2
>            Reporter: Aeham Abushwashi
>            Assignee: Karl Wright
>
> Each document processed by the shared drive connector is passed through 
> SharedDriveConnector#checkInclude to verify whether the document is eligible 
> for ingestion. The calls made here to 
> WorkerThread$ProcessActivity#checkMimeTypeIndexable and 
> WorkerThread$ProcessActivity#checkLengthIndexable are unnecessarily costly as 
> they each create a fresh instance of IncrementalIngester$PipelineConnections 
> on every call. The constructor of IncrementalIngester$PipelineConnections can 
> be very expensive due to the loading of output connection objects, which in 
> turn requires some locking (via ZK - in a distributed environment).
> The other area of inefficiency is in 
> WorkerThread$ProcessActivity#processDocumentReferences. This method creates 
> new instances of PriorityCalculator using the less-efficient 3-arg 
> constructor. This can be addressed using the same pattern implemented for 
> CONNECTORS-1094.
> To highlight the impact of the above calls, I profiled an active worker 
> thread for 40 minutes. During that window, it spent ~23 minutes in 
> SharedDriveConnector#checkInclude and its callees + 9 minutes creating 
> instances of PriorityCalculator.
> I've seen the above issues when using the shared drive connector, but I think 
> other connectors could be impacted too, depending on how they're implemented.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)