[
https://issues.apache.org/jira/browse/CONNECTORS-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239599#comment-14239599
]
Karl Wright commented on CONNECTORS-1118:
-----------------------------------------
The best solution would be to promote the PipelineConnections and
PipelineConnectionsWithVersions classes to be first-class API-level objects,
preferably with some interface depiction, e.g. IPipelineConnections and
IPipelineConnectionsWithVersions. All the methods in IIncrementalIngester
would be changed to take IPipelineConnections inputs instead of
IPipelineSpecification objects. Then it would be possible to cache the objects
for at least the duration of a single document's processing.
This is not a trivial change and will require some time to implement.
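
A minimal sketch of the per-document caching idea, for illustration only. The
interface body, the PerDocumentPipelineCache class, and the loader are all
hypothetical stand-ins and not the actual ManifoldCF API; the sketch assumes a
single worker thread owns a document for the duration of its processing.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for the proposed first-class interface. */
interface IPipelineConnections {
    boolean checkMimeTypeIndexable(String mimeType);
    boolean checkLengthIndexable(long length);
}

/**
 * Hypothetical per-document holder: builds the expensive object at most once
 * and reuses it for every check made while one document is being processed.
 */
final class PerDocumentPipelineCache {
    private final Supplier<IPipelineConnections> loader;
    private IPipelineConnections cached; // stays null until first use

    PerDocumentPipelineCache(Supplier<IPipelineConnections> loader) {
        this.loader = loader;
    }

    /** Not thread-safe: assumes one worker thread per document. */
    IPipelineConnections get() {
        if (cached == null) {
            cached = loader.get(); // the expensive, lock-taking load happens here
        }
        return cached;
    }
}

class PipelineCacheDemo {
    public static void main(String[] args) {
        final int[] loads = {0};
        PerDocumentPipelineCache cache = new PerDocumentPipelineCache(() -> {
            loads[0]++; // count how often the expensive load runs
            return new IPipelineConnections() {
                public boolean checkMimeTypeIndexable(String mimeType) {
                    return "text/plain".equals(mimeType);
                }
                public boolean checkLengthIndexable(long length) {
                    return length < 1_000_000L;
                }
            };
        });
        // Both checks for the same document share one loaded instance.
        cache.get().checkMimeTypeIndexable("text/plain");
        cache.get().checkLengthIndexable(2048L);
        System.out.println("loads = " + loads[0]);
    }
}
```

With a holder like this created once per document, the two eligibility checks
would trigger one load instead of two. Cache invalidation across documents is
deliberately left out of the sketch.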
It's also worth noting that the *reason* for the locking in this case is for
cache management. The objects that are being loaded are in fact cached objects
constructed from their database images -- locking is needed to ensure cache
consistency only. If ZooKeeper is so slow that it is dragging down even our
caching implementation, we should seriously consider chucking it in favor of
another solution.
> Documents processed by the shared drive connector incur an unnecessary
> synchronisation hit
> ------------------------------------------------------------------------------------------
>
> Key: CONNECTORS-1118
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1118
> Project: ManifoldCF
> Issue Type: Improvement
> Components: Framework core
> Affects Versions: ManifoldCF 1.7.2
> Reporter: Aeham Abushwashi
> Assignee: Karl Wright
>
> Each document processed by the shared drive connector is passed through
> SharedDriveConnector#checkInclude to verify whether the document is eligible
> for ingestion. The calls made here to
> WorkerThread$ProcessActivity#checkMimeTypeIndexable and
> WorkerThread$ProcessActivity#checkLengthIndexable are unnecessarily costly as
> they each create a fresh instance of IncrementalIngester$PipelineConnections
> on every call. The constructor of IncrementalIngester$PipelineConnections can
> be very expensive due to the loading of output connection objects, which in
> turn requires some locking (via ZK - in a distributed environment).
> The other area of inefficiency is in
> WorkerThread$ProcessActivity#processDocumentReferences. This method creates
> new instances of PriorityCalculator using the less-efficient 3-arg
> constructor. This can be addressed using the same pattern implemented for
> CONNECTORS-1094.
> To highlight the impact of the above calls, I profiled an active worker
> thread for 40 minutes. During that window, it spent ~23 minutes in
> SharedDriveConnector#checkInclude and its callees, plus 9 minutes creating
> instances of PriorityCalculator.
> I've seen the above issues when using the shared drive connector but I think
> other connectors too could be impacted - depending on how they're implemented.