[
https://issues.apache.org/jira/browse/CONNECTORS-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246612#comment-14246612
]
Karl Wright commented on CONNECTORS-1122:
-----------------------------------------
The fundamental issue is that document bins are *not* stored in the schema.
Connectors produce the document bins for a given document in code. When a job
starts, certain documents in the job's queue are put into a state where they
need priorities to be determined. Similarly, when a job is aborted, documents
that had priorities in that job beforehand have to have those priorities
rescinded. Because document bins are global, either change makes the existing
allocation of document priorities incorrect whenever documents in other jobs
already have priorities assigned and share bins with the documents whose state
is changing. This is why ManifoldCF currently takes the approach of
reprioritizing all documents whenever (say) a job starts or ends.
At job start time, if only the documents being made active for the new job
were prioritized, any of those documents whose bins overlap with existing jobs
would be placed at the back of the line. *No* documents from the overlapping
bins would be processed in the new job until *all* the documents currently
prioritized in the older jobs were processed.
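The back-of-the-line effect can be illustrated with a toy model. This is only a sketch and is not ManifoldCF's actual priority calculation: the class name BinPrioritySketch, the per-bin "next free slot" counter, and the rule that a lower number runs sooner are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model (hypothetical, not ManifoldCF code): each bin hands out slots in
// order, and a document's priority is the latest slot among its bins.
class BinPrioritySketch {
    // Bins are global, so this map is shared by every job.
    private final Map<String, Integer> nextSlot = new HashMap<>();

    // Returns the assigned priority; a lower value is processed sooner.
    int assign(List<String> bins) {
        int priority = 0;
        for (String bin : bins) {
            priority = Math.max(priority, nextSlot.getOrDefault(bin, 0));
        }
        // Advance every bin of this document past the slot just used.
        for (String bin : bins) {
            nextSlot.put(bin, priority + 1);
        }
        return priority;
    }

    // Models a full reprioritization: forget all slots and start over.
    void reset() {
        nextSlot.clear();
    }

    public static void main(String[] args) {
        BinPrioritySketch sketch = new BinPrioritySketch();
        List<String> bin = List.of("example.com");

        // Older job: three documents already prioritized (slots 0, 1, 2).
        sketch.assign(bin);
        sketch.assign(bin);
        sketch.assign(bin);

        // New job starts and only its own document is prioritized:
        // it lands behind everything from the older job.
        System.out.println("incremental: " + sketch.assign(bin)); // prints 3

        // Full reprioritization lets the two jobs interleave instead.
        sketch.reset();
        sketch.assign(bin); // first document of the older job gets slot 0
        System.out.println("after reset: " + sketch.assign(bin)); // prints 1
    }
}
```

In this model the new job's document gets slot 3 when prioritized incrementally, but slot 1 when everything is reprioritized and the jobs are interleaved.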
At job end time, rescinding document priorities suddenly leaves "holes" in
the prioritization, and ManifoldCF's document distribution becomes less
efficient.
For the start case, it may be acceptable to not fully reprioritize. This is
one change that would be easy to explore. For the job abort case, it's not
going to work; the reprioritization must take place.
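One way to explore the job-start case is the bin-overlap check suggested in the issue description. The following is a minimal sketch under stated assumptions: BinOverlapSketch and its methods are hypothetical names, and the bins of already-prioritized documents are assumed to be available in memory. Since bins are computed by connectors in code rather than stored in the schema, a real implementation would first have to compute or track them somewhere.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of ManifoldCF.
class BinOverlapSketch {
    // Returns the bins shared between documents made active by a job start
    // and documents in other jobs that already have priorities.
    static Set<String> overlappingBins(Collection<List<String>> newDocBins,
                                       Collection<List<String>> existingDocBins) {
        Set<String> existing = new HashSet<>();
        for (List<String> bins : existingDocBins) {
            existing.addAll(bins);
        }
        Set<String> overlap = new HashSet<>();
        for (List<String> bins : newDocBins) {
            for (String bin : bins) {
                if (existing.contains(bin)) {
                    overlap.add(bin);
                }
            }
        }
        return overlap;
    }

    // A job with no documents, or with no shared bins, would not force
    // a reprioritization of documents in other jobs.
    static boolean needsReprioritization(Collection<List<String>> newDocBins,
                                         Collection<List<String>> existingDocBins) {
        return !overlappingBins(newDocBins, existingDocBins).isEmpty();
    }
}
```

This only covers job start. For the abort case the reprioritization still has to take place, as noted above.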
> Explore ways to make job start be faster in systems with lots of documents
> --------------------------------------------------------------------------
>
> Key: CONNECTORS-1122
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1122
> Project: ManifoldCF
> Issue Type: Improvement
> Components: Framework crawler agent
> Affects Versions: ManifoldCF 1.8, ManifoldCF 2.0
> Reporter: Karl Wright
> Assignee: Karl Wright
> Fix For: ManifoldCF 1.9, ManifoldCF 2.1
>
>
> Job start currently requires all documents to be marked as needing
> reprioritization. We should consider ways to reduce the need for this as
> much as possible. For example, if there are NO documents at all for a job,
> reprioritization is by definition unneeded. Alternatively, if we could
> determine whether there are any bin-level overlaps between the documents made
> active by a job start and documents elsewhere, we could be more targeted.