[ 
https://issues.apache.org/jira/browse/CONNECTORS-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246612#comment-14246612
 ] 

Karl Wright commented on CONNECTORS-1122:
-----------------------------------------

The fundamental issue is that document bins are *not* stored in the schema.  
Connectors produce the document bins for a given document in code.  When a job 
starts, certain documents in the job's queue are put into a state where they 
need priorities to be determined.  Similarly, when a job is aborted, documents 
that had priorities in that job beforehand have to have those priorities 
rescinded.  In both cases, because document bins are global, the existing 
allocation of document priorities becomes incorrect whenever documents in 
other jobs already have priorities assigned and share document bins with the 
documents whose state is changing.  This is why ManifoldCF currently 
reprioritizes all documents whenever (say) a job starts or ends.

At job start time, if priorities were assigned only to the documents being 
made active for the new job, then any of those documents whose bins overlapped 
with existing jobs would be placed at the back of the line. *No* documents 
from the overlapping bins would be processed in the new job until *all* the 
documents currently prioritized in the older jobs had been processed.
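
The "back of the line" effect can be illustrated with a toy model.  This is 
only a sketch and is not ManifoldCF's actual prioritization algorithm: the 
class ToyBinPrioritizer, its per-bin counter, and the bin names are all 
hypothetical, and the model simply assumes that lower priority values are 
processed first.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: NOT ManifoldCF's real prioritization algorithm.
// Each bin keeps a running counter; a document's priority is the largest
// counter among its bins, and lower values are assumed to be processed first.
class ToyBinPrioritizer {
    private final Map<String, Integer> nextSlotPerBin = new HashMap<>();

    // Assigns a priority to one document and advances every bin it belongs to.
    int assign(List<String> bins) {
        int priority = 0;
        for (String bin : bins) {
            priority = Math.max(priority, nextSlotPerBin.getOrDefault(bin, 0));
        }
        for (String bin : bins) {
            nextSlotPerBin.put(bin, priority + 1);
        }
        return priority;
    }

    public static void main(String[] args) {
        ToyBinPrioritizer p = new ToyBinPrioritizer();
        // Older job: three documents already prioritized in one shared bin.
        for (int i = 0; i < 3; i++) {
            p.assign(Arrays.asList("example.com"));
        }
        // New job, prioritizing only its own newly active documents.
        int overlapping = p.assign(Arrays.asList("example.com"));
        int fresh = p.assign(Arrays.asList("other.org"));
        System.out.println(overlapping + " " + fresh); // prints "3 0"
    }
}
```

In this model the new job's document in the shared bin lands behind all three 
older documents, while its document in an unshared bin goes straight to the 
front.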

At job end time, when you rescind document priorities, there are suddenly 
"holes" in the prioritization, and the efficiency of ManifoldCF document 
distribution becomes lower.

For the start case, it may be acceptable to not fully reprioritize.  This is 
one change that would be easy to explore.  For the job abort case, partial 
reprioritization is not going to work; the full reprioritization must take 
place.
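
One way the start case might be explored is a bin-overlap test.  The sketch 
below is an assumption for illustration, not existing ManifoldCF code: the 
class and method names are hypothetical, and because document bins are not 
stored in the schema, both sets of bin names would have to be computed by the 
connectors beforehand.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of ManifoldCF.
class BinOverlapCheck {
    // Returns true when any bin of the newly activated documents is also used
    // by a document that already holds a priority in another job.
    static boolean needsFullReprioritization(Set<String> newJobBins,
                                             Set<String> prioritizedBins) {
        return !Collections.disjoint(newJobBins, prioritizedBins);
    }

    public static void main(String[] args) {
        Set<String> prioritized = new HashSet<>(Arrays.asList("a.com", "b.com"));
        Set<String> overlapping = new HashSet<>(Arrays.asList("b.com", "c.com"));
        Set<String> separate = new HashSet<>(Arrays.asList("d.com"));
        System.out.println(needsFullReprioritization(overlapping, prioritized)); // true
        System.out.println(needsFullReprioritization(separate, prioritized));    // false
    }
}
```

A job with no active documents yields an empty set and therefore never 
requires reprioritization under this test.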


> Explore ways to make job start be faster in systems with lots of documents
> --------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1122
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1122
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Framework crawler agent
>    Affects Versions: ManifoldCF 1.8, ManifoldCF 2.0
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.9, ManifoldCF 2.1
>
>
> Job start currently requires all documents to be marked as needing 
> reprioritization.  We should consider ways in which we can reduce the need 
> to do this as much as possible.  For example, if there are NO documents at 
> all for a job, reprioritization is by definition unneeded.  Alternatively, 
> if we came up with a way of determining whether there are any bin-level 
> overlaps between documents made active by a job start and documents 
> elsewhere, we could be more targeted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
