[
https://issues.apache.org/jira/browse/CONNECTORS-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16377380#comment-16377380
]
Karl Wright commented on CONNECTORS-1497:
-----------------------------------------
Ok, I've had a chance to look in more detail.
First, the only change you really need is to have the case statement in
JobQueue.updateExistingRecordInitial() do the same thing for
STATUS_PENDING and STATUS_PENDINGPURGATORY, e.g.:
{code}
case STATUS_PENDINGPURGATORY:
case STATUS_PENDING:
  // ... code as before
{code}
Second, make sure addDocumentsInitial() gets called with
overrideDocumentSchedule set.
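To illustrate the two steps above, here is a minimal, self-contained sketch of the fall-through idea. It is not the ManifoldCF source: the class SeedingSketch, the method checkTimeAfterSeeding(), the numeric status values, and the rule "move the check time forward only when the override flag is set" are all assumptions made for illustration, since the body of the real case is elided above.

```java
// Hypothetical sketch only; names and values are illustrative, not ManifoldCF's.
class SeedingSketch {
    // Assumed status codes; the real constants in JobQueue may have other values.
    static final int STATUS_PENDING = 0;
    static final int STATUS_PENDINGPURGATORY = 1;
    static final int STATUS_COMPLETE = 2;

    /**
     * Returns the check time a record would have after being seeded again.
     * Assumption: with the override flag set, a pending record is pulled
     * forward to the desired execute time if that is earlier.
     */
    static long checkTimeAfterSeeding(int status, long existingCheckTime,
                                      long desiredExecuteTime,
                                      boolean overrideDocumentSchedule) {
        switch (status) {
            case STATUS_PENDINGPURGATORY: // falls through: same handling as PENDING
            case STATUS_PENDING:
                if (overrideDocumentSchedule) {
                    return Math.min(existingCheckTime, desiredExecuteTime);
                }
                return existingCheckTime;
            default:
                // Other states are left alone in this sketch.
                return existingCheckTime;
        }
    }
}
```

With an "infinite" recrawl time modelled as Long.MAX_VALUE, both pending states get rescheduled to the desired time only when the override flag is passed; without it the existing schedule is kept.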
As for committing these changes to trunk, I'm still leaning against this.
Seeding is not supposed to set the document schedule; it's only supposed to make
documents eligible for processing. The fact that you are using an infinite
recrawl time for your documents is not a good reason to make a change of this
kind; what you are in effect trying to do is exactly what I said at the outset:
run your job completely multiple times, not continuously once.
We usually recommend people schedule "minimal" job runs often and "complete"
runs once in a while to purge deleted documents. This has the advantage of
giving you control over when the expensive 'crawl everything' delete takes
place. I still think that's the best model for you, barring any other
information.
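As a rough picture of that model, the sketch below alternates frequent "minimal" runs with an occasional "complete" run. It is only an illustration: RunPlanSketch, RunType, runTypeFor(), and the "every Nth run is complete" rule are hypothetical and are not part of ManifoldCF, where the choice is made through the job's schedule settings.

```java
// Hypothetical planning helper; it does not call any ManifoldCF API.
class RunPlanSketch {
    enum RunType { MINIMAL, COMPLETE }

    /**
     * Decides which kind of run to start.
     * Assumption: run 0 and every completeEvery-th run after it is a
     * complete run (which can purge deleted documents); all others are minimal.
     */
    static RunType runTypeFor(int runIndex, int completeEvery) {
        if (runIndex < 0 || completeEvery <= 0) {
            throw new IllegalArgumentException("runIndex must be >= 0 and completeEvery > 0");
        }
        return (runIndex % completeEvery == 0) ? RunType.COMPLETE : RunType.MINIMAL;
    }
}
```

For example, with completeEvery set to 24 and one run per hour, the expensive complete pass would happen once a day and the other 23 runs would be minimal.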
> Re-index seeded modified documents when the re-crawl interval is infinity and
> connector model is MODEL_ADD_CHANGE
> -------------------------------------------------------------------------------------------------------------------
>
> Key: CONNECTORS-1497
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1497
> Project: ManifoldCF
> Issue Type: Improvement
> Components: Framework agents process
> Affects Versions: ManifoldCF 2.9.1
> Reporter: Ahmed Mahfouz
> Assignee: Karl Wright
> Priority: Major
> Attachments: CONNECTORS-1497.patch, CONNECTORS-1497.patch2
>
>
> I am trying to avoid a full scan of all documents, for better efficiency with a
> large number of documents. I tried many different settings for the jobs but
> I couldn't accomplish that. In particular, when the repository connector model is
> MODEL_ADD_CHANGE, I expected seeded modified documents to be
> re-indexed immediately, like new seeds, but I found that it uses the
> re-crawl time as the scheduled time, so they wait for the full scan to get
> re-indexed. I avoided the full scan by setting the re-crawl interval to infinity,
> but my modified document seeds were still not getting indexed. After
> digging into the code for quite some time, I made some modifications to
> JobManager and it worked for me. I would like to share the change with you
> for review, so I opened this ticket.