[
https://issues.apache.org/jira/browse/CONNECTORS-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16377280#comment-16377280
]
Ahmed Mahfouz commented on CONNECTORS-1497:
-------------------------------------------
[[email protected]] I thought of that, but I didn't want to change how
ManifoldCF works for continuous jobs. I only wanted to limit the change to
jobs whose re-crawl interval is infinity (checkTimeValue is null), so that
modified seed documents can be reindexed right away. Even with the override
schedule set to true, the PENDINGPURGATORY status is still a hurdle to
modifying the execution time.
{code:java}
/** Update an existing record (as the result of an initial add).
* The record is presumed to exist and have been locked, via "FOR UPDATE".
*/
public void updateExistingRecordInitial(Long recordID, int currentStatus, Long checkTimeValue,
  long desiredExecuteTime, IPriorityCalculator desiredPriority, String[] prereqEvents,
  String processID)
  throws ManifoldCFException
{
  // The general rule here is:
  // If doesn't exist, make a PENDING entry.
  // If PENDING, keep it as PENDING.
  // If COMPLETE, make a PENDING entry.
  // If PURGATORY, make a PENDINGPURGATORY entry.
  // Leave everything else alone and do nothing.
  HashMap map = new HashMap();
  switch (currentStatus)
  {
  case STATUS_ACTIVE:
  case STATUS_ACTIVEPURGATORY:
  case STATUS_ACTIVENEEDRESCAN:
  case STATUS_ACTIVENEEDRESCANPURGATORY:
  case STATUS_BEINGCLEANED:
    // These are all the active states. Being in this state implies that a thread
    // may be working on the document. We must not interrupt it.
    // Initial adds never bring along any carrydown info, so we should be satisfied
    // as long as the record exists.
    break;
  case STATUS_COMPLETE:
  case STATUS_UNCHANGED:
  case STATUS_PURGATORY:
    // Set the status and time both
    map.put(statusField,statusToString(STATUS_PENDINGPURGATORY));
    TrackerClass.noteRecordChange(recordID, STATUS_PENDINGPURGATORY, "Update existing record initial");
    if (desiredExecuteTime == -1L)
      map.put(checkTimeField,new Long(0L));
    else
      map.put(checkTimeField,new Long(desiredExecuteTime));
    map.put(checkActionField,actionToString(ACTION_RESCAN));
    map.put(failTimeField,null);
    map.put(failCountField,null);
    // Update the doc priority.
    map.put(docPriorityField,new Double(desiredPriority.getDocumentPriority()));
    map.put(needPriorityField,needPriorityToString(NEEDPRIORITY_FALSE));
    break;
  case STATUS_PENDING:
    // Bump up the schedule if called for
    Long cv = checkTimeValue;
    if (cv != null)
    {
      long currentExecuteTime = cv.longValue();
      if ((desiredExecuteTime == -1L || currentExecuteTime <= desiredExecuteTime))
      {
        break;
      }
    }
    else
    {
      if (desiredExecuteTime == -1L)
      {
        break;
      }
    }
    map.put(checkTimeField,new Long(desiredExecuteTime));
    map.put(checkActionField,actionToString(ACTION_RESCAN));
    map.put(failTimeField,null);
    map.put(failCountField,null);
    // The existing doc priority field should be preserved.
    break;
  case STATUS_PENDINGPURGATORY:
    // In this case we presume that the reason we are in this state is due to
    // adaptive crawling or retry, so DON'T bump up the schedule!
    // The existing doc priority field should also be preserved.
    break;
  default:
    break;
  }
  map.put(isSeedField,seedstatusToString(SEEDSTATUS_NEWSEED));
  map.put(seedingProcessIDField,processID);
  // Delete any existing prereqevent entries first
  prereqEventManager.deleteRows(recordID);
  ArrayList list = new ArrayList();
  String query = buildConjunctionClause(list,new ClauseDescription[]{
    new UnitaryClause(idField,recordID)});
  performUpdate(map,"WHERE "+query,list,null);
  // Insert prereqevent entries, if any
  prereqEventManager.addRows(recordID,prereqEvents);
  noteModifications(0,1,0);
}
{code}
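To make the hurdle concrete, the scheduling decisions in the method above can
be modeled as a small standalone sketch. This is only an illustration, not
the attached patch and not ManifoldCF code: the class SeedRescheduleSketch,
the Status enum, the Decision type and the decide method are hypothetical
names, and the PENDINGPURGATORY branch shows one possible variation that
assumes a null checkTimeValue identifies a job with an infinite re-crawl
interval. The COMPLETE, PURGATORY and PENDING branches mirror the quoted
method.

```java
// Hypothetical standalone model of the scheduling decisions in
// updateExistingRecordInitial. It does not touch the database.
class SeedRescheduleSketch {
    // Subset of statuses, named after the constants in the quoted method.
    enum Status { ACTIVE, COMPLETE, PURGATORY, PENDING, PENDINGPURGATORY }

    // Resulting status plus the new check time (null means "leave unchanged").
    static final class Decision {
        final Status status;
        final Long checkTime;
        Decision(Status status, Long checkTime) {
            this.status = status;
            this.checkTime = checkTime;
        }
    }

    static Decision decide(Status current, Long checkTimeValue, long desiredExecuteTime) {
        switch (current) {
            case COMPLETE:
            case PURGATORY:
                // As in the quoted method: move to PENDINGPURGATORY and set the time.
                return new Decision(Status.PENDINGPURGATORY,
                    desiredExecuteTime == -1L ? 0L : desiredExecuteTime);
            case PENDING:
                // As in the quoted method: only bump the schedule earlier.
                if (desiredExecuteTime == -1L)
                    return new Decision(current, null);
                if (checkTimeValue != null && checkTimeValue <= desiredExecuteTime)
                    return new Decision(current, null);
                return new Decision(current, desiredExecuteTime);
            case PENDINGPURGATORY:
                // ASSUMPTION (hypothetical variation): reschedule only when
                // checkTimeValue is null; otherwise keep the existing behavior
                // of leaving the record alone.
                if (checkTimeValue == null && desiredExecuteTime != -1L)
                    return new Decision(current, desiredExecuteTime);
                return new Decision(current, null);
            default:
                // Active states: a thread may be working on the document.
                return new Decision(current, null);
        }
    }
}
```

With this variation, a PENDINGPURGATORY record whose checkTimeValue is null
picks up the desired execute time, while one that already has a check time
(for example from adaptive crawling or a retry) is left untouched.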
> Re-index seeded modified documents when the re-crawl interval is infinity and
> connector model is MODEL_ADD_CHANGE
> -------------------------------------------------------------------------------------------------------------------
>
> Key: CONNECTORS-1497
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1497
> Project: ManifoldCF
> Issue Type: Improvement
> Components: Framework agents process
> Affects Versions: ManifoldCF 2.9.1
> Reporter: Ahmed Mahfouz
> Assignee: Karl Wright
> Priority: Major
> Attachments: CONNECTORS-1497.patch
>
>
> I am trying to avoid a full scan of all documents, for better efficiency
> with a large number of documents. I tried many different settings for the
> jobs but couldn't accomplish that. In particular, when the repository
> connector model is MODEL_ADD_CHANGE, I expected modified seeded documents to
> be re-indexed immediately, like new seeds, but I found that the re-crawl
> time is used as the scheduled time, so they wait for the full scan before
> being re-indexed. I avoided the full scan by setting the re-crawl interval
> to infinity, but my modified seed documents were still not getting indexed.
> After digging into the code for quite some time, I made some modifications
> to the JobManager and it worked for me. I would like to share the change
> with you for review, so I opened this ticket.