[jira] [Commented] (CONNECTORS-1497) Re-index seeded modified documents when the re-crawl interval is infinity and connector model is MODEL_ADD_CHANGE

Ahmed Mahfouz (JIRA) Mon, 26 Feb 2018 10:00:46 -0800

    [ 
https://issues.apache.org/jira/browse/CONNECTORS-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16377280#comment-16377280
 ]


Ahmed Mahfouz commented on CONNECTORS-1497:
-------------------------------------------

[[email protected]] I thought of that, but I didn't want to change how 
ManifoldCF works for continuous jobs. I only wanted to limit the change to jobs 
with an infinite re-crawl interval (checkTimeValue is null), so that modified 
document seeds can be re-indexed right away. Even with the override schedule set 
to true, the PENDINGPURGATORY status is still a hurdle to modifying the 
execution time.
{code:java}
/** Update an existing record (as the result of an initial add).
* The record is presumed to exist and have been locked, via "FOR UPDATE".
*/
public void updateExistingRecordInitial(Long recordID, int currentStatus, Long checkTimeValue,
long desiredExecuteTime, IPriorityCalculator desiredPriority, String[] prereqEvents,
String processID)
throws ManifoldCFException
{
// The general rule here is:
// If doesn't exist, make a PENDING entry.
// If PENDING, keep it as PENDING. 
// If COMPLETE, make a PENDING entry.
// If PURGATORY, make a PENDINGPURGATORY entry.
// Leave everything else alone and do nothing.

HashMap map = new HashMap();
switch (currentStatus)
{
case STATUS_ACTIVE:
case STATUS_ACTIVEPURGATORY:
case STATUS_ACTIVENEEDRESCAN:
case STATUS_ACTIVENEEDRESCANPURGATORY:
case STATUS_BEINGCLEANED:
// These are all the active states. Being in this state implies that a thread may be working on the document. We
// must not interrupt it.
// Initial adds never bring along any carrydown info, so we should be satisfied as long as the record exists.
break;

case STATUS_COMPLETE:
case STATUS_UNCHANGED:
case STATUS_PURGATORY:
// Set the status and time both
map.put(statusField,statusToString(STATUS_PENDINGPURGATORY));
TrackerClass.noteRecordChange(recordID, STATUS_PENDINGPURGATORY, "Update existing record initial");
if (desiredExecuteTime == -1L)
map.put(checkTimeField,new Long(0L));
else
map.put(checkTimeField,new Long(desiredExecuteTime));
map.put(checkActionField,actionToString(ACTION_RESCAN));
map.put(failTimeField,null);
map.put(failCountField,null);
// Update the doc priority.
map.put(docPriorityField,new Double(desiredPriority.getDocumentPriority()));
map.put(needPriorityField,needPriorityToString(NEEDPRIORITY_FALSE));
break;

case STATUS_PENDING:
// Bump up the schedule if called for
Long cv = checkTimeValue;
if (cv != null)
{
long currentExecuteTime = cv.longValue();
if (desiredExecuteTime == -1L || currentExecuteTime <= desiredExecuteTime)
{
break;
}
}
else
{
if (desiredExecuteTime == -1L)
{
break;
}
}
map.put(checkTimeField,new Long(desiredExecuteTime));
map.put(checkActionField,actionToString(ACTION_RESCAN));
map.put(failTimeField,null);
map.put(failCountField,null);
// The existing doc priority field should be preserved.
break;

case STATUS_PENDINGPURGATORY:
// In this case we presume that the reason we are in this state is due to adaptive crawling or retry, so DON'T bump up the schedule!
// The existing doc priority field should also be preserved.
break;

default:
break;

}
map.put(isSeedField,seedstatusToString(SEEDSTATUS_NEWSEED));
map.put(seedingProcessIDField,processID);
// Delete any existing prereqevent entries first
prereqEventManager.deleteRows(recordID);
ArrayList list = new ArrayList();
String query = buildConjunctionClause(list,new ClauseDescription[]{
new UnitaryClause(idField,recordID)});
performUpdate(map,"WHERE "+query,list,null);
// Insert prereqevent entries, if any
prereqEventManager.addRows(recordID,prereqEvents);
noteModifications(0,1,0);
}
{code}
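
To make the branches above easier to compare, here is a minimal sketch of the scheduling decision only. It is not ManifoldCF code: the class InitialAddSketch, the Status enum and the method rewritesCheckTime are hypothetical names used for illustration, and the sketch ignores the database update, priorities and prerequisite events. It only answers whether the quoted method would write a new check time for a given status.

```java
class InitialAddSketch {
    // Hypothetical stand-ins for the STATUS_* constants in the quoted method.
    enum Status { ACTIVE, COMPLETE, UNCHANGED, PURGATORY, PENDING, PENDINGPURGATORY }

    /** Returns true when the quoted method would write a new check time. */
    static boolean rewritesCheckTime(Status current, Long checkTimeValue, long desiredExecuteTime) {
        switch (current) {
            case COMPLETE:
            case UNCHANGED:
            case PURGATORY:
                // Status becomes PENDINGPURGATORY and the check time is always set
                // (to 0 when desiredExecuteTime is -1, else to desiredExecuteTime).
                return true;
            case PENDING:
                if (desiredExecuteTime == -1L) {
                    return false; // nothing to bump to
                }
                // Bump only when there is no check time yet, or the current one is later.
                return checkTimeValue == null || checkTimeValue.longValue() > desiredExecuteTime;
            default:
                // Active states and PENDINGPURGATORY are left alone.
                return false;
        }
    }

    public static void main(String[] args) {
        long now = 1000L;
        System.out.println(rewritesCheckTime(Status.COMPLETE, null, now));          // true
        System.out.println(rewritesCheckTime(Status.PENDING, null, now));           // true
        System.out.println(rewritesCheckTime(Status.PENDINGPURGATORY, null, now));  // false
    }
}
```

The last case is the hurdle described above: for a record already in PENDINGPURGATORY, the method never touches the check time, whatever desiredExecuteTime is passed in.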

> Re-index seeded modified documents when the re-crawl interval is infinity and 
>   connector model is MODEL_ADD_CHANGE
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1497
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1497
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Framework agents process
>    Affects Versions: ManifoldCF 2.9.1
>            Reporter: Ahmed Mahfouz
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: CONNECTORS-1497.patch
>
>
> I am trying to avoid a full scan of all documents, for better efficiency with 
> a large number of documents. I tried many different settings for the jobs but 
> couldn't accomplish that. In particular, when the repository connector model 
> is MODEL_ADD_CHANGE, I expected modified seeded documents to be re-indexed 
> immediately, like new seeds. Instead, I found that the re-crawl time is used 
> as the scheduled time, so a modified document waits for the full scan before 
> it is re-indexed. I avoided the full scan by setting the re-crawl interval to 
> infinity, but my modified document seeds were still not getting indexed. 
> After digging into the code for quite some time, I made some modifications to 
> the JobManager and it worked for me. I would like to share the change with 
> you for review, so I opened this ticket.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
