[
https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900288#comment-13900288
]
Karl Wright commented on CONNECTORS-850:
----------------------------------------
Here's the algorithm that MCF uses to calculate when to refetch a document in
dynamic crawling.
First, for each document it keeps track, over the document's whole history, of
the time it was first fetched, the time it was last fetched, and the number of
changes that took place in between. From these it estimates the average time
between changes. When you change the document, of course, this estimate is
affected, but only weakly if the document had a long period of stability. (If
you want to make this history go away, you can click the "reindex all
documents" link on the output connection's view page. That causes MCF to
forget everything about what's been indexed before.)
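To illustrate why a long stable period dampens the effect of a new change,
here is a minimal Java sketch. It is not MCF's actual code: the class and
method names are hypothetical, and the formula (observed span divided by the
number of changes, with zero changes treated as one) is an assumption made
only for illustration.

```java
final class ChangeIntervalSketch {
    /**
     * Hypothetical helper, NOT taken from MCF: estimates the average time
     * between changes from the first fetch time, the last fetch time, and
     * the number of changes seen in between (all times in milliseconds).
     */
    static long estimateAverageChangeInterval(long firstFetchTime,
                                              long lastFetchTime,
                                              int changeCount) {
        long observedSpan = lastFetchTime - firstFetchTime;
        // Assumption: with no observed changes, treat the whole span as a
        // single interval to avoid dividing by zero.
        int divisor = Math.max(changeCount, 1);
        return observedSpan / divisor;
    }

    public static void main(String[] args) {
        long day = 24L * 60L * 60L * 1000L;
        // One change seen over 100 days: the estimate is 100 days.
        System.out.println(estimateAverageChangeInterval(0L, 100L * day, 1) / day);
        // A second change one day later: the estimate is still about 50 days.
        System.out.println(estimateAverageChangeInterval(0L, 101L * day, 2) / day);
    }
}
```

Under this assumed formula, a single extra change after 100 stable days only
halves the estimate, so the document is still treated as slow-changing.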
The actual time determined for the next fetch is calculated here:
{code}
public Long calculateDocumentRescheduleTime(long currentTime, long timeAmt,
  String localIdentifier)
{
  Long recrawlTime = null;
  Long recrawlInterval = job.getInterval();
  if (recrawlInterval != null)
  {
    Long maxInterval = job.getMaxInterval();
    long actualInterval = recrawlInterval.longValue() + timeAmt;
    if (maxInterval != null && actualInterval > maxInterval.longValue())
      actualInterval = maxInterval.longValue();
    recrawlTime = new Long(currentTime + actualInterval);
  }
  if (Logging.scheduling.isDebugEnabled())
    Logging.scheduling.debug("Default rescan time for document '"+localIdentifier+"' is "+
      ((recrawlTime==null)?"NEVER":recrawlTime.toString()));
  Long lowerBound = getDocumentRescheduleLowerBoundTime(localIdentifier);
  if (lowerBound != null)
  {
    if (recrawlTime == null || recrawlTime.longValue() < lowerBound.longValue())
    {
      recrawlTime = lowerBound;
      if (Logging.scheduling.isDebugEnabled())
        Logging.scheduling.debug(" Rescan time overridden for document '"+localIdentifier+
          "' due to lower bound; new value is "+recrawlTime.toString());
    }
  }
  Long upperBound = getDocumentRescheduleUpperBoundTime(localIdentifier);
  if (upperBound != null)
  {
    if (recrawlTime == null || recrawlTime.longValue() > upperBound.longValue())
    {
      recrawlTime = upperBound;
      if (Logging.scheduling.isDebugEnabled())
        Logging.scheduling.debug(" Rescan time overridden for document '"+localIdentifier+
          "' due to upper bound; new value is "+recrawlTime.toString());
    }
  }
  return recrawlTime;
}
{code}
As you can see, both the average interval between changes (timeAmt) and the
time bounds set by the connector go into the calculation. The minimum recrawl
interval (job.getInterval()) and the maximum recrawl interval
(job.getMaxInterval()) are also important. The key part of the calculation is
as follows:
{code}
Long maxInterval = job.getMaxInterval();
long actualInterval = recrawlInterval.longValue() + timeAmt;
if (maxInterval != null && actualInterval > maxInterval.longValue())
  actualInterval = maxInterval.longValue();
recrawlTime = new Long(currentTime + actualInterval);
{code}
The actual interval chosen is the job's minimum recrawl interval, plus the
average time between changes for the document, capped by the job's maximum
recrawl interval.
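For a feel of the numbers, here is a standalone Java sketch of the same
calculation. It is an illustration, not the MCF source: the class name and the
parameters minInterval, maxInterval, lowerBound and upperBound are hypothetical
stand-ins for job.getInterval(), job.getMaxInterval() and the connector's
bounds, passed in directly so that the example runs on its own.

```java
final class RescheduleSketch {
    /**
     * Illustrative stand-in for the method quoted above. All values are in
     * milliseconds. A null minInterval means "never recrawl" unless a
     * connector bound supplies a time.
     */
    static Long rescheduleTime(long currentTime, long timeAmt,
                               Long minInterval, Long maxInterval,
                               Long lowerBound, Long upperBound) {
        Long recrawlTime = null;
        if (minInterval != null) {
            // Minimum interval plus average time between changes...
            long actualInterval = minInterval + timeAmt;
            // ...capped by the maximum interval, if one is set.
            if (maxInterval != null && actualInterval > maxInterval) {
                actualInterval = maxInterval;
            }
            recrawlTime = currentTime + actualInterval;
        }
        // Connector bounds override the computed time when it falls outside.
        if (lowerBound != null && (recrawlTime == null || recrawlTime < lowerBound)) {
            recrawlTime = lowerBound;
        }
        if (upperBound != null && (recrawlTime == null || recrawlTime > upperBound)) {
            recrawlTime = upperBound;
        }
        return recrawlTime;
    }

    public static void main(String[] args) {
        long hour = 60L * 60L * 1000L;
        // 1h minimum + 2h average change interval, 24h maximum: 3 hours.
        System.out.println(rescheduleTime(0L, 2L * hour, hour, 24L * hour, null, null) / hour);
        // 1h minimum + 50h average change interval, 24h maximum: capped at 24 hours.
        System.out.println(rescheduleTime(0L, 50L * hour, hour, 24L * hour, null, null) / hour);
        // Same as above, but a connector upper bound of 12h wins: 12 hours.
        System.out.println(rescheduleTime(0L, 50L * hour, hour, 24L * hour, null, 12L * hour) / hour);
    }
}
```

The second call shows the effect of the maximum interval: however stable the
document has been, the next fetch is scheduled no later than 24 hours out.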
Hope that clarifies things.
> Maximum interval in dynamic crawling
> ------------------------------------
>
> Key: CONNECTORS-850
> URL: https://issues.apache.org/jira/browse/CONNECTORS-850
> Project: ManifoldCF
> Issue Type: New Feature
> Components: Framework crawler agent
> Affects Versions: ManifoldCF 1.4.1
> Reporter: Florian Schmedding
> Assignee: Karl Wright
> Priority: Minor
> Labels: features
> Fix For: ManifoldCF 1.5
>
>
> Currently, the dynamic crawling method used for a continuous job extends the
> reseed and recrawl intervals when no changes are found in a checked document.
> However, it should be possible to restrict this extension to a maximum value
> in order to make sure that new documents are discovered within a certain
> interval.