igiguere opened a new pull request, #880:
URL: https://github.com/apache/nutch/pull/880

   ### Ticket
   https://issues.apache.org/jira/browse/NUTCH-1564
   
   ### Description
   For a full description of the issue, please refer to the ASF Jira ticket.
   
   ### Solution
   If the `offset` calculated from the `delta` (the difference between the last 
fetch time and the last modification time) and `sync_delta_rate` is larger than 
`max_interval`, the `offset` is recalculated proportionally to 
`max_interval`.
   This ensures that when the `interval` (most likely the `max_interval`) is 
added to the `refTime`, the resulting new `fetchTime` is not in the past, 
which would trigger an immediate re-fetch.
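
   To make the arithmetic concrete, here is a minimal sketch of the idea, not the code in this pull request. The class and method names are hypothetical, all times share one arbitrary unit, and "proportional to `max_interval`" is read here as applying `sync_delta_rate` to `max_interval`, which is an assumption. It also assumes `sync_delta_rate` is at most 1.

```java
// Hypothetical sketch; names and the exact capping rule are assumptions,
// not taken from the patch.
final class FetchOffsetSketch {

    /** Offset subtracted from fetchTime to obtain refTime, capped when too large. */
    public static long cappedOffset(long delta, double syncDeltaRate, long maxInterval) {
        long offset = Math.round(delta * syncDeltaRate);
        if (offset > maxInterval) {
            // Assumed reading of "proportional to max_interval":
            // apply the same rate to max_interval instead of delta.
            offset = Math.round(maxInterval * syncDeltaRate);
        }
        return offset;
    }

    /** Next fetch time = refTime + interval, with interval clamped to [min, max]. */
    public static long nextFetchTime(long fetchTime, long modifiedTime, long interval,
                                     double syncDeltaRate, long minInterval, long maxInterval) {
        long delta = fetchTime - modifiedTime;
        // Raise the interval to delta if delta is larger, then clamp it.
        long clamped = Math.max(minInterval, Math.min(Math.max(interval, delta), maxInterval));
        long refTime = fetchTime - cappedOffset(delta, syncDeltaRate, maxInterval);
        return refTime + clamped;
    }

    public static void main(String[] args) {
        // delta = 1000, rate = 0.3: the raw offset of 300 exceeds max_interval = 100.
        System.out.println(nextFetchTime(1000, 0, 10, 0.3, 1, 100));
    }
}
```

   With these numbers the uncapped offset would give 1000 - 300 + 100 = 800, which lies before the fetch time of 1000. The capped offset gives 1000 - 30 + 100 = 1070.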
   
   Note that I also played with some "brute force" ideas:
   - if `offset` > `max_interval`, then set `refTime` to current `fetchTime`
   - if `offset` > `max_interval`, then reset `offset` to `offset` - 
`max_interval` (e.g. 9-7=2), then calculate `refTime` from that as before 
(equivalent to `fetchTime` - 2 in the example)
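
   For comparison, the two "brute force" variants above can be sketched as follows. The class and method names are hypothetical and the time unit is arbitrary; this only illustrates the arithmetic described in the list.

```java
// Hypothetical sketch of the two rejected variants; names are illustrative only.
final class BruteForceRefTimeSketch {

    /** Variant 1: drop the offset entirely once it exceeds maxInterval. */
    public static long refTimeReset(long fetchTime, long offset, long maxInterval) {
        return offset > maxInterval ? fetchTime : fetchTime - offset;
    }

    /** Variant 2: subtract maxInterval from an oversized offset, then apply it. */
    public static long refTimeSubtract(long fetchTime, long offset, long maxInterval) {
        long effective = offset > maxInterval ? offset - maxInterval : offset;
        return fetchTime - effective;
    }
}
```

   With `offset` = 9 and `max_interval` = 7, the first variant leaves `refTime` at `fetchTime`, and the second yields `fetchTime` - 2.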
   
   The suggested approach allows a smooth-ish selection of the next fetch time, 
relative to the gap between fetch time and last modification time.
   
   ### Tests
   Unit tests added, illustrating a few situations based on the description of 
NUTCH-1564.
   
   Functional tests to be done on a long-running installation... which I don't 
have.
   
   * Nutch is successfully built and unit tests pass by running `ant clean 
runtime test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* master branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled master branch.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
