[
https://issues.apache.org/jira/browse/NUTCH-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18050575#comment-18050575
]
ASF GitHub Bot commented on NUTCH-1564:
---------------------------------------
sebastian-nagel commented on code in PR #880:
URL: https://github.com/apache/nutch/pull/880#discussion_r2671659078
##########
src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java:
##########
@@ -389,7 +402,8 @@ public static void main(String[] args) throws Exception {
(p.getFetchInterval() / SECONDS_PER_DAY), miss);
if (p.getFetchTime() <= curTime) {
fetchCnt++;
- fs.setFetchSchedule(new Text("http://www.example.com"), p, p
+ // why was "http://www.example.com" hard-coded here?
Review Comment:
Likely because the API requires a URL, although it is not relevant
here. It's ok to use an empty string here. But the comment should also be
removed.
##########
src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java:
##########
@@ -332,17 +333,29 @@ public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
case FetchSchedule.STATUS_UNKNOWN:
break;
}
- if (SYNC_DELTA) {
- // try to synchronize with the time of change
- long delta = (fetchTime - modifiedTime) / 1000L;
- if (delta > interval)
- interval = delta;
- refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE * 1000);
- }
// Ensure the interval does not fall outside of bounds
float minInterval = (getCustomMinInterval(url) != null) ?
getCustomMinInterval(url) : MIN_INTERVAL;
float maxInterval = (getCustomMaxInterval(url) != null) ?
getCustomMaxInterval(url) : MAX_INTERVAL;
+
+ if (SYNC_DELTA) {
+ // try to synchronize with the time of change
+ long delta = (fetchTime - modifiedTime);
+ if (delta > (interval * 1000))
+ interval = delta / 1000L;
+ // offset: a fraction (sync_delta_rate) of the difference between the last modification time, and the last fetch time.
+ long offset = Math.round(delta * SYNC_DELTA_RATE);
+ long maxIntervalMillis = (long) maxInterval * 1000L;
+ LOG.trace("delta (days): " + Duration.ofMillis(delta).toDays()
Review Comment:
Especially for debug and trace logs, parameterized logging is recommended.
See the [slf4j FAQ about
performance](https://www.slf4j.org/faq.html#logging_performance).
However, because three Duration objects are created, it's also ok to
put the log call inside an `if (LOG.isTraceEnabled())` condition.
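
A minimal sketch of the guard pattern, for illustration only. To stay
self-contained it uses `java.util.logging` as a stand-in for slf4j, so
`isLoggable(Level.FINEST)` plays the role of `LOG.isTraceEnabled()`. The
class name, the `formatted` counter and the message text are hypothetical
and are not taken from the PR. With slf4j the parameterized form would be
`LOG.trace("delta (days): {}", value)`, but the argument expressions (here
the Duration objects) are still evaluated unless the call is guarded.

```java
import java.time.Duration;
import java.util.logging.Level;
import java.util.logging.Logger;

final class GuardedTraceLog {
    // java.util.logging stand-in; Nutch itself logs through slf4j.
    static final Logger LOG = Logger.getLogger(GuardedTraceLog.class.getName());

    // Hypothetical counter: how often the message was actually built.
    static int formatted = 0;

    // Builds the message; this is the work the guard is meant to avoid.
    static String describe(long deltaMillis, long offsetMillis) {
        formatted++;
        return "delta (days): " + Duration.ofMillis(deltaMillis).toDays()
            + ", offset (days): " + Duration.ofMillis(offsetMillis).toDays();
    }

    // Creates the Duration objects and the string only if FINEST is enabled.
    static void trace(long deltaMillis, long offsetMillis) {
        if (LOG.isLoggable(Level.FINEST)) {
            LOG.finest(describe(deltaMillis, offsetMillis));
        }
    }
}
```

With the level at INFO, calling `GuardedTraceLog.trace(...)` never enters
`describe`, so nothing is allocated for a message that would be dropped.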
##########
src/java/org/apache/nutch/fetcher/FetcherThread.java:
##########
@@ -389,7 +389,7 @@ public void run() {
}
continue;
}
- if (!rules.isAllowed(fit.u)) {
Review Comment:
How is this change related to NUTCH-1564?
It reverts a change done in PR #874 / NUTCH-3136. Possibly a rebase issue?
> AdaptiveFetchSchedule: sync_delta forces immediate refetch for documents not
> modified
> -------------------------------------------------------------------------------------
>
> Key: NUTCH-1564
> URL: https://issues.apache.org/jira/browse/NUTCH-1564
> Project: Nutch
> Issue Type: Bug
> Components: crawldb
> Affects Versions: 1.6, 2.1
> Reporter: Sebastian Nagel
> Assignee: Isabelle Giguere
> Priority: Critical
> Fix For: 1.22
>
>
> In a continuous crawl with adaptive fetch scheduling, documents not modified
> for a long time may be fetched in every cycle.
> A continuous crawl is run daily with 3 cycles and the following scheduling
> intervals (freshness matters):
> {code}
> db.fetch.schedule.class = org.apache.nutch.crawl.AdaptiveFetchSchedule
> db.fetch.schedule.adaptive.sync_delta = true (default)
> db.fetch.schedule.adaptive.sync_delta_rate = 0.3 (default)
> db.fetch.interval.default = 172800 (2 days)
> db.fetch.schedule.adaptive.min_interval = 86400 (1 day)
> db.fetch.schedule.adaptive.max_interval = 604800 (7 days)
> db.fetch.interval.max = 604800 (7 days)
> {code}
> On Apr 18 a URL is generated and fetched (from segment dump):
> {code}
> Crawl Generate::
> Status: 2 (db_fetched)
> Fetch time: Mon Apr 15 19:43:22 CEST 2013
> Modified time: Tue Mar 19 01:07:42 CET 2013
> Retries since fetch: 0
> Retry interval: 604800 seconds (7 days)
> Crawl Fetch::
> Status: 33 (fetch_success)
> Fetch time: Thu Apr 18 01:23:51 CEST 2013
> Modified time: Tue Mar 19 01:07:42 CET 2013
> Retries since fetch: 0
> Retry interval: 604800 seconds (7 days)
> {code}
> Running CrawlDb update results in a next fetch time in the past (which forces
> an immediate refetch in the next cycle):
> {code}
> Status: 6 (db_notmodified)
> Fetch time: Tue Apr 16 01:37:00 CEST 2013
> Modified time: Tue Mar 19 01:07:42 CET 2013
> Retries since fetch: 0
> Retry interval: 604800 seconds (7 days)
> {code}
> This behavior is caused by the sync_delta calculation in
> AdaptiveFetchSchedule:
> {code}
> if (SYNC_DELTA) {
> // try to synchronize with the time of change
> long delta = (fetchTime - modifiedTime) / 1000L;
> if (delta > interval) interval = delta;
> refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE * 1000);
> }
> if (interval < MIN_INTERVAL) {
> interval = MIN_INTERVAL;
> } else if (interval > MAX_INTERVAL) {
> interval = MAX_INTERVAL;
> }
> ...
> datum.setFetchTime(refTime + Math.round(interval * 1000.0));
> {code}
> {{delta}} is 30 days (Apr 18 - Mar 19). {{refTime}} is then 9 days in the
> past ({{delta}} * 0.3). After adding {{interval}} (adjusted to
> {{MAX_INTERVAL}} = 7 days) to {{refTime}} the next fetch "should" take place
> 2 days in the past (Apr 16).
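
The arithmetic above can be replayed with the timestamps from the segment
dump. This is a sketch under stated assumptions, not the Nutch source: the
class and method names are hypothetical, CET/CEST are taken as UTC+1/UTC+2,
`SYNC_DELTA_RATE` is a double, and the adaptive increase/decrease of the
interval that precedes the sync_delta block is left out (the interval ends
up clamped to the maximum either way).

```java
import java.time.Duration;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

final class SyncDeltaArithmetic {
    static final double SYNC_DELTA_RATE = 0.3; // sync_delta_rate from the issue
    static final long MIN_INTERVAL = 86400L;   // seconds (1 day)
    static final long MAX_INTERVAL = 604800L;  // seconds (7 days)

    // Timestamps from the segment dump, as epoch milliseconds.
    static final long MODIFIED = OffsetDateTime
        .of(2013, 3, 19, 1, 7, 42, 0, ZoneOffset.ofHours(1)).toInstant().toEpochMilli();
    static final long FETCHED = OffsetDateTime
        .of(2013, 4, 18, 1, 23, 51, 0, ZoneOffset.ofHours(2)).toInstant().toEpochMilli();

    // Mirrors the quoted pre-fix calculation; times in ms, interval in seconds.
    static long nextFetchTime(long fetchTime, long modifiedTime, long interval) {
        long delta = (fetchTime - modifiedTime) / 1000L;
        if (delta > interval) interval = delta;
        long refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE * 1000);
        if (interval < MIN_INTERVAL) {
            interval = MIN_INTERVAL;
        } else if (interval > MAX_INTERVAL) {
            interval = MAX_INTERVAL;
        }
        return refTime + interval * 1000L;
    }

    public static void main(String[] args) {
        long next = nextFetchTime(FETCHED, MODIFIED, 604800L);
        System.out.println("delta: " + Duration.ofMillis(FETCHED - MODIFIED));
        System.out.println("next fetch lies " + Duration.ofMillis(FETCHED - next)
            + " before the fetch that was just done");
    }
}
```

The computed next fetch time falls on Apr 16, before the fetch of Apr 18,
which agrees with the CrawlDb record shown earlier.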
> According to the
> [javadoc|http://nutch.apache.org/apidocs-2.1/org/apache/nutch/crawl/AdaptiveFetchSchedule.html]
> (if understood correctly), the sync_delta has two aims when we know that
> a document hasn't been modified for a long time:
> * increase the fetch interval immediately (not step by step)
> * because we expect the document to be changed within the adaptive interval
> (but it hasn't), we shift the "reference time", i.e. we expect a change soon.
> These two aims are somewhat contradictory. In any case, the next fetch time
> should always be within the range of (currentFetchTime + MIN_INTERVAL) and
> (currentFetchTime + MAX_INTERVAL) and never in the past.
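
The bound stated above could be expressed as a simple clamp. This is only a
sketch of the invariant, not the change made in the PR; the class and
method names are hypothetical, times are epoch milliseconds and intervals
are seconds.

```java
final class FetchTimeBounds {
    // Clamps a proposed next fetch time into
    // [fetchTime + minInterval, fetchTime + maxInterval].
    static long clampNextFetchTime(long fetchTime, long proposed,
                                   long minIntervalSec, long maxIntervalSec) {
        long earliest = fetchTime + minIntervalSec * 1000L;
        long latest = fetchTime + maxIntervalSec * 1000L;
        return Math.max(earliest, Math.min(latest, proposed));
    }
}
```

A proposed time two days before the current fetch would be moved to
currentFetchTime + MIN_INTERVAL, so it can never land in the past.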
> This problem has been noted by [~pascaldimassimo] in
> [1|http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/] and
> [2|http://lucene.472066.n3.nabble.com/Adaptive-sync-with-the-time-of-page-change-td870842.html#a897234].
--
This message was sent by Atlassian Jira
(v8.20.10#820010)