[ 
https://issues.apache.org/jira/browse/NUTCH-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559453#comment-16559453
 ] 

Sebastian Nagel commented on NUTCH-2623:
----------------------------------------

In prior versions of Nutch, delays were handled at the protocol level, that 
is, separately by the protocol plugins for http and https. That is probably 
why the later implementation in Fetcher includes the protocol in the queue 
ID. Does it still make sense to keep this today, when http and https requests 
are likely to be processed by the same server?
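
To illustrate the question, here is a minimal sketch of the two ways to form 
a queue ID. It is not the actual FetchItem code: the class QueueIdSketch and 
both method names are hypothetical, and only the by-host case is modeled 
(no domain or IP lookup).

```java
import java.net.URI;
import java.util.Locale;

public class QueueIdSketch {

    // Queue ID that includes the protocol, as described in the issue:
    // http and https URLs of the same host land in different queues.
    static String queueIdWithProtocol(String url) {
        URI u = URI.create(url);
        return u.getScheme().toLowerCase(Locale.ROOT) + "://"
            + u.getHost().toLowerCase(Locale.ROOT);
    }

    // Hypothetical alternative: the host alone is the queue ID, so both
    // protocols share one queue and therefore one crawl delay.
    static String queueIdHostOnly(String url) {
        return URI.create(url).getHost().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String http = "http://nutch.apache.org/";
        String https = "https://nutch.apache.org/";
        // Prints "false": two separate queues, delay not shared.
        System.out.println(
            queueIdWithProtocol(http).equals(queueIdWithProtocol(https)));
        // Prints "true": one queue, delay applies across protocols.
        System.out.println(
            queueIdHostOnly(http).equals(queueIdHostOnly(https)));
    }
}
```

With the host-only ID, the two fetches from the log in the issue description 
would fall into the same queue and be spaced by the configured delay.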

> Fetcher to guarantee delay for same host/domain/ip independent of http/https 
> protocol
> -------------------------------------------------------------------------------------
>
>                 Key: NUTCH-2623
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2623
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.14
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.16
>
>
>  Fetcher uses a combination of protocol and host/domain/ip as ID for fetch 
> item queues, see 
> [FetchItem.java|https://github.com/apache/nutch/blob/2b93a66/src/java/org/apache/nutch/fetcher/FetchItem.java#L101].
>  As a result, the delay is not guaranteed when both http:// and https:// URLs 
> are fetched from the same host/domain/ip, e.g. here with a large delay of 30 sec.:
> {noformat}
> 2018-07-23 14:54:39,834 INFO fetcher.FetcherThread - FetcherThread 24 
> fetching http://nutch.apache.org/ (queue crawl delay=30000ms)
> 2018-07-23 14:54:39,846 INFO fetcher.FetcherThread - FetcherThread 23 
> fetching https://nutch.apache.org/ (queue crawl delay=30000ms)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
