[
https://issues.apache.org/jira/browse/NUTCH-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16633422#comment-16633422
]
ASF GitHub Bot commented on NUTCH-2623:
---------------------------------------
sebastian-nagel commented on issue #369: NUTCH-2623 Fetcher to guarantee delay
for same host/domain/ip independent of http/https protocol
URL: https://github.com/apache/nutch/pull/369#issuecomment-425733714
Updated and removed legacy mode `byHostProtocol` as discussed in NUTCH-2623.
> Fetcher to guarantee delay for same host/domain/ip independent of http/https
> protocol
> -------------------------------------------------------------------------------------
>
> Key: NUTCH-2623
> URL: https://issues.apache.org/jira/browse/NUTCH-2623
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 1.14
> Reporter: Sebastian Nagel
> Priority: Minor
> Fix For: 1.16
>
>
> Fetcher uses a combination of protocol and host/domain/ip as ID for fetch
> item queues, see
> [FetchItem.java|https://github.com/apache/nutch/blob/2b93a66/src/java/org/apache/nutch/fetcher/FetchItem.java#L101].
> As a result, the configured delay is not guaranteed when both http:// and
> https:// URLs are fetched from the same host/domain/ip: each protocol gets
> its own queue. Example with a large delay of 30 sec.:
> {noformat}
> 2018-07-23 14:54:39,834 INFO fetcher.FetcherThread - FetcherThread 24 fetching http://nutch.apache.org/ (queue crawl delay=30000ms)
> 2018-07-23 14:54:39,846 INFO fetcher.FetcherThread - FetcherThread 23 fetching https://nutch.apache.org/ (queue crawl delay=30000ms)
> {noformat}
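
The effect described in the issue can be illustrated with a minimal sketch.
It is not the code from FetchItem.java: the class and method names
(QueueIdSketch, idWithProtocol, idHostOnly) are hypothetical, and the ID
format "protocol://host" is only an assumption based on the description
above. The point is that a queue ID which includes the protocol yields two
separate queues for the two URLs in the log, so each queue applies its delay
independently, while a host-only ID maps both URLs to a single queue.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical illustration only; names and ID format are assumptions,
// not the actual Nutch implementation.
final class QueueIdSketch {

    // Queue ID that combines protocol and host (the behaviour the issue describes).
    static String idWithProtocol(String url) {
        URI u = URI.create(url);
        return u.getScheme().toLowerCase(Locale.ROOT) + "://"
            + u.getHost().toLowerCase(Locale.ROOT);
    }

    // Queue ID based on the host alone, independent of http/https.
    static String idHostOnly(String url) {
        return URI.create(url).getHost().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String http = "http://nutch.apache.org/";
        String https = "https://nutch.apache.org/";

        // With the protocol in the ID, the two URLs land in different queues.
        System.out.println(idWithProtocol(http).equals(idWithProtocol(https)));

        // With a host-only ID, both URLs share one queue and one delay.
        System.out.println(idHostOnly(http).equals(idHostOnly(https)));
    }
}
```

A host-only key is just one of the host/domain/ip variants mentioned in the
issue; the same reasoning applies to a domain or IP key as long as the
protocol is not part of it.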