[ 
https://issues.apache.org/jira/browse/NUTCH-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16435854#comment-16435854
 ] 

ASF GitHub Bot commented on NUTCH-2553:
---------------------------------------

sebastian-nagel opened a new pull request #317: NUTCH-2553 Fetcher not to 
modify URLs to be fetched
URL: https://github.com/apache/nutch/pull/317
 
 
   - fix bug in fetcher.QueueFeeder which caused the same key-value pair to be 
overwritten again and again (Hadoop object reuse)
   - simplify URL handling in FetcherThread: hold URLs exclusively in FetchItem
   - parametrize log messages and remove unused imports and variables
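
The object-reuse problem named in the first item can be illustrated without Hadoop. The sketch below is not the QueueFeeder code: `MutableUrl` is a hypothetical stand-in for a mutable Hadoop key or value object, and the class and method names are invented for illustration. It shows only the general pattern, assuming the reader hands out the same instance on every iteration: storing that instance leaves every queued entry pointing at the last record, while storing a copy preserves each record.

```java
import java.util.ArrayList;
import java.util.List;

public class ObjectReuseSketch {

    /** Hypothetical mutable holder, standing in for a reused Hadoop key/value object. */
    static final class MutableUrl {
        private String value;

        MutableUrl() {
        }

        /** Copy constructor: takes a snapshot of the current content. */
        MutableUrl(MutableUrl other) {
            this.value = other.value;
        }

        void set(String value) {
            this.value = value;
        }

        String get() {
            return value;
        }
    }

    /** Queues the reused instance itself: every entry ends up showing the last record. */
    static List<String> queueWithoutCopy(String[] records) {
        MutableUrl reused = new MutableUrl();
        List<MutableUrl> queue = new ArrayList<>();
        for (String record : records) {
            reused.set(record); // the "reader" overwrites the same object
            queue.add(reused);  // the queue stores a reference, not a snapshot
        }
        return contents(queue);
    }

    /** Queues a copy per record: each entry keeps its own content. */
    static List<String> queueWithCopy(String[] records) {
        MutableUrl reused = new MutableUrl();
        List<MutableUrl> queue = new ArrayList<>();
        for (String record : records) {
            reused.set(record);
            queue.add(new MutableUrl(reused));
        }
        return contents(queue);
    }

    private static List<String> contents(List<MutableUrl> queue) {
        List<String> result = new ArrayList<>();
        for (MutableUrl item : queue) {
            result.add(item.get());
        }
        return result;
    }

    public static void main(String[] args) {
        String[] records = {"http://a.example/", "http://b.example/"};
        System.out.println(queueWithoutCopy(records)); // [http://b.example/, http://b.example/]
        System.out.println(queueWithCopy(records));    // [http://a.example/, http://b.example/]
    }
}
```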

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fetcher not to modify URLs to be fetched
> ----------------------------------------
>
>                 Key: NUTCH-2553
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2553
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Priority: Critical
>             Fix For: 1.15
>
>
> Fetcher modifies the URLs being fetched (introduced with NUTCH-2375 in 
> [c93d908|https://github.com/apache/nutch/commit/c93d908bb635d3c5b59f8c8a22e0584ebf588794#diff-847479d08597eb30da1c715310438685R253]):
> {noformat}
> FetcherThread 22 fetching http://nutch.apache.org:-1/ (queue crawl delay=5000ms)
> {noformat}
> which makes it hard to trace the URLs in the log files and likely causes 
> other issues because URLs in CrawlDb and segments do not match 
> (http://nutch.apache.org/ in CrawlDb and http://nutch.apache.org:-1/ in 
> segment).
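
One plausible way a `:-1` port ends up in a URL string is rebuilding the URL from its components: `java.net.URL.getPort()` returns -1 when the URL has no explicit port. The sketch below is not the Nutch code from the commit referenced above. `PortSketch`, `rebuildNaively` and `rebuildSafely` are hypothetical names used only to show the effect and one way to avoid it.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

public class PortSketch {

    private static URL parse(String url) {
        try {
            return URI.create(url).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
    }

    /** Always appends the port, including the -1 that getPort() returns for "not set". */
    static String rebuildNaively(String url) {
        URL u = parse(url);
        return u.getProtocol() + "://" + u.getHost() + ":" + u.getPort() + u.getFile();
    }

    /** Appends the port only when the original URL carried an explicit one. */
    static String rebuildSafely(String url) {
        URL u = parse(url);
        String port = (u.getPort() == -1) ? "" : ":" + u.getPort();
        return u.getProtocol() + "://" + u.getHost() + port + u.getFile();
    }

    public static void main(String[] args) {
        String url = "http://nutch.apache.org/";
        System.out.println(rebuildNaively(url)); // http://nutch.apache.org:-1/
        System.out.println(rebuildSafely(url));  // http://nutch.apache.org/
    }
}
```

Keeping the original URL string untouched and using the parsed components only for lookups (host, protocol, port) avoids the mismatch between CrawlDb and segment entirely.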



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
