[ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105714#comment-13105714
 ] 

Edward Drapkin edited comment on NUTCH-1113 at 9/15/11 9:52 PM:
----------------------------------------------------------------

Upon further inspection, it appears that SegmentMerger assumes that all values 
from the newest segment are preferable. In this case, there is a crawl_fetch 
entry for this URL in two segments: the first segment marks it Status 33 
(fetch success) and the second marks it Status 67 (linked), so the status 67 
entry overwrites the status 33 entry.  From there, the URL data is (correctly) 
excluded from the index, because it is not marked as fetch success.
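
To illustrate the behaviour described above, here is a minimal sketch of a 
"newest segment wins" merge. It is not Nutch's SegmentMerger code; the function 
names and the dictionary layout are assumptions made for illustration, and only 
the two status codes (33 and 67) come from the observed segment data.

```python
# Illustrative only: NOT the actual SegmentMerger implementation.
FETCH_SUCCESS = 33  # status seen in the first segment
LINKED = 67         # status seen in the second segment


def merge_newest_wins(segments):
    """Merge crawl_fetch entries; later segments overwrite earlier ones."""
    merged = {}
    for segment in segments:  # segments are ordered oldest to newest
        merged.update(segment)
    return merged


def indexable(merged):
    """Return the URLs whose merged status is fetch success."""
    return sorted(u for u, status in merged.items() if status == FETCH_SUCCESS)


url = "http://www.wolfram.com/mathematica/"
first_segment = {url: FETCH_SUCCESS}
second_segment = {url: LINKED}

merged = merge_newest_wins([first_segment, second_segment])
print(merged[url])        # status that survives the merge
print(indexable(merged))  # URLs that would still reach the index
```

Under these assumptions the linked entry replaces the fetch success entry, so 
the URL no longer qualifies for indexing even though it was fetched.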

It appears that the URL gets added to later segments with status 67 because 
it's the target of an HTTP 301 redirect.  After the URL is fetched properly 
the first time, a link to http://www.wolfram.com/products/mathematica/ is 
found on another page; that link redirects to the URL in question, which 
creates the status 67 entry in the later segment.  Because 
http://www.wolfram.com/mathematica/ has already been fetched, it won't get 
added with the correct status (33) in future segments.

That is to say, it seems that all pages that:

A) Have already been fetched *AND*
B) Are set as the location of a redirect in subsequent iterations through the 
crawl process

will be "lost" after a segment merge.
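
The two conditions above can be checked against per-segment status data. The 
sketch below is hypothetical: the helper name, the dictionary layout and the 
example.com URL are assumptions made for illustration, not part of Nutch.

```python
# Hypothetical helper: flags URLs that meet conditions A and B.
FETCH_SUCCESS = 33
LINKED = 67


def lost_after_merge(segments):
    """Return URLs that were fetched successfully in some segment but whose
    newest crawl_fetch entry is only 'linked'."""
    fetched = set()
    newest = {}
    for segment in segments:  # segments are ordered oldest to newest
        for url, status in segment.items():
            if status == FETCH_SUCCESS:
                fetched.add(url)
            newest[url] = status
    return sorted(u for u in fetched if newest[u] == LINKED)


segments = [
    {"http://www.wolfram.com/mathematica/": FETCH_SUCCESS,
     "http://example.com/other": FETCH_SUCCESS},     # hypothetical URL
    {"http://www.wolfram.com/mathematica/": LINKED},  # redirect target
]
print(lost_after_merge(segments))
```

Running something like this over the unmerged segments might give a rough 
estimate of how many URLs a merge would drop.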

> Merging segments causes URLs to vanish from crawldb/index?
> ----------------------------------------------------------
>
>                 Key: NUTCH-1113
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1113
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.3
>            Reporter: Edward Drapkin
>             Fix For: 1.4
>
>         Attachments: merged_segment_output.txt, unmerged_segment_output.txt
>
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
> nutch generate crawldb/ segments/ -normalize
> nutch fetch `ls -d segments/* | tail -1`
> nutch parse `ls -d segments/* | tail -1`
> nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/ (I forward-ported the Lucene 
> indexing code from Nutch 1.1).
> When I crawl with merging segments, I lose about 20% of the URLs that wind up 
> in the index vs. when I crawl without merging the segments.  Somehow the 
> segment merger causes me to lose ~20% of my crawl database!

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
