[ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105758#comment-13105758
 ] 

Edward Drapkin commented on NUTCH-1113:
---------------------------------------

The more I look into this, the more I'm certain that the assumption made by the 
segment merger (lexicographically sorting the segments ascending by name and 
assuming that order is the order of priority) is invalid.  

I added a quick check to the segment merger (on line 419 of SegmentMerger.java) 
and changed:

if (lastFname.compareTo(sp.segmentName) < 0) {

to:

if (lastFname.compareTo(sp.segmentName) < 0 && !(val.getStatus() == 
CrawlDatum.STATUS_LINKED && lastF.getStatus() == 
CrawlDatum.STATUS_FETCH_SUCCESS)) {

And lo and behold, the problem no longer occurs //for this case//.  I'm 
hesitant to call this a fix for the bug, because I think this is just one 
case where reality differs from SegmentMerger's assumption.  I can't say 
conclusively whether there are more such cases, but I have a hunch that there 
are.  Polluting the code with incomprehensible logical checks like this isn't 
the solution, unless this is the only case.  That is to say, I think 
SegmentMerger needs to be fundamentally rethought, because this basic 
assumption seems to invalidate its usefulness.  I'd be happy to do the rewrite 
(since I think it's necessary, the onus is on me to do it), but I'm not sure 
I'm familiar enough with Nutch's segment format.  If someone who is familiar 
with Nutch's segment format (and preferably the SegmentMerger too) can get in 
touch with me and lend me their ear for questions, I'd be happy to do this.
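
To make the difference between the two conditions above concrete, here is a 
minimal standalone sketch.  It is not SegmentMerger code: the class name, the 
method names, the Status enum (a stand-in for the two CrawlDatum status 
constants) and the segment names are all made up for illustration, and it 
assumes only that segment names compare as plain strings.

```java
public class SegmentOrderSketch {
    // Hypothetical stand-in for CrawlDatum.STATUS_FETCH_SUCCESS and
    // CrawlDatum.STATUS_LINKED; these are not Nutch's real constants.
    enum Status { FETCH_SUCCESS, LINKED }

    // Original condition: the datum from the lexicographically later
    // segment name always replaces the one kept so far.
    static boolean originalRule(String lastFname, String segmentName) {
        return lastFname.compareTo(segmentName) < 0;
    }

    // Patched condition: same as above, except that a LINKED datum is not
    // allowed to replace a FETCH_SUCCESS datum that is already kept.
    static boolean patchedRule(String lastFname, Status lastStatus,
                               String segmentName, Status valStatus) {
        return lastFname.compareTo(segmentName) < 0
            && !(valStatus == Status.LINKED
                 && lastStatus == Status.FETCH_SUCCESS);
    }

    public static void main(String[] args) {
        // Made-up segment names; the second sorts after the first.
        String older = "20110915120000";
        String newer = "20110915130000";

        // Name ordering alone lets the newer segment's datum win: true
        System.out.println(originalRule(older, newer));

        // Newer datum is LINKED, kept datum is FETCH_SUCCESS: false
        System.out.println(
            patchedRule(older, Status.FETCH_SUCCESS, newer, Status.LINKED));

        // Any other status combination behaves as before: true
        System.out.println(
            patchedRule(older, Status.LINKED, newer, Status.FETCH_SUCCESS));
    }
}
```

The sketch only covers the one status pair I hit; it says nothing about 
whether other combinations need the same treatment.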

> Merging segments causes URLs to vanish from crawldb/index?
> ----------------------------------------------------------
>
>                 Key: NUTCH-1113
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1113
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.3
>            Reporter: Edward Drapkin
>             Fix For: 1.4
>
>         Attachments: merged_segment_output.txt, unmerged_segment_output.txt
>
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
> nutch generate crawldb/ segments/ -normalize
> nutch fetch `ls -d segments/* | tail -1`
> nutch parse `ls -d segments/* | tail -1`
> nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/ (I forward ported the lucene 
> indexing code from Nutch 1.1).
> When I crawl with merging segments, I lose about 20% of the URLs that wind up 
> in the index vs. when I crawl without merging the segments.  Somehow the 
> segment merger causes me to lose ~20% of my crawl database!

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
