[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870141#comment-13870141 ]

Sebastian Nagel commented on NUTCH-1113:
----------------------------------------

Hi [~markus17], you are right: my patch fails if the newest segment contains a 
linked datum (that is the case for both examples which still failed) and that 
datum is passed first to reduce(). My attempts to force this ordering in a unit 
test by shuffling the segments passed to merge() did not succeed; local mode 
seems to be immune to it.
What about keeping the linked datum only if there is no fetch datum at all:
{code}
          if (lastF == null) {
            lastF = val;
            lastFname = sp.segmentName;
          } else {
            // only consider fetch status
            // https://issues.apache.org/jira/browse/NUTCH-1520
            if (CrawlDatum.hasFetchStatus(val)) {
              // take newer but always overwrite LINKED datum
              if (lastFname.compareTo(sp.segmentName) < 0
                  || ! CrawlDatum.hasFetchStatus(lastF)) {
                lastF = val;
                lastFname = sp.segmentName;
              }
            }
          }
{code}
Skipping all linked datums would also be OK. We should just add a warning that 
segments should be used to update the CrawlDb before merging them; otherwise 
links may get lost (partially, that is already the case).
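To illustrate the selection rule in the snippet above: a minimal standalone sketch, where Datum, segmentName and hasFetchStatus are simplified stand-ins for the real org.apache.nutch.crawl.CrawlDatum API, not Nutch code. It shows that a fetch datum always wins over a linked datum regardless of which segment is newer or which value reduce() sees first:

```java
import java.util.List;

public class MergeRuleSketch {
    // Simplified stand-in for a CrawlDatum plus its segment name.
    record Datum(String segmentName, boolean hasFetchStatus) {}

    // Keep the fetch datum from the newest segment; keep a linked
    // datum only if no fetch datum is seen at all.
    static Datum select(List<Datum> values) {
        Datum lastF = null;
        for (Datum val : values) {
            if (lastF == null) {
                lastF = val;
            } else if (val.hasFetchStatus()) {
                // take newer, but always overwrite a LINKED datum
                if (lastF.segmentName().compareTo(val.segmentName()) < 0
                        || !lastF.hasFetchStatus()) {
                    lastF = val;
                }
            }
        }
        return lastF;
    }

    public static void main(String[] args) {
        // The failing ordering: the newest segment holds a linked
        // datum and is passed to reduce() first.
        Datum linkedNewer = new Datum("20140113000002", false);
        Datum fetchOlder  = new Datum("20140113000001", true);
        Datum picked = select(List.of(linkedNewer, fetchOlder));
        // The older fetch datum wins over the newer linked datum.
        System.out.println(picked.segmentName()
                + " fetch=" + picked.hasFetchStatus());
    }
}
```

With the original "take newer" check alone, the newer linked datum would survive and the fetch status would be lost; the extra `!lastF.hasFetchStatus()` clause is what makes the result independent of the order in which reduce() receives the values.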

> Merging segments causes URLs to vanish from crawldb/index?
> ----------------------------------------------------------
>
>                 Key: NUTCH-1113
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1113
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.3
>            Reporter: Edward Drapkin
>            Priority: Blocker
>             Fix For: 1.9
>
>         Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
> NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
> NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
>
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
> nutch generate crawldb/ segments/ -normalize
> nutch fetch `ls -d segments/* | tail -1`
> nutch parse `ls -d segments/* | tail -1`
> nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/ (I forward ported the lucene 
> indexing code from Nutch 1.1).
> When I crawl with merging segments, I lose about 20% of the URLs that wind up 
> in the index vs. when I crawl without merging the segments.  Somehow the 
> segment merger causes me to lose ~20% of my crawl database!



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
