[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

Markus Jelsma (JIRA) Wed, 22 Jan 2014 01:25:07 -0800

     [ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Markus Jelsma updated NUTCH-1113:
---------------------------------

    Attachment: NUTCH-1113-junit.patch

The attached patch seems to fix the issue completely, finally! With this patch,
the segment merger:
* does not merge LINKED status
* does not merge fetch_retry status
* considers latest fetch datum
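
The three rules above can be sketched roughly as follows. This is only an
illustrative Python model of one reading of them, not the code in
NUTCH-1113-junit.patch: the `Datum` type, the status labels, and `pick_datum`
are hypothetical names, and the real merger is Java code working on Nutch's
own crawl datum records.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Hypothetical status labels standing in for Nutch's real status codes.
SKIPPED_STATUSES = {"linked", "fetch_retry"}


@dataclass(frozen=True)
class Datum:
    url: str
    status: str
    fetch_time: int  # assumed to be a comparable timestamp, e.g. epoch millis


def pick_datum(data: Iterable[Datum]) -> Optional[Datum]:
    """Return the most recently fetched datum for one URL.

    Entries with a skipped status are never chosen, so a later
    "linked" or "fetch_retry" entry cannot replace an earlier fetch.
    """
    latest = None
    for d in data:
        if d.status in SKIPPED_STATUSES:
            continue
        if latest is None or d.fetch_time > latest.fetch_time:
            latest = d
    return latest
```

Under this reading, a URL that was fetched successfully in an older segment
and only seen as a link or a retry in newer ones keeps its fetched entry after
the merge.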

Can anyone confirm the result? To do so you need a lot of segments, enough
that the whole set contains a good number of URLs that have been refetched in
the meantime. Index those segments in chronological order, segment by segment
(do not pass them all to the indexer via -dir; that is still a bug). Then merge
the segments with this patch applied and index the merged segment.

The number of indexed documents should be the same in both cases.
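
A small helper can prepare the per-segment indexing runs for this comparison.
The sketch below only builds command lines and does not run Nutch. It assumes
segment directories carry the usual timestamp names, so sorting by name gives
chronological order, and it copies the argument layout from the `nutch index`
call in the quoted report, which may differ in your Nutch version. The function
name and default paths are made up for illustration.

```python
import os
from typing import List


def per_segment_index_commands(segments_dir: str,
                               index: str = "index/",
                               crawldb: str = "crawldb/",
                               linkdb: str = "linkdb/") -> List[List[str]]:
    """Build one indexing command per segment, oldest segment first."""
    # Keep only directories; timestamp-style names sort chronologically.
    names = sorted(
        n for n in os.listdir(segments_dir)
        if os.path.isdir(os.path.join(segments_dir, n))
    )
    # Argument order is an assumption taken from the quoted report.
    return [
        ["nutch", "index", index, crawldb, linkdb, os.path.join(segments_dir, n)]
        for n in names
    ]
```

Running the resulting commands in order, then indexing the merged segment into
a separate index, gives the two document counts to compare.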

> Merging segments causes URLs to vanish from crawldb/index?
> ----------------------------------------------------------
>
>                 Key: NUTCH-1113
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1113
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.3
>            Reporter: Edward Drapkin
>            Priority: Blocker
>             Fix For: 1.9
>
>         Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
> NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
> NUTCH-1113-junit.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, 
> unmerged_segment_output.txt
>
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
> nutch generate crawldb/ segments/ -normalize
> nutch fetch `ls -d segments/* | tail -1`
> nutch parse `ls -d segments/* | tail -1`
> nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/ (I forward-ported the Lucene 
> indexing code from Nutch 1.1).
> When I crawl with merging segments, I lose about 20% of the URLs that wind up 
> in the index vs. when I crawl without merging the segments.  Somehow the 
> segment merger causes me to lose ~20% of my crawl database!



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)