[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

Sebastian Nagel (JIRA) Thu, 06 Mar 2014 14:00:09 -0800

     [ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sebastian Nagel updated NUTCH-1113:
-----------------------------------

    Attachment: NUTCH-1113-trunk-junit-fail.patch

Also fixed a second problem in the junit test: any segment except the first 
one may be empty at random. We must ensure that at least one CrawlDatum 
(linked or fetch) is in each segment.
With this patch the junit tests now pass.

> Merging segments causes URLs to vanish from crawldb/index?
> ----------------------------------------------------------
>
>                 Key: NUTCH-1113
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1113
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.3
>            Reporter: Edward Drapkin
>            Assignee: Markus Jelsma
>            Priority: Blocker
>             Fix For: 1.8
>
>         Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
> NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
> NUTCH-1113-junit.patch, NUTCH-1113-trunk-junit-fail.patch, 
> NUTCH-1113-trunk-junit-final.patch, NUTCH-1113-trunk.patch, 
> NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
>
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
> nutch generate crawldb/ segments/ -normalize
> nutch fetch `ls -d segments/* | tail -1`
> nutch parse `ls -d segments/* | tail -1`
> nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/ (I forward ported the lucene 
> indexing code from Nutch 1.1).
> When I crawl with merging segments, I lose about 20% of the URLs that wind up 
> in the index vs. when I crawl without merging the segments.  Somehow the 
> segment merger causes me to lose ~20% of my crawl database!
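
For anyone trying to reproduce the report, the crawl cycle quoted above could be scripted roughly as follows. This is only a sketch under assumptions, not the reporter's actual script: the function names (latest_segment, crawl_rounds, merge_and_invert) and the NUTCH variable are made up for illustration, and it uses `updatedb` on the assumption that this is the command the report abbreviates as `update`. The final `index` step is left out because it relied on the reporter's own forward-ported Lucene indexer.

```shell
#!/bin/sh
# Sketch of the crawl cycle from the report. NUTCH, latest_segment,
# crawl_rounds and merge_and_invert are hypothetical names; the directory
# names (crawldb/, segments/, merged/, linkdb/) come from the report.
NUTCH="${NUTCH:-nutch}"

# Print the last entry under the given segments directory. Segment names
# are timestamps, so the last one in sorted order is the newest.
latest_segment() {
  ls -d "$1"/* | tail -1
}

# Run the generate/fetch/parse/update cycle N times (the report uses 3).
crawl_rounds() {
  rounds="$1"
  i=0
  while [ "$i" -lt "$rounds" ]; do
    "$NUTCH" generate crawldb/ segments/ -normalize || return 1
    segment="$(latest_segment segments)" || return 1
    "$NUTCH" fetch "$segment" || return 1
    "$NUTCH" parse "$segment" || return 1
    # Assumption: the report's "nutch update" refers to "updatedb".
    "$NUTCH" updatedb crawldb "$segment" || return 1
    i=$((i + 1))
  done
}

# Merge all segments into merged/ and build the linkdb from the result.
merge_and_invert() {
  "$NUTCH" mergesegs merged/ -dir segments/ || return 1
  "$NUTCH" invertlinks linkdb/ -dir merged/ || return 1
}
```

After `nutch inject crawldb/ url.txt`, calling `crawl_rounds 3` followed by `merge_and_invert` would mirror the merged variant of the crawl; skipping `merge_and_invert` and pointing the later steps at segments/ gives the unmerged variant to compare against. Setting NUTCH=echo prints the commands instead of running them.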



--
This message was sent by Atlassian JIRA
(v6.2#6252)
