[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

Markus Jelsma (JIRA) Thu, 09 Jan 2014 07:37:54 -0800

     [ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Markus Jelsma updated NUTCH-1113:
---------------------------------

    Attachment: NUTCH-1113-junit.patch

Alright, manual testing did not go very well: it takes hours and is too 
cumbersome, so I cooked up a unit test for these issues. The patch also 
includes a failed attempt to make SegmentMerger implement Tool, plus 
commented-out versions of current trunk, NUTCH-1616 and NUTCH-1113 (single 
lines, though).

There are two unit tests, each based on a randomized set of segments 
containing a record with a random status. testRandomTestSequence() fails on 
current trunk but NOT with NUTCH-1113. testRandomTestSequenceWithRedirects() 
always fails! The latter injects redirects into the set of random records; 
this is the issue we must fix somehow.

There may be a problem with how I inject those redirects, but I think I got it 
right. If anyone here is able and willing to help out, I'd be really happy. 
This issue has haunted Nutch from the beginning and must be dealt with, 
preferably before we release 1.8!

Thanks,
Markus

> Merging segments causes URLs to vanish from crawldb/index?
> ----------------------------------------------------------
>
>                 Key: NUTCH-1113
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1113
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.3
>            Reporter: Edward Drapkin
>             Fix For: 1.9
>
>         Attachments: NUTCH-1113-junit.patch, NUTCH-1113-trunk.patch, 
> merged_segment_output.txt, unmerged_segment_output.txt
>
>
> When I run Nutch, I use the following steps:
>   nutch inject crawldb/ url.txt
> Repeated 3 times:
>   nutch generate crawldb/ segments/ -normalize
>   nutch fetch `ls -d segments/* | tail -1`
>   nutch parse `ls -d segments/* | tail -1`
>   nutch update crawldb `ls -d segments/* | tail -1`
> Then:
>   nutch mergesegs merged/ -dir segments/
>   nutch invertlinks linkdb/ -dir merged/
>   nutch index index/ crawldb/ linkdb/ -dir merged/
> (I forward-ported the Lucene indexing code from Nutch 1.1.)
> When I crawl with merging segments, I lose about 20% of the URLs that wind up 
> in the index compared with crawling without merging the segments.  Somehow 
> the segment merger causes me to lose ~20% of my crawl database!
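
For anyone trying to reproduce this, the steps quoted above could be wrapped in 
a small script along these lines. This is only a sketch: the commands are 
copied from the description, while the NUTCH and ROUNDS variables and the 
function names are hypothetical helpers introduced here for illustration. Like 
the original commands, it assumes that the last entry listed under segments/ 
is the segment that was just generated.

```shell
#!/bin/sh
# Sketch of the crawl cycle from the issue description.
# NUTCH and ROUNDS are hypothetical knobs, not part of the original report.
NUTCH="${NUTCH:-nutch}"
ROUNDS="${ROUNDS:-3}"

# Print the last segment directory, exactly as the original commands do.
latest_segment() {
    ls -d segments/* | tail -1
}

crawl_cycle() {
    $NUTCH inject crawldb/ url.txt

    # Generate/fetch/parse/update, repeated ROUNDS times.
    i=0
    while [ "$i" -lt "$ROUNDS" ]; do
        $NUTCH generate crawldb/ segments/ -normalize
        seg=$(latest_segment)
        $NUTCH fetch "$seg"
        $NUTCH parse "$seg"
        $NUTCH update crawldb "$seg"
        i=$((i + 1))
    done

    # Merge all segments, then build the linkdb and index from the merged one.
    $NUTCH mergesegs merged/ -dir segments/
    $NUTCH invertlinks linkdb/ -dir merged/
    $NUTCH index index/ crawldb/ linkdb/ -dir merged/
}
```

Calling crawl_cycle runs the whole sequence. To get the unmerged baseline for 
comparison, the mergesegs step would be skipped and the last two commands 
pointed at segments/ instead of merged/.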



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
