[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1113:
---------------------------------
Attachment: NUTCH-1113-junit.patch
Alright, manual testing did not go very well: it takes hours and is too
cumbersome, so I cooked up a unit test for these issues. The patch also includes
a failed attempt to make SegmentMerger implement Tool, plus commented-out
versions of current trunk, NUTCH-1616 and NUTCH-1113 (single lines though).
There are two unit tests, each based on a randomized set of segments containing
a record with a random status. testRandomTestSequence() fails on current trunk
but NOT with NUTCH-1113. testRandomTestSequenceWithRedirects() always fails! The
latter injects redirects into the set of random records; this is the issue we
must fix somehow.
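
To make the idea behind these tests concrete, here is a minimal, self-contained
sketch of a randomized merge check. It is NOT the code in
NUTCH-1113-junit.patch and it does not use Nutch's SegmentMerger or CrawlDatum;
the class SegmentMergeSketch, the Rec type, the "newest record wins" rule and
the example.org URLs are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

public class SegmentMergeSketch {

    /** Simplified stand-in for a segment entry (hypothetical, not Nutch's CrawlDatum). */
    static final class Rec {
        final String url;
        final int status; // arbitrary status code, chosen at random
        final long time;  // larger value means a newer record

        Rec(String url, int status, long time) {
            this.url = url;
            this.status = status;
            this.time = time;
        }
    }

    /** Builds a seeded, reproducible set of segments with random statuses. */
    static List<List<Rec>> randomSegments(long seed, int segments, int urls) {
        Random rnd = new Random(seed);
        List<List<Rec>> out = new ArrayList<>();
        for (int s = 0; s < segments; s++) {
            List<Rec> seg = new ArrayList<>();
            for (int u = 0; u < urls; u++) {
                // The first segment holds every URL; later ones hold a random subset.
                if (s == 0 || rnd.nextBoolean()) {
                    seg.add(new Rec("http://example.org/" + u, rnd.nextInt(5), s));
                }
            }
            out.add(seg);
        }
        return out;
    }

    /** Toy merge rule (an assumption): keep the newest record for each URL. */
    static Map<String, Rec> merge(List<List<Rec>> segments) {
        Map<String, Rec> merged = new HashMap<>();
        for (List<Rec> seg : segments) {
            for (Rec r : seg) {
                Rec old = merged.get(r.url);
                if (old == null || r.time >= old.time) {
                    merged.put(r.url, r);
                }
            }
        }
        return merged;
    }

    /** Collects every URL that appears in at least one input segment. */
    static Set<String> allUrls(List<List<Rec>> segments) {
        Set<String> urls = new HashSet<>();
        for (List<Rec> seg : segments) {
            for (Rec r : seg) {
                urls.add(r.url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        List<List<Rec>> segs = randomSegments(42L, 5, 20);
        Map<String, Rec> merged = merge(segs);
        // The property under test: merging must not make any URL vanish.
        if (!merged.keySet().equals(allUrls(segs))) {
            throw new AssertionError("URLs vanished during merge");
        }
        System.out.println("merged " + merged.size() + " URLs, none lost");
    }
}
```

A fixed seed keeps a failing sequence reproducible. The redirect case is not
modelled here, since how redirects should be merged is exactly the open
question.
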
There may be a problem with how I inject those redirects, but I think I got it
right. If anyone here is able and willing to help out, I'd be really happy.
This issue has haunted Nutch from the beginning and must be dealt with,
preferably before we release 1.8!
Thanks,
Markus
> Merging segments causes URLs to vanish from crawldb/index?
> ----------------------------------------------------------
>
> Key: NUTCH-1113
> URL: https://issues.apache.org/jira/browse/NUTCH-1113
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.3
> Reporter: Edward Drapkin
> Fix For: 1.9
>
> Attachments: NUTCH-1113-junit.patch, NUTCH-1113-trunk.patch,
> merged_segment_output.txt, unmerged_segment_output.txt
>
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
> nutch generate crawldb/ segments/ -normalize
> nutch fetch `ls -d segments/* | tail -1`
> nutch parse `ls -d segments/* | tail -1`
> nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/ (I forward-ported the Lucene
> indexing code from Nutch 1.1).
> When I crawl with merging segments, I lose about 20% of the URLs that wind up
> in the index vs. when I crawl without merging the segments. Somehow the
> segment merger causes me to lose ~20% of my crawl database!
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)