Merging segments causes URLs to vanish from crawldb/index?
----------------------------------------------------------
Key: NUTCH-1113
URL: https://issues.apache.org/jira/browse/NUTCH-1113
Project: Nutch
Issue Type: Bug
Affects Versions: 1.3
Reporter: Edward Drapkin
When I run Nutch, I use the following steps:

  nutch inject crawldb/ url.txt

Repeated 3 times:

  nutch generate crawldb/ segments/ -normalize
  nutch fetch `ls -d segments/* | tail -1`
  nutch parse `ls -d segments/* | tail -1`
  nutch update crawldb `ls -d segments/* | tail -1`

Then, once:

  nutch mergesegs merged/ -dir segments/
  nutch invertlinks linkdb/ -dir merged/
  nutch index index/ crawldb/ linkdb/ -dir merged/

(I forward-ported the Lucene indexing code from Nutch 1.1, which is where the
index command comes from.)
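
For anyone trying to reproduce this, the steps above can be sketched as a
single shell function. This is only a sketch: the commands and directory names
are copied verbatim from the report, the function name crawl_and_merge and its
rounds argument are made up for illustration, and it assumes nutch is on the
PATH and that the newest segment sorts last under segments/.

```shell
# crawl_and_merge [ROUNDS]
# Hypothetical wrapper around the reported steps; ROUNDS defaults to 3.
crawl_and_merge() {
    rounds=${1:-3}

    nutch inject crawldb/ url.txt || return 1

    i=0
    while [ "$i" -lt "$rounds" ]; do
        nutch generate crawldb/ segments/ -normalize || return 1

        # Resolve the newest segment once per round instead of re-running
        # ls for every command (assumes segment names sort chronologically).
        seg=$(ls -d segments/* | tail -1)

        nutch fetch "$seg" || return 1
        nutch parse "$seg" || return 1
        nutch update crawldb "$seg" || return 1
        i=$((i + 1))
    done

    # Merge all segments, then build the linkdb and index from the merged copy.
    nutch mergesegs merged/ -dir segments/ || return 1
    nutch invertlinks linkdb/ -dir merged/ || return 1

    # "index" is the forward-ported Lucene indexer mentioned in the report,
    # not necessarily a stock Nutch 1.3 command.
    nutch index index/ crawldb/ linkdb/ -dir merged/
}
```

Calling crawl_and_merge 3 from an empty working directory containing url.txt
should mirror the run described above.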
When I crawl with segment merging, about 20% fewer URLs end up in the index
than when I crawl without merging the segments. Somehow the segment merger
causes me to lose ~20% of my crawl database!
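
To pin down which URLs go missing, the two runs can be compared directly. The
sketch below assumes the indexed URLs of each run have already been exported
to plain-text files, one URL per line; how that export is done is left open,
and the function name missing_urls and the file names in the usage note are
hypothetical.

```shell
# missing_urls UNMERGED MERGED
# Prints every URL found in UNMERGED but absent from MERGED, one per line.
missing_urls() {
    mu_a=$(mktemp) || return 1
    mu_b=$(mktemp) || { rm -f "$mu_a"; return 1; }

    # comm needs both inputs sorted under the same collation; sort -u also
    # drops duplicate URLs so each one is reported at most once.
    LC_ALL=C sort -u "$1" > "$mu_a"
    LC_ALL=C sort -u "$2" > "$mu_b"

    # -23 suppresses lines unique to MERGED and lines common to both,
    # leaving only the URLs that vanished after merging.
    LC_ALL=C comm -23 "$mu_a" "$mu_b"

    rm -f "$mu_a" "$mu_b"
}
```

For example, missing_urls unmerged-urls.txt merged-urls.txt | wc -l gives the
number of lost URLs, which can be checked against the ~20% figure.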