[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1113:
---------------------------------

    Fix Version/s:     (was: 1.4)
                   1.5

> Merging segments causes URLs to vanish from crawldb/index?
> ----------------------------------------------------------
>
>                 Key: NUTCH-1113
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1113
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.3
>            Reporter: Edward Drapkin
>             Fix For: 1.5
>
>         Attachments: merged_segment_output.txt, unmerged_segment_output.txt
>
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
> nutch generate crawldb/ segments/ -normalize
> nutch fetch `ls -d segments/* | tail -1`
> nutch parse `ls -d segments/* | tail -1`
> nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/ (I forward ported the lucene
> indexing code from Nutch 1.1).
> When I crawl with merging segments, I lose about 20% of the URLs that wind up
> in the index vs. when I crawl without merging the segments. Somehow the
> segment merger causes me to lose ~20% of my crawl database!
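
One way to narrow down which URLs disappear is to diff the URL sets from the merged and unmerged runs. The sketch below is not part of the original report. It assumes each run's URLs have already been exported to a plain-text file with one URL per line; the format of the attached output files is not shown here, so that layout, the function name `missing_urls`, and the file names in the usage line are all hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print URLs that appear in the first list but not in the second.
# Assumption: both inputs are plain text with exactly one URL per line.
missing_urls() {
    unmerged_list=$1
    merged_list=$2

    tmp_a=$(mktemp) || return 1
    tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 1; }

    # comm needs both inputs sorted in the same collation order;
    # sort -u also collapses duplicate URLs.
    LC_ALL=C sort -u "$unmerged_list" > "$tmp_a"
    LC_ALL=C sort -u "$merged_list" > "$tmp_b"

    # -2 drops lines found only in the second file, -3 drops lines common to both,
    # leaving the lines found only in the first file.
    LC_ALL=C comm -23 "$tmp_a" "$tmp_b"

    rm -f "$tmp_a" "$tmp_b"
}
```

A call such as `missing_urls unmerged_urls.txt merged_urls.txt | wc -l` would then give the number of URLs lost in the merged run, which could be checked against the roughly 20% figure reported above.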