[
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1113:
---------------------------------
Attachment: NUTCH-1113-junit.patch
Attached patch seems to completely fix the issue, finally!
* does not merge LINKED status
* does not merge fetch_retry status
* considers latest fetch datum
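
As a rough illustration of the three rules above (not the patch's actual
code), the sketch below applies them to a hypothetical text dump with one
"url status fetch_time" line per datum. The dump format and the function name
pick_latest_fetch are assumptions made for this example; only the status names
follow Nutch's CrawlDatum status names.

```shell
# Sketch only: reads hypothetical "url status fetch_time" lines on stdin and
# prints, per URL, the fetch datum that would survive a merge.
pick_latest_fetch() {
    awk '
        # Rules 1 and 2: linked and fetch_retry data are never merged.
        $2 == "linked" || $2 == "fetch_retry" { next }
        # Rule 3: among the remaining data, keep the latest fetch per URL.
        !($1 in best) || $3 + 0 > best[$1] + 0 { best[$1] = $3; status[$1] = $2 }
        END { for (u in best) print u, status[u], best[u] }
    ' | sort   # awk iterates in arbitrary order, so sort for stable output
}
```

With this rule a later fetch_retry or linked entry can no longer replace an
earlier successful fetch of the same URL.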
Can anyone confirm the result? To do so you need a lot of segments, enough
that together they contain a good number of URLs that have been refetched in
the meantime. Index those segments in chronological order, segment by segment
(do not pass them all to the indexer via -dir; that is still a bug). Then
merge the segments with this patch applied and index the merged segment.
The number of indexed documents should be the same in both cases.
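
The check could be scripted roughly as follows. This is a dry-run sketch that
only prints the commands for the two runs. The function names are made up for
the example, and the indexer invocation is passed in as a prefix because it
differs between setups (the issue description below uses
"nutch index index/ crawldb/ linkdb/"). Whether your indexer accepts a single
segment path as its last argument is an assumption to verify first.

```shell
# Print one index command per segment, oldest first. Segment directories are
# named by timestamp, so the shell's sorted glob order is chronological.
plan_per_segment() {
    index_cmd=$1       # indexer command prefix (setup-specific, assumed)
    segments_dir=$2
    for seg in "$segments_dir"/*; do
        [ -d "$seg" ] || continue
        echo "$index_cmd $seg"
    done
}

# Print the merge command followed by the command that indexes the merged segment.
plan_merged() {
    index_cmd=$1
    segments_dir=$2
    echo "nutch mergesegs merged/ -dir $segments_dir/"
    echo "$index_cmd -dir merged/"
}
```

Run the two plans against separate indexes and compare the document counts
afterwards.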
> Merging segments causes URLs to vanish from crawldb/index?
> ----------------------------------------------------------
>
> Key: NUTCH-1113
> URL: https://issues.apache.org/jira/browse/NUTCH-1113
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.3
> Reporter: Edward Drapkin
> Priority: Blocker
> Fix For: 1.9
>
> Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch,
> NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch,
> NUTCH-1113-junit.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt,
> unmerged_segment_output.txt
>
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
> nutch generate crawldb/ segments/ -normalize
> nutch fetch `ls -d segments/* | tail -1`
> nutch parse `ls -d segments/* | tail -1`
> nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/ (I forward ported the lucene
> indexing code from Nutch 1.1).
> When I crawl with merging segments, I lose about 20% of the URLs that wind up
> in the index vs. when I crawl without merging the segments. Somehow the
> segment merger causes me to lose ~20% of my crawl database!