[
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870690#comment-13870690
]
Markus Jelsma commented on NUTCH-1113:
--------------------------------------
This works too, but if we ditch most LINKED datums anyway, why not ditch them
all? It is true that we cannot properly update or rebuild the LinkDB or the
webgraph's OutlinkDB from a merged segment, but that was always the case. I
don't really care about that when merging segments; the most important thing
is that reindexing gives you stable output.
I'm okay with your approach as well as with ditching them all. The difference
in size is negligible: mine is 0.0125% smaller, and both contain 1113440 records.
> Merging segments causes URLs to vanish from crawldb/index?
> ----------------------------------------------------------
>
> Key: NUTCH-1113
> URL: https://issues.apache.org/jira/browse/NUTCH-1113
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.3
> Reporter: Edward Drapkin
> Priority: Blocker
> Fix For: 1.9
>
> Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch,
> NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch,
> NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
>
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
> nutch generate crawldb/ segments/ -normalize
> nutch fetch `ls -d segments/* | tail -1`
> nutch parse `ls -d segments/* | tail -1`
> nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/ (I forward-ported the Lucene
> indexing code from Nutch 1.1).
> When I crawl with merging segments, I lose about 20% of the URLs that wind up
> in the index vs. when I crawl without merging the segments. Somehow the
> segment merger causes me to lose ~20% of my crawl database!
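
For anyone trying to reproduce this, the steps quoted above can be collected into one script. This is only a sketch: the commands and paths are copied as written in the report and are not checked against any particular Nutch version. The function names (latest_segment, crawl_round, run_pipeline) and the NUTCH override variable are made up for illustration and are not part of Nutch.

```shell
#!/bin/sh
# Sketch of the crawl-then-merge workflow from the issue description.
# NUTCH is a hypothetical override (not a Nutch feature) so the steps can be
# dry-run, e.g. with NUTCH="echo nutch".
NUTCH="${NUTCH:-nutch}"

# Print the newest segment directory under $1. This relies on segment names
# being timestamps that sort lexically, as the report's `ls | tail -1` does.
latest_segment() {
    ls -d "$1"/* | tail -1
}

# One generate/fetch/parse/update round, with commands as listed in the report.
crawl_round() {
    $NUTCH generate crawldb/ segments/ -normalize || return 1
    seg=$(latest_segment segments) || return 1
    $NUTCH fetch "$seg" || return 1
    $NUTCH parse "$seg" || return 1
    $NUTCH update crawldb "$seg"
}

# Full pipeline: inject, three rounds, then merge, invert links and index.
run_pipeline() {
    $NUTCH inject crawldb/ url.txt || return 1
    for round in 1 2 3; do
        crawl_round || return 1
    done
    $NUTCH mergesegs merged/ -dir segments/ || return 1
    $NUTCH invertlinks linkdb/ -dir merged/ || return 1
    $NUTCH index index/ crawldb/ linkdb/ -dir merged/
}
```

Setting NUTCH="echo nutch" before calling run_pipeline prints the commands instead of running them. To get the unmerged comparison from the report, the mergesegs step would be skipped and the last two commands pointed at segments/ instead of merged/.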