[
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105693#comment-13105693
]
Edward Drapkin commented on NUTCH-1113:
---------------------------------------
Comparing the output of this command:
nutch readseg -get merged/20110915145111/ "http://www.wolfram.com/mathematica/" -nocontent -noparsetext -noparse
with the output of this command:
nutch readseg -get segments/20110915144153/ "http://www.wolfram.com/mathematica/" -nocontent -noparsetext -noparse
shows some differences.
The most obvious difference is this:
== Unmerged segment:
Crawl Fetch::
Version: 7
Status: 33 (fetch_success)
Fetch time: Thu Sep 15 14:41:57 CDT 2011
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 150000000 seconds (1736 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1316115711544_pst_: success(1), lastModified=0
== Merged segment:
Crawl Fetch::
Version: 7
Status: 67 (linked)
Fetch time: Thu Sep 15 14:43:30 CDT 2011
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 150000000 seconds (1736 days)
Score: 0.0
Signature: null
Metadata: _ngt_: 1316115784243_pst_: moved(12), lastModified=0: http://www.wolfram.com/mathematica/_repr_: http://www.wolfram.com/mathematica/
I attached the full -get output from both segments for the URL in question
(it's a URL I know vanishes after merging segments).
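For anyone who wants to compare the two records field by field instead of reading the dumps side by side, here is a minimal sketch. It assumes a POSIX shell with awk. The function name crawl_fields_diff and the file names in the usage line are hypothetical and not part of Nutch. It also assumes each file holds a single record block, like the Crawl Fetch blocks quoted above, because a repeated field name would overwrite the earlier value.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Nutch): print every "Key: value" field
# whose value differs between two saved record dumps.
# Assumption: each file contains exactly one record block.
crawl_fields_diff() {
  awk '
    # Split each line at the first ": " into a field name and a value.
    {
      pos = index($0, ": ")
      if (pos == 0) next                   # skip headings such as "Crawl Fetch::"
      key = substr($0, 1, pos - 1)
      val = substr($0, pos + 2)
    }
    NR == FNR { first[key] = val; next }   # first file: remember each value
    (key in first) && first[key] != val {  # second file: report changed fields
      printf "%s: %s -> %s\n", key, first[key], val
    }
  ' "$1" "$2"
}

# Usage (hypothetical file names, each holding one Crawl Fetch block):
#   crawl_fields_diff unmerged_fetch.txt merged_fetch.txt
```

Run against the two Crawl Fetch blocks above, this should flag Status, Fetch time, Score and Metadata, while Version, Modified time, Retries since fetch, Retry interval and Signature are identical and stay silent.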
> Merging segments causes URLs to vanish from crawldb/index?
> ----------------------------------------------------------
>
> Key: NUTCH-1113
> URL: https://issues.apache.org/jira/browse/NUTCH-1113
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.3
> Reporter: Edward Drapkin
> Attachments: merged_segment_output.txt, unmerged_segment_output.txt
>
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
> nutch generate crawldb/ segments/ -normalize
> nutch fetch `ls -d segments/* | tail -1`
> nutch parse `ls -d segments/* | tail -1`
> nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/ (I forward ported the lucene
> indexing code from Nutch 1.1).
> When I crawl with merging segments, I lose about 20% of the URLs that wind up
> in the index vs. when I crawl without merging the segments. Somehow the
> segment merger causes me to lose ~20% of my crawl database!