[jira] [Commented] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

Edward Drapkin (JIRA) Thu, 15 Sep 2011 14:21:32 -0700

    [ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105693#comment-13105693 ]


Edward Drapkin commented on NUTCH-1113:
---------------------------------------

Running this command against the merged segment:

nutch readseg -get merged/20110915145111/ "http://www.wolfram.com/mathematica/" -nocontent -noparsetext -noparse

and this command against the unmerged segment:

nutch readseg -get segments/20110915144153/ "http://www.wolfram.com/mathematica/" -nocontent -noparsetext -noparse

produces different output for the same URL.

The most obvious difference is in the Crawl Fetch entry:

== Unmerged segment: 
Crawl Fetch::
Version: 7
Status: 33 (fetch_success)
Fetch time: Thu Sep 15 14:41:57 CDT 2011
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 150000000 seconds (1736 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1316115711544_pst_: success(1), lastModified=0


== Merged segment:
Crawl Fetch::
Version: 7
Status: 67 (linked)
Fetch time: Thu Sep 15 14:43:30 CDT 2011
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 150000000 seconds (1736 days)
Score: 0.0
Signature: null
Metadata: _ngt_: 1316115784243_pst_: moved(12), lastModified=0: http://www.wolfram.com/mathematica/_repr_: http://www.wolfram.com/mathematica/

I attached the full -get output from both segments for the URL in question 
(it's a URL I know vanishes after merging segments).
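
To compare the two dumps field by field rather than by eye, something like the following Python sketch could be used. It assumes the "Key: value" layout shown above and only looks at the first "Crawl Fetch::" block. parse_crawl_fetch and diff_fields are hypothetical helper names (not part of Nutch), and the sample text is abbreviated from the output quoted above.

```python
def parse_crawl_fetch(text):
    """Return the 'Key: value' fields of the first 'Crawl Fetch::' block."""
    fields = {}
    in_block = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Crawl Fetch::"):
            in_block = True
            continue
        if not in_block:
            continue
        if not stripped:
            break  # assumption: a blank line ends the block
        key, sep, value = stripped.partition(": ")
        if sep:
            fields[key] = value
    return fields


def diff_fields(a, b):
    """Map each differing field name to its (a, b) pair of values."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Abbreviated copies of the two dumps quoted in this comment.
UNMERGED = """Crawl Fetch::
Version: 7
Status: 33 (fetch_success)
Fetch time: Thu Sep 15 14:41:57 CDT 2011
Score: 1.0
Signature: null
"""

MERGED = """Crawl Fetch::
Version: 7
Status: 67 (linked)
Fetch time: Thu Sep 15 14:43:30 CDT 2011
Score: 0.0
Signature: null
"""

differences = diff_fields(parse_crawl_fetch(UNMERGED), parse_crawl_fetch(MERGED))
for name, (before, after) in differences.items():
    print(f"{name}: {before!r} -> {after!r}")
```

In practice the two strings would be read from the attached output files instead of being pasted inline.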



> Merging segments causes URLs to vanish from crawldb/index?
> ----------------------------------------------------------
>
>                 Key: NUTCH-1113
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1113
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.3
>            Reporter: Edward Drapkin
>         Attachments: merged_segment_output.txt, unmerged_segment_output.txt
>
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
> nutch generate crawldb/ segments/ -normalize
> nutch fetch `ls -d segments/* | tail -1`
> nutch parse `ls -d segments/* | tail -1`
> nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/ (I forward-ported the Lucene 
> indexing code from Nutch 1.1).
> When I crawl with merging segments, I lose about 20% of the URLs that wind up 
> in the index vs. when I crawl without merging the segments.  Somehow the 
> segment merger causes me to lose ~20% of my crawl database!
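
One way to put a number on the loss described above is a set difference between the URLs indexed without merging and those indexed with merging. The Python sketch below is only an illustration: missing_urls is a hypothetical helper, the example.com URLs are made up, and how the two URL lists are exported from each index is left open.

```python
def missing_urls(baseline, candidate):
    """Return (sorted URLs in baseline but not in candidate, lost fraction)."""
    baseline_set = set(baseline)
    lost = sorted(baseline_set - set(candidate))
    fraction = len(lost) / len(baseline_set) if baseline_set else 0.0
    return lost, fraction


# Made-up URLs standing in for the two crawls.
without_merge = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/c",
    "http://example.com/d",
    "http://example.com/e",
]
with_merge = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/c",
    "http://example.com/e",
]

lost, fraction = missing_urls(without_merge, with_merge)
print(f"{len(lost)} URL(s) missing after merging ({fraction:.0%})")
for url in lost:
    print(url)
```

The list of missing URLs is the useful part: each one can then be fed to readseg -get against both the merged and unmerged segments.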

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
