[ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193115#comment-13193115
 ] 

Sebastian Nagel commented on NUTCH-1113:
----------------------------------------

I had a look at the attached segment dumps: the merged data is far larger than 
the unmerged data, and it contains 739 identical linked CrawlDatum objects. 
Maybe this is an artifact of NUTCH-1252?
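
For reference, a rough sketch of how the per-status counts in a segment's 
crawl_fetch can be reproduced (this is not part of Nutch: the class name and 
the part-00000 path are just for illustration, and "nutch readseg -dump" shows 
the same information as plain text):
{code:java}
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

/** Count crawl_fetch entries per CrawlDatum status in one segment part. */
public class CrawlFetchStatusCounter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // crawl_fetch of a segment is a set of MapFiles: <segment>/crawl_fetch/part-*/
    Path part = new Path(args[0], "crawl_fetch/part-00000");
    MapFile.Reader reader = new MapFile.Reader(fs, part.toString(), conf);
    Text url = new Text();
    CrawlDatum datum = new CrawlDatum();
    Map<Byte, Integer> counts = new HashMap<Byte, Integer>();
    while (reader.next(url, datum)) {
      Byte status = datum.getStatus();
      Integer c = counts.get(status);
      counts.put(status, c == null ? 1 : c + 1);
    }
    reader.close();
    for (Map.Entry<Byte, Integer> e : counts.entrySet()) {
      System.out.println(e.getKey() + " ("
          + CrawlDatum.getStatusName(e.getKey()) + "): " + e.getValue());
    }
  }
}
{code}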

{quote}
it seems that all pages that:

A) Have already been fetched *AND*
B) Are set as the location of a redirect in subsequent iterations through the 
crawl process

will be "lost" after a segment merge.
{quote}
I've run into the same situation.

{quote}
SegmentMerger makes the assumption that all values from the newest segment are 
preferable, so in this case, there's a crawl_fetch segment piece for this URL 
in two segments. In the first segment, it's marked Status 33 (fetch success) 
and in the second segment, it's marked Status 67 (linked), so the status 67 
overwrites the status 33 crawl_fetch segment piece. From there, the URL data is 
excluded (correctly) from the index, because it's not marked as fetch success.
{quote}
I think this assumption is ok, but it is necessary to preserve more than just 
the latest CrawlDatum (see the sketch after this list):
# at least the latest out of {FETCH_SUCCESS, FETCH_GONE, FETCH_RETRY, 
FETCH_REDIR*}
# possibly also the latest FETCH_NOTMODIFIED (when re-indexing all segments, 
IndexerMapReduce does not index documents that only have a FETCH_NOTMODIFIED)
# possibly all linked CrawlDatums in crawl_fetch of the latest segment 
(similarly to those in crawl_parse)
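
A minimal sketch of such a selection rule, assuming a reduce-side choice among 
all crawl_fetch values collected for one URL. This is not the actual 
SegmentMerger code: the class and method names are made up, and it orders by 
fetch time where the real merger orders by segment name:
{code:java}
import java.util.List;
import org.apache.nutch.crawl.CrawlDatum;

public class FetchDatumSelector {

  /** True for statuses produced by an actual fetch attempt. */
  static boolean isRealFetch(byte status) {
    switch (status) {
      case CrawlDatum.STATUS_FETCH_SUCCESS:      // 33 in the dumps above
      case CrawlDatum.STATUS_FETCH_GONE:
      case CrawlDatum.STATUS_FETCH_RETRY:
      case CrawlDatum.STATUS_FETCH_REDIR_TEMP:
      case CrawlDatum.STATUS_FETCH_REDIR_PERM:
      case CrawlDatum.STATUS_FETCH_NOTMODIFIED:
        return true;
      default:
        return false;                            // e.g. STATUS_LINKED (67)
    }
  }

  /** Pick the datum to keep for one URL out of all merged segments. */
  static CrawlDatum select(List<CrawlDatum> values) {
    CrawlDatum bestFetch = null;   // latest datum with a real fetch status
    CrawlDatum newest = null;      // latest datum of any status (fallback)
    for (CrawlDatum d : values) {
      if (newest == null || d.getFetchTime() > newest.getFetchTime()) {
        newest = d;
      }
      if (isRealFetch(d.getStatus())
          && (bestFetch == null || d.getFetchTime() > bestFetch.getFetchTime())) {
        bestFetch = d;
      }
    }
    // Never let a bare linked datum overwrite a real fetch result.
    return bestFetch != null ? bestFetch : newest;
  }
}
{code}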
                
> Merging segments causes URLs to vanish from crawldb/index?
> ----------------------------------------------------------
>
>                 Key: NUTCH-1113
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1113
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.3
>            Reporter: Edward Drapkin
>             Fix For: 1.5
>
>         Attachments: merged_segment_output.txt, unmerged_segment_output.txt
>
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
> nutch generate crawldb/ segments/ -normalize
> nutch fetch `ls -d segments/* | tail -1`
> nutch parse `ls -d segments/* | tail -1`
> nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/ (I forward ported the lucene 
> indexing code from Nutch 1.1).
> When I crawl with merging segments, I lose about 20% of the URLs that wind up 
> in the index vs. when I crawl without merging the segments.  Somehow the 
> segment merger causes me to lose ~20% of my crawl database!

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
