[jira] [Comment Edited] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908265#comment-13908265 ]

Sebastian Nagel edited comment on NUTCH-1113 at 2/28/14 2:45 PM:
-----------------------------------------------------------------

Hi [~markus17],

your patch should work (I've tested it exactly the same way). The indexer was run with {{indexer.skip.notmodified == false}}. The problem is that fetch_success datums have been lost in the merged segment, and the following test skipped these URLs:
{code}
if (!parseData.getStatus().isSuccess()
    || fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
  return;
}
{code}

Just to clarify that we use the same test set-up:
# start with an empty index
# index (case A) segments in chronological order or (case B) merged segment
# compare both indexes

The CrawlDb was updated with URLs from all segments. The same CrawlDb is used for all index runs, right?
I plan to run the test with {{indexer.skip.notmodified == false}}. Otherwise, the index will not contain any pages with status notmodified.

> Merging segments causes URLs to vanish from crawldb/index?
> ----------------------------------------------------------
>
>                 Key: NUTCH-1113
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1113
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.3
>            Reporter: Edward Drapkin
>            Priority: Blocker
>             Fix For: 1.9
>
>         Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch,
> NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch,
> NUTCH-1113-junit.patch, NUTCH-1113-trunk.patch, NUTCH-1113-trunk.patch,
> merged_segment_output.txt, unmerged_segment_output.txt
>
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
> nutch generate crawldb/ segments/ -normalize
> nutch fetch `ls -d segments/* | tail -1`
> nutch parse `ls -d segments/* | tail -1`
> nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/ (I forward ported the lucene
> indexing code from Nutch 1.1).
> When I crawl with merging segments, I lose about 20% of the URLs that wind up
> in the index vs. when I crawl without merging the segments. Somehow the
> segment merger causes me to lose ~20% of my crawl database!

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
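
The check quoted in the comment above can be illustrated in isolation. This is only a minimal sketch, not the Nutch indexer code: the class {{IndexSkipSketch}}, the enum {{Status}} and the method {{passesCheck}} are hypothetical stand-ins invented for illustration. It shows why a URL is skipped as soon as the datum that reaches the check is no longer fetch_success, even though the parse succeeded.

```java
public class IndexSkipSketch {

    // Hypothetical stand-in for the CrawlDatum status constants; not the Nutch class.
    enum Status { FETCH_SUCCESS, FETCH_NOTMODIFIED, LINKED }

    // Inverse of the quoted condition: true means the URL is NOT skipped.
    static boolean passesCheck(boolean parseSuccess, Status fetchStatus) {
        return parseSuccess && fetchStatus == Status.FETCH_SUCCESS;
    }

    public static void main(String[] args) {
        // The fetch_success datum is still present: the URL goes on to be indexed.
        System.out.println(passesCheck(true, Status.FETCH_SUCCESS));
        // The fetch_success datum was lost and another status is left: the URL is skipped.
        System.out.println(passesCheck(true, Status.FETCH_NOTMODIFIED));
    }
}
```

Under these assumptions, comparing an index built from the individual segments with one built from the merged segment would show a difference for exactly those URLs whose fetch_success datum did not survive the merge.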
[jira] [Comment Edited] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876036#comment-13876036 ]

Sebastian Nagel edited comment on NUTCH-1113 at 1/19/14 11:36 PM:
------------------------------------------------------------------

When (re)indexing multiple segments, IndexerMapReduce should also take care of the ordering of values: newer ones must overwrite older ones. That is not done explicitly; it is just implicitly assumed that later values are newer. If indexing is done segment by segment (in chronological order), this should be no problem. See NUTCH-1617.
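
The difference between the implicit rule ("later values are newer") and an explicit rule ("the newest value wins") can be sketched as follows. This is a simplified illustration under stated assumptions, not code from IndexerMapReduce: the class {{NewestValueSketch}}, the type {{Value}}, both methods, the segment names and the timestamps are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class NewestValueSketch {

    // Hypothetical, simplified value; not the Nutch CrawlDatum class.
    static final class Value {
        final String segment;   // illustrative segment name
        final long fetchTime;   // illustrative timestamp

        Value(String segment, long fetchTime) {
            this.segment = segment;
            this.fetchTime = fetchTime;
        }
    }

    // Implicit rule: whatever value arrives last overwrites the earlier ones.
    static Value lastSeen(List<Value> values) {
        Value result = null;
        for (Value v : values) {
            result = v;
        }
        return result;
    }

    // Explicit rule: keep the value with the greatest fetch time, whatever the arrival order.
    static Value newest(List<Value> values) {
        Value result = null;
        for (Value v : values) {
            if (result == null || v.fetchTime > result.fetchTime) {
                result = v;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Value older = new Value("segment-1", 100L);
        Value newer = new Value("segment-2", 200L);
        // The values arrive out of chronological order.
        List<Value> outOfOrder = Arrays.asList(newer, older);
        System.out.println(lastSeen(outOfOrder).segment); // segment-1
        System.out.println(newest(outOfOrder).segment);   // segment-2
    }
}
```

When the values happen to arrive in chronological order, both rules return the same value, which matches the remark that indexing segment by segment should be no problem.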
[jira] [Comment Edited] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869439#comment-13869439 ]

Markus Jelsma edited comment on NUTCH-1113 at 1/13/14 11:54 AM:
----------------------------------------------------------------

Sebastian's patch does indeed solve a few problems, but I still see issues. I have another Solr cluster loaded with merged data; that segment does not contain LINKED at all. I found a URL that is indexed on that cluster but not on the cluster I loaded using Sebastian's patch. I've tracked the URL down and found it in several segments. These are all the CrawlDatums I have for that URL, in order:

{code}
Segment: 20131223125654
Segment:
Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 23 12:57:33 UTC 2013
Modified time: Thu Jan 01 00:00:00 UTC 1970
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.0
Signature: null
Metadata:
  _ngt_=1387803327158
  Content-Type=text/html
  _pst_=temp_moved(13), lastModified=0: https://shop.example.org/tenue/wedstrijd-shirt-thuis-2014-2015.html
  _repr_=http://www.example.org/shirt14-15/

Segment: 20131227011558
Segment:
Version: 7
Status: 33 (fetch_success)
Fetch time: Fri Dec 27 01:19:38 UTC 2013
Modified time: Mon Dec 16 13:12:43 UTC 2013
Retries since fetch: 0
Retry interval: 907200 seconds (10 days)
Score: 0.0
Signature: 496376acd5a9d9e26cd4c54078ea1f1b
Metadata:
  _ngt_=1388106860582
  Content-Type=application/xhtml+xml
  _pst_=success(1), lastModified=0
  _repr_=http://www.example.org/shirt14-15/

Segment: 20131230130048
Segment:
Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 30 13:02:20 UTC 2013
Modified time: Thu Jan 01 00:00:00 UTC 1970
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.0
Signature: null
Metadata:
  _ngt_=1388408342373
  Content-Type=text/html
  _pst_=temp_moved(13), lastModified=0: https://shop.example.org/tenue/wedstrijd-shirt-thuis-2014-2015.html
  _repr_=http://www.example.org/shirt14-15/

Segment: 20140106131930
Segment:
Version: 7
Status: 67 (linked)
Fetch time: Mon Jan 06 13:20:21 UTC 2014
Modified time: Thu Jan 01 00:00:00 UTC 1970
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.0
Signature: null
Metadata:
  _ngt_=1389014244859
  Content-Type=text/html
  _pst_=temp_moved(13), lastModified=0: https://shop.example.org/tenue/wedstrijd-shirt-thuis-2014-2015.html
  _repr_=http://www.example.org/shirt14-15/
{code}

This URL is indexed correctly by NUTCH-1113, NUTCH-1616 and the merged segment without LINKED and current trunk. Sebastian's patch does not index this document but does fix it for other records.
[jira] [Comment Edited] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866789#comment-13866789 ]

Markus Jelsma edited comment on NUTCH-1113 at 1/9/14 4:52 PM:
--------------------------------------------------------------

New patch! The previous patch had an error in the checks. With this patch, everything passes on trunk and NUTCH-1113. So now it seems I haven't been able to reproduce a problem in tests! :(