[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1113:
    Attachment: NUTCH-1113-trunk-junit-fail.patch

Also fixed a second problem in the JUnit test: any segment except the first one could randomly end up empty. We must ensure that at least one CrawlDatum (linked or fetch) is in each segment. With this patch the JUnit tests now pass.

Merging segments causes URLs to vanish from crawldb/index?
    Key: NUTCH-1113
    URL: https://issues.apache.org/jira/browse/NUTCH-1113
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.3
    Reporter: Edward Drapkin
    Assignee: Markus Jelsma
    Priority: Blocker
    Fix For: 1.8
    Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-trunk-junit-fail.patch, NUTCH-1113-trunk-junit-final.patch, NUTCH-1113-trunk.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt

When I run Nutch, I use the following steps:

    nutch inject crawldb/ url.txt

repeated 3 times:

    nutch generate crawldb/ segments/ -normalize
    nutch fetch `ls -d segments/* | tail -1`
    nutch parse `ls -d segments/* | tail -1`
    nutch update crawldb `ls -d segments/* | tail -1`

then:

    nutch mergesegs merged/ -dir segments/
    nutch invertlinks linkdb/ -dir merged/
    nutch index index/ crawldb/ linkdb/ -dir merged/

(I forward ported the lucene indexing code from Nutch 1.1).

When I crawl with merging segments, I lose about 20% of the URLs that wind up in the index vs. when I crawl without merging the segments. Somehow the segment merger causes me to lose ~20% of my crawl database!

--
This message was sent by Atlassian JIRA (v6.2#6252)
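The non-empty guarantee mentioned in this update can be illustrated with a small Python sketch. This is not the JUnit test from the patch: the function name `random_segments`, the status strings and the example URLs are all made up for illustration, and a segment is modelled as a plain dict of URL to status.

```python
import random

# Hypothetical status labels; the real test works with Nutch's CrawlDatum
# status constants rather than strings.
STATUSES = ["linked", "fetch_success", "fetch_retry", "fetch_notmodified"]


def random_segments(urls, n_segments, rng):
    """Build n_segments random segments, each a dict of url -> status.

    Every URL lands in a segment with probability 0.5, so a segment can
    come out empty by chance. When that happens one datum is forced in,
    mirroring the "at least one CrawlDatum per segment" requirement.
    """
    segments = []
    for _ in range(n_segments):
        segment = {u: rng.choice(STATUSES) for u in urls if rng.random() < 0.5}
        if not segment:
            # Guarantee: never emit an empty segment.
            segment[rng.choice(urls)] = rng.choice(STATUSES)
        segments.append(segment)
    return segments


# A seeded generator keeps a failing random sequence reproducible.
segments = random_segments(["http://a/", "http://b/"], 20, random.Random(42))
print(all(len(s) >= 1 for s in segments))
```

Passing the random generator in as a parameter is a deliberate choice here: a test that fails "at random" is much easier to debug when the seed can be pinned.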
[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1113:
    Attachment: (was: NUTCH-1113-trunk-junit-fail.patch)

Merging segments causes URLs to vanish from crawldb/index?
    Key: NUTCH-1113
    URL: https://issues.apache.org/jira/browse/NUTCH-1113
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.3
    Reporter: Edward Drapkin
    Assignee: Markus Jelsma
    Priority: Blocker
    Fix For: 1.8
    Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-trunk-junit-fail.patch, NUTCH-1113-trunk-junit-final.patch, NUTCH-1113-trunk.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1113:
    Attachment: NUTCH-1113-trunk-junit-fail.patch

Merging segments causes URLs to vanish from crawldb/index?
    Key: NUTCH-1113
    URL: https://issues.apache.org/jira/browse/NUTCH-1113
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.3
    Reporter: Edward Drapkin
    Assignee: Markus Jelsma
    Priority: Blocker
    Fix For: 1.8
    Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-trunk-junit-fail.patch, NUTCH-1113-trunk-junit-final.patch, NUTCH-1113-trunk.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1113:
    Attachment: NUTCH-1113-trunk-junit-final.patch

Final patch including the stuff mentioned by Sebastian and the junit test. I will commit shortly unless there are some final objections :)

Merging segments causes URLs to vanish from crawldb/index?
    Key: NUTCH-1113
    URL: https://issues.apache.org/jira/browse/NUTCH-1113
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.3
    Reporter: Edward Drapkin
    Priority: Blocker
    Fix For: 1.9
    Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-trunk-junit-final.patch, NUTCH-1113-trunk.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1113:
    Fix Version/s: (was: 1.9) 1.8

Merging segments causes URLs to vanish from crawldb/index?
    Key: NUTCH-1113
    URL: https://issues.apache.org/jira/browse/NUTCH-1113
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.3
    Reporter: Edward Drapkin
    Assignee: Markus Jelsma
    Priority: Blocker
    Fix For: 1.8
    Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-trunk-junit-final.patch, NUTCH-1113-trunk.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1113:
    Attachment: NUTCH-1113-trunk.patch

Includes STATUS_FETCH_NOTMODIFIED in the check. But are you sure that this is the problem? We also have a lot of NOT_MODIFIED records, so I think your indexer just skips NOT_MODIFIED and there should be no problem, right? Can you check?

Merging segments causes URLs to vanish from crawldb/index?
    Key: NUTCH-1113
    URL: https://issues.apache.org/jira/browse/NUTCH-1113
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.3
    Reporter: Edward Drapkin
    Priority: Blocker
    Fix For: 1.9
    Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-trunk.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1113:
    Attachment: NUTCH-1113-junit.patch

The attached patch seems to completely fix the issue, finally!

* does not merge LINKED status
* does not merge fetch_retry status
* considers latest fetch datum

Can anyone here confirm the result? To do so you must have a lot of segments, at least enough that the whole set contains a good number of URLs that have been refetched in the meantime. You need to index those segments in chronological order, segment by segment (do not pass them all to the indexer via -dir; that is still a bug). You should then also merge the segments with this patch and index the merged segment. The number of indexed documents should be the same.

Merging segments causes URLs to vanish from crawldb/index?
    Key: NUTCH-1113
    URL: https://issues.apache.org/jira/browse/NUTCH-1113
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.3
    Reporter: Edward Drapkin
    Priority: Blocker
    Fix For: 1.9
    Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
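The three rules listed in this update, and the check proposed for confirming them, can be modelled in a few lines of Python. This is a simplified sketch and not the SegmentMerger code from the patch: `Datum`, `pick_merged_datum` and `index_in_order` are hypothetical names, statuses are plain strings, and segment times are assumed to be unique.

```python
from collections import namedtuple

# Hypothetical, simplified stand-in for a CrawlDatum: only a status and the
# time of the segment it came from. Real CrawlDatum objects carry far more.
Datum = namedtuple("Datum", ["status", "segment_time"])

# Statuses that, per the update above, should not be merged.
NOT_MERGED = {"linked", "fetch_retry"}


def pick_merged_datum(datums):
    """Return the datum to keep for one URL, or None if nothing qualifies.

    Linked and fetch_retry datums are ignored; among the remaining fetch
    datums the one from the most recent segment wins.
    """
    candidates = [d for d in datums if d.status not in NOT_MERGED]
    if not candidates:
        return None
    return max(candidates, key=lambda d: d.segment_time)


def index_in_order(datums):
    """Model indexing segment by segment in chronological order:
    each later qualifying fetch datum overwrites the earlier one."""
    current = None
    for d in sorted(datums, key=lambda d: d.segment_time):
        if d.status not in NOT_MERGED:
            current = d
    return current


history = [
    Datum("fetch_success", 1),
    Datum("linked", 2),
    Datum("fetch_retry", 3),
]
# Under this model, merging and chronological indexing agree for the URL.
print(pick_merged_datum(history) == index_in_order(history))
```

In this model a later linked or fetch_retry datum can no longer shadow an earlier successful fetch, which is one way a URL could drop out of a merged segment while still being present in the unmerged ones.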
[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1113:
    Attachment: NUTCH-1113-junit.patch

Slightly updated patch. I have now merged and indexed a large number of segments with trunk, NUTCH-1616 and NUTCH-1113, and none of them is really correct. Both NUTCH-1113 and trunk give reasonable results but there are always a few records missing. So far I have been unable to reproduce it in a controlled environment or in unit tests, and the data I have is too much to look through.

Merging segments causes URLs to vanish from crawldb/index?
    Key: NUTCH-1113
    URL: https://issues.apache.org/jira/browse/NUTCH-1113
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.3
    Reporter: Edward Drapkin
    Priority: Blocker
    Fix For: 1.9
    Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1113:
    Attachment: NUTCH-1113-junit.patch

* extended Junit test to fail if both linked and fetch datum are contained in the same segment
* fix for this problem in SegmentMerger

Merging segments causes URLs to vanish from crawldb/index?
    Key: NUTCH-1113
    URL: https://issues.apache.org/jira/browse/NUTCH-1113
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.3
    Reporter: Edward Drapkin
    Priority: Blocker
    Fix For: 1.9
    Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1113:
    Attachment: NUTCH-1113-junit.patch

Alright, manual testing did not go very well; it takes hours and is too cumbersome, so I cooked up a unit test for these issues. It also includes a failed attempt to make SegmentMerger implement Tool, as well as commented-out versions of current trunk, NUTCH-1616 and NUTCH-1113 (single lines though).

There are two unit tests based on a randomized set of segments with a record with a random status. testRandomTestSequence() fails on current trunk but NOT with NUTCH-1113. testRandomTestSequenceWithRedirects() always fails! The latter injects redirects into the set of random records; this is the issue we must fix somehow. There may be a problem with how I inject those redirects, but I think I got it right.

If there's someone here able or willing to help out then I'd be really happy. This issue has haunted Nutch from the beginning and must be dealt with! Preferably before we release 1.8!

Thanks, Markus

Merging segments causes URLs to vanish from crawldb/index?
    Key: NUTCH-1113
    URL: https://issues.apache.org/jira/browse/NUTCH-1113
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.3
    Reporter: Edward Drapkin
    Fix For: 1.9
    Attachments: NUTCH-1113-junit.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1113:
    Priority: Blocker (was: Major)

Merging segments causes URLs to vanish from crawldb/index?
    Key: NUTCH-1113
    URL: https://issues.apache.org/jira/browse/NUTCH-1113
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.3
    Reporter: Edward Drapkin
    Priority: Blocker
    Fix For: 1.9
    Attachments: NUTCH-1113-junit.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1113:
    Attachment: NUTCH-1113-junit.patch

New patch that actually works for Apache Nutch current trunk. This does not include the futile attempt to implement Tool, only the commented out lines of trunk, NUTCH-1113 and NUTCH-1616.

Merging segments causes URLs to vanish from crawldb/index?
    Key: NUTCH-1113
    URL: https://issues.apache.org/jira/browse/NUTCH-1113
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.3
    Reporter: Edward Drapkin
    Priority: Blocker
    Fix For: 1.9
    Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1113:
    Attachment: NUTCH-1113-junit.patch

New patch! The previous patch had an error in the checks. With this patch, everything passes on trunk and NUTCH-1113. So no, it seems I haven't been able to reproduce the problem in tests! :(

Merging segments causes URLs to vanish from crawldb/index?
    Key: NUTCH-1113
    URL: https://issues.apache.org/jira/browse/NUTCH-1113
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.3
    Reporter: Edward Drapkin
    Priority: Blocker
    Fix For: 1.9
    Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1113:
    Attachment: NUTCH-1113-trunk.patch

Patch for trunk with Edward's fix. That fix at least solves a problem I introduced in our own build with NUTCH-1616. Sebastian, can you please shed some additional light here?

Regarding your latest comment 2: IndexerMapReduce only optionally skips NOT_MODIFIED records. When doing a full reindex you of course disable that feature.

Merging segments causes URLs to vanish from crawldb/index?
    Key: NUTCH-1113
    URL: https://issues.apache.org/jira/browse/NUTCH-1113
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.3
    Reporter: Edward Drapkin
    Fix For: 1.9
    Attachments: NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
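The point about optionally skipping NOT_MODIFIED records can be sketched as follows. This is only a model of the behaviour described in the comment, not the IndexerMapReduce code; `should_index` and its `skip_not_modified` parameter are hypothetical names, and statuses are plain strings.

```python
def should_index(status, skip_not_modified):
    """Decide whether a fetch datum is passed on to the index.

    skip_not_modified is a hypothetical flag standing in for the optional
    behaviour described above; it is not a real Nutch setting name.
    """
    if status == "fetch_notmodified":
        # Incremental run: skip unchanged pages. Full reindex: keep them.
        return not skip_not_modified
    return status == "fetch_success"


statuses = ["fetch_success", "fetch_notmodified", "fetch_notmodified", "linked"]
incremental = sum(should_index(s, skip_not_modified=True) for s in statuses)
full_reindex = sum(should_index(s, skip_not_modified=False) for s in statuses)
print(incremental, full_reindex)
```

The two counts differ whenever NOT_MODIFIED records are present, so comparing document counts between a merged and an unmerged run only makes sense when both runs use the same setting.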
[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1113:
    Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6

Merging segments causes URLs to vanish from crawldb/index?
    Key: NUTCH-1113
    URL: https://issues.apache.org/jira/browse/NUTCH-1113
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.3
    Reporter: Edward Drapkin
    Fix For: 1.6
    Attachments: merged_segment_output.txt, unmerged_segment_output.txt
[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1113:
    Fix Version/s: (was: 1.4) 1.5

Merging segments causes URLs to vanish from crawldb/index?
    Key: NUTCH-1113
    URL: https://issues.apache.org/jira/browse/NUTCH-1113
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.3
    Reporter: Edward Drapkin
    Fix For: 1.5
    Attachments: merged_segment_output.txt, unmerged_segment_output.txt
[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Drapkin updated NUTCH-1113:
    Attachment: merged_segment_output.txt unmerged_segment_output.txt

Output for segreader -get for a URL that disappears after merging segments.

Merging segments causes URLs to vanish from crawldb/index?
    Key: NUTCH-1113
    URL: https://issues.apache.org/jira/browse/NUTCH-1113
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.3
    Reporter: Edward Drapkin
    Attachments: merged_segment_output.txt, unmerged_segment_output.txt
[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1113:
    Fix Version/s: 1.4

Thanks! It's marked for 1.4 now so that it, at least, doesn't slip off the radar. Can you provide a patch or debug report? That would be helpful. Thanks

Merging segments causes URLs to vanish from crawldb/index?
    Key: NUTCH-1113
    URL: https://issues.apache.org/jira/browse/NUTCH-1113
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.3
    Reporter: Edward Drapkin
    Fix For: 1.4
    Attachments: merged_segment_output.txt, unmerged_segment_output.txt