[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-03-06 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1113:
---

Attachment: NUTCH-1113-trunk-junit-fail.patch

Also fixed a second problem in the JUnit test: any segment except the first 
one could randomly end up empty. We must ensure that at least one CrawlDatum 
(linked or fetch) is in each segment.
With this patch the JUnit tests now pass.
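
The non-empty guarantee described above can be sketched roughly as follows. 
This is only an illustration under assumptions, not the code from the attached 
patch: the class RandomSegmentSketch, the enum Kind and the method 
randomSegment() are hypothetical names, and a real test would build CrawlDatum 
records rather than enum values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch; not taken from the NUTCH-1113 JUnit test. */
final class RandomSegmentSketch {

    /** Stand-in for the two kinds of datum mentioned in the comment. */
    enum Kind { LINKED, FETCH }

    /** Builds a random segment that always holds at least one datum. */
    static List<Kind> randomSegment(Random rnd) {
        List<Kind> segment = new ArrayList<>();
        // Each kind is included with probability 1/2, so both may be skipped.
        if (rnd.nextBoolean()) {
            segment.add(Kind.LINKED);
        }
        if (rnd.nextBoolean()) {
            segment.add(Kind.FETCH);
        }
        // Guard against the empty case: force one datum in.
        if (segment.isEmpty()) {
            segment.add(rnd.nextBoolean() ? Kind.LINKED : Kind.FETCH);
        }
        return segment;
    }
}
```

Without the final guard, roughly one segment in four would come out empty in 
this sketch, which is the kind of random failure the comment describes.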

 Merging segments causes URLs to vanish from crawldb/index?
 --

 Key: NUTCH-1113
 URL: https://issues.apache.org/jira/browse/NUTCH-1113
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.3
Reporter: Edward Drapkin
Assignee: Markus Jelsma
Priority: Blocker
 Fix For: 1.8

 Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
 NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
 NUTCH-1113-junit.patch, NUTCH-1113-trunk-junit-fail.patch, 
 NUTCH-1113-trunk-junit-final.patch, NUTCH-1113-trunk.patch, 
 NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt


 When I run Nutch, I use the following steps:
 nutch inject crawldb/ url.txt
 repeated 3 times:
 nutch generate crawldb/ segments/ -normalize
 nutch fetch `ls -d segments/* | tail -1`
 nutch parse `ls -d segments/* | tail -1`
 nutch update crawldb `ls -d segments/* | tail -1`
 nutch mergesegs merged/ -dir segments/
 nutch invertlinks linkdb/ -dir merged/
 nutch index index/ crawldb/ linkdb/ -dir merged/ (I forward ported the lucene 
 indexing code from Nutch 1.1).
 When I crawl with merging segments, I lose about 20% of the URLs that wind up 
 in the index vs. when I crawl without merging the segments.  Somehow the 
 segment merger causes me to lose ~20% of my crawl database!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-03-06 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1113:
---

Attachment: (was: NUTCH-1113-trunk-junit-fail.patch)

 Merging segments causes URLs to vanish from crawldb/index?
 --

 Key: NUTCH-1113
 URL: https://issues.apache.org/jira/browse/NUTCH-1113
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.3
Reporter: Edward Drapkin
Assignee: Markus Jelsma
Priority: Blocker
 Fix For: 1.8

 Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
 NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
 NUTCH-1113-junit.patch, NUTCH-1113-trunk-junit-fail.patch, 
 NUTCH-1113-trunk-junit-final.patch, NUTCH-1113-trunk.patch, 
 NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt




[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-03-01 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1113:
---

Attachment: NUTCH-1113-trunk-junit-fail.patch

 Merging segments causes URLs to vanish from crawldb/index?
 --

 Key: NUTCH-1113
 URL: https://issues.apache.org/jira/browse/NUTCH-1113
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.3
Reporter: Edward Drapkin
Assignee: Markus Jelsma
Priority: Blocker
 Fix For: 1.8

 Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
 NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
 NUTCH-1113-junit.patch, NUTCH-1113-trunk-junit-fail.patch, 
 NUTCH-1113-trunk-junit-final.patch, NUTCH-1113-trunk.patch, 
 NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt




[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-02-28 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1113:
-

Attachment: NUTCH-1113-trunk-junit-final.patch

Final patch, including the changes Sebastian mentioned and the JUnit test. I 
will commit shortly unless there are any final objections :)

 Merging segments causes URLs to vanish from crawldb/index?
 --

 Key: NUTCH-1113
 URL: https://issues.apache.org/jira/browse/NUTCH-1113
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.3
Reporter: Edward Drapkin
Priority: Blocker
 Fix For: 1.9

 Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
 NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
 NUTCH-1113-junit.patch, NUTCH-1113-trunk-junit-final.patch, 
 NUTCH-1113-trunk.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, 
 unmerged_segment_output.txt




[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-02-28 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1113:
-

Fix Version/s: (was: 1.9)
   1.8

 Merging segments causes URLs to vanish from crawldb/index?
 --

 Key: NUTCH-1113
 URL: https://issues.apache.org/jira/browse/NUTCH-1113
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.3
Reporter: Edward Drapkin
Assignee: Markus Jelsma
Priority: Blocker
 Fix For: 1.8

 Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
 NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
 NUTCH-1113-junit.patch, NUTCH-1113-trunk-junit-final.patch, 
 NUTCH-1113-trunk.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, 
 unmerged_segment_output.txt




[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-02-21 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1113:
-

Attachment: NUTCH-1113-trunk.patch

This patch includes STATUS_FETCH_NOTMODIFIED in the check. But are you sure 
that this is the problem? We also have a lot of NOT_MODIFIED records, so I 
think your indexer just skips NOT_MODIFIED and there should be no problem, 
right? Can you check?

 Merging segments causes URLs to vanish from crawldb/index?
 --

 Key: NUTCH-1113
 URL: https://issues.apache.org/jira/browse/NUTCH-1113
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.3
Reporter: Edward Drapkin
Priority: Blocker
 Fix For: 1.9

 Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
 NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
 NUTCH-1113-junit.patch, NUTCH-1113-trunk.patch, NUTCH-1113-trunk.patch, 
 merged_segment_output.txt, unmerged_segment_output.txt




[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-01-22 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1113:
-

Attachment: NUTCH-1113-junit.patch

The attached patch finally seems to fix the issue completely! It:
* does not merge LINKED status
* does not merge fetch_retry status
* considers latest fetch datum
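
The three rules listed above can be sketched as a small selection function. 
This is a minimal illustration under assumptions, not the code in the attached 
patch and not Nutch's SegmentMerger: the class MergeRuleSketch, the Status enum 
and the Datum type are hypothetical stand-ins for CrawlDatum and its status 
constants.

```java
import java.util.List;

/** Hypothetical sketch of the merge rules; not Nutch's SegmentMerger. */
final class MergeRuleSketch {

    /** Stand-in for a few CrawlDatum statuses named in this thread. */
    enum Status { LINKED, FETCH_RETRY, FETCH_SUCCESS, FETCH_NOTMODIFIED }

    /** Stand-in for one datum of a URL, stamped with its fetch time. */
    static final class Datum {
        final Status status;
        final long fetchTime;

        Datum(Status status, long fetchTime) {
            this.status = status;
            this.fetchTime = fetchTime;
        }
    }

    /**
     * Picks the datum to keep for one URL across all segments:
     * LINKED and FETCH_RETRY entries are ignored, and among the
     * remaining entries the one with the latest fetch time wins.
     * Returns null if no entry qualifies.
     */
    static Datum selectForMerge(List<Datum> candidates) {
        Datum latest = null;
        for (Datum d : candidates) {
            if (d.status == Status.LINKED || d.status == Status.FETCH_RETRY) {
                continue; // never let these replace a real fetch datum
            }
            if (latest == null || d.fetchTime > latest.fetchTime) {
                latest = d;
            }
        }
        return latest;
    }
}
```

In this sketch a newer LINKED or FETCH_RETRY entry cannot shadow an older 
successful fetch, which is one way a URL could otherwise drop out of the 
merged segment.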

Can anyone here confirm the result? To do so you need a lot of segments, at 
least enough that the whole set contains a good number of URLs that have 
been refetched in the meantime. You need to index those segments in 
chronological order, segment by segment (do not pass them all to the indexer 
via -dir, that is still a bug). You should then also merge the segments with 
this patch and index the merged segment.

The number of indexed documents should be the same.
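
If the two counts differ, it helps to know which URLs are missing. A minimal 
sketch of that comparison, assuming the indexed URLs of both runs have already 
been exported into two sets (how they are exported is not covered here, and 
the class and method names are hypothetical):

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for comparing two index runs. */
final class IndexDiffSketch {

    /**
     * Returns the URLs that were indexed from the unmerged segments
     * but are absent after indexing the merged segment, sorted.
     */
    static Set<String> missingAfterMerge(Set<String> unmerged, Set<String> merged) {
        Set<String> missing = new TreeSet<>(unmerged);
        missing.removeAll(merged); // plain set difference
        return missing;
    }
}
```

An empty result means every URL from the segment-by-segment run also appears 
in the merged run.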

 Merging segments causes URLs to vanish from crawldb/index?
 --

 Key: NUTCH-1113
 URL: https://issues.apache.org/jira/browse/NUTCH-1113
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.3
Reporter: Edward Drapkin
Priority: Blocker
 Fix For: 1.9

 Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
 NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
 NUTCH-1113-junit.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, 
 unmerged_segment_output.txt




[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-01-10 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1113:
-

Attachment: NUTCH-1113-junit.patch

Slightly updated patch. I have now merged and indexed a large number of 
segments with trunk, NUTCH-1616 and NUTCH-1113, and none of the results is 
really correct. Both NUTCH-1113 and trunk give reasonable results, but there 
are always a few records missing. 

So far I have been unable to reproduce it in a controlled environment or in 
unit tests, and the data I have is too much to look through.

 Merging segments causes URLs to vanish from crawldb/index?
 --

 Key: NUTCH-1113
 URL: https://issues.apache.org/jira/browse/NUTCH-1113
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.3
Reporter: Edward Drapkin
Priority: Blocker
 Fix For: 1.9

 Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
 NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-trunk.patch, 
 merged_segment_output.txt, unmerged_segment_output.txt




[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-01-10 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1113:
---

Attachment: NUTCH-1113-junit.patch

* extended the JUnit test to fail if both a linked and a fetch datum are 
contained in the same segment
* fix for this problem in SegmentMerger

 Merging segments causes URLs to vanish from crawldb/index?
 --

 Key: NUTCH-1113
 URL: https://issues.apache.org/jira/browse/NUTCH-1113
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.3
Reporter: Edward Drapkin
Priority: Blocker
 Fix For: 1.9

 Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
 NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
 NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt




[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-01-09 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1113:
-

Attachment: NUTCH-1113-junit.patch

Alright, manual testing did not go very well; it takes hours and is too 
cumbersome, so I cooked up a unit test for these issues. The patch also 
includes a failed attempt to make SegmentMerger implement Tool, as well as 
commented-out versions of current trunk, NUTCH-1616 and NUTCH-1113 (single 
lines though).

There are two unit tests based on a randomized set of segments containing a 
record with a random status. testRandomTestSequence() fails on current trunk 
but NOT with NUTCH-1113. testRandomTestSequenceWithRedirects() always fails! 
The latter injects redirects into the set of random records; this is the issue 
we must fix somehow.

There may be a problem with how I inject those redirects, but I think I got it 
right. If there's someone here able or willing to help out, I'd be really 
happy. This issue has haunted Nutch from the beginning and must be dealt with! 
Preferably before we release 1.8!

Thanks,
Markus

 Merging segments causes URLs to vanish from crawldb/index?
 --

 Key: NUTCH-1113
 URL: https://issues.apache.org/jira/browse/NUTCH-1113
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.3
Reporter: Edward Drapkin
 Fix For: 1.9

 Attachments: NUTCH-1113-junit.patch, NUTCH-1113-trunk.patch, 
 merged_segment_output.txt, unmerged_segment_output.txt




[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-01-09 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1113:
-

Priority: Blocker  (was: Major)

 Merging segments causes URLs to vanish from crawldb/index?
 --

 Key: NUTCH-1113
 URL: https://issues.apache.org/jira/browse/NUTCH-1113
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.3
Reporter: Edward Drapkin
Priority: Blocker
 Fix For: 1.9

 Attachments: NUTCH-1113-junit.patch, NUTCH-1113-trunk.patch, 
 merged_segment_output.txt, unmerged_segment_output.txt




[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-01-09 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1113:
-

Attachment: NUTCH-1113-junit.patch

New patch that actually works with the current Apache Nutch trunk. It does not 
include the futile attempt to implement Tool, only the commented-out lines of 
trunk, NUTCH-1113 and NUTCH-1616.



 Merging segments causes URLs to vanish from crawldb/index?
 --

 Key: NUTCH-1113
 URL: https://issues.apache.org/jira/browse/NUTCH-1113
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.3
Reporter: Edward Drapkin
Priority: Blocker
 Fix For: 1.9

 Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
 NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt




[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-01-09 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1113:
-

Attachment: NUTCH-1113-junit.patch

New patch! The previous patch had an error in the checks. With this patch, 
everything passes on trunk and NUTCH-1113.

So no, it seems I haven't been able to reproduce the problem in tests! :(

 Merging segments causes URLs to vanish from crawldb/index?
 --

 Key: NUTCH-1113
 URL: https://issues.apache.org/jira/browse/NUTCH-1113
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.3
Reporter: Edward Drapkin
Priority: Blocker
 Fix For: 1.9

 Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
 NUTCH-1113-junit.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, 
 unmerged_segment_output.txt




[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-01-08 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1113:
-

Attachment: NUTCH-1113-trunk.patch

Patch for trunk with Edward's fix. That fix at least solves a problem I 
introduced in our own build with NUTCH-1616. Sebastian, can you please shed 
some additional light here?

Regarding your latest comment 2: IndexerMapReduce only optionally skips 
NOT_MODIFIED records. When doing a full reindex you of course disable that 
feature.

 Merging segments causes URLs to vanish from crawldb/index?
 --

 Key: NUTCH-1113
 URL: https://issues.apache.org/jira/browse/NUTCH-1113
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.3
Reporter: Edward Drapkin
 Fix For: 1.9

 Attachments: NUTCH-1113-trunk.patch, merged_segment_output.txt, 
 unmerged_segment_output.txt




[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1113:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Merging segments causes URLs to vanish from crawldb/index?
 --

 Key: NUTCH-1113
 URL: https://issues.apache.org/jira/browse/NUTCH-1113
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.3
Reporter: Edward Drapkin
 Fix For: 1.6

 Attachments: merged_segment_output.txt, unmerged_segment_output.txt



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2011-09-28 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1113:
-

Fix Version/s: (was: 1.4)
   1.5

 Merging segments causes URLs to vanish from crawldb/index?
 --

 Key: NUTCH-1113
 URL: https://issues.apache.org/jira/browse/NUTCH-1113
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.3
Reporter: Edward Drapkin
 Fix For: 1.5

 Attachments: merged_segment_output.txt, unmerged_segment_output.txt






[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2011-09-15 Thread Edward Drapkin (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edward Drapkin updated NUTCH-1113:
--

Attachment: merged_segment_output.txt
unmerged_segment_output.txt

Output for segreader -get for a URL that disappears after merging segments.

 Merging segments causes URLs to vanish from crawldb/index?
 --

 Key: NUTCH-1113
 URL: https://issues.apache.org/jira/browse/NUTCH-1113
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.3
Reporter: Edward Drapkin
 Attachments: merged_segment_output.txt, unmerged_segment_output.txt






[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2011-09-15 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1113:
-

Fix Version/s: 1.4

Thanks! It's marked for 1.4 now so that, at least, it doesn't slip off the 
radar. Can you provide a patch or debug report? That would be helpful.

Thanks

 Merging segments causes URLs to vanish from crawldb/index?
 --

 Key: NUTCH-1113
 URL: https://issues.apache.org/jira/browse/NUTCH-1113
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.3
Reporter: Edward Drapkin
 Fix For: 1.4

 Attachments: merged_segment_output.txt, unmerged_segment_output.txt

