[jira] [Comment Edited] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-02-28 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908265#comment-13908265
 ] 

Sebastian Nagel edited comment on NUTCH-1113 at 2/28/14 2:45 PM:
-

Hi [~markus17], your patch should work (I've tested it exactly the same way).

The indexer was run with {{indexer.skip.notmodified == false}}. The problem is
that the fetch_success datums have been lost in the merged segment, so the
following check skipped these URLs:
{code}
if (!parseData.getStatus().isSuccess() ||
    fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
  return;
}
{code}
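To illustrate why a lost fetch_success datum makes a URL disappear, here is a minimal, self-contained sketch of that condition. {{SkipCheckSketch}}, {{shouldIndex}} and the two constants are hypothetical stand-ins, not Nutch classes; the numeric values are the status codes printed in the datum dumps in this thread (33 for fetch_success, 67 for linked).

```java
public class SkipCheckSketch {
  // Hypothetical stand-ins for CrawlDatum status codes (values as printed
  // in the datum dumps: 33 = fetch_success, 67 = linked).
  public static final int STATUS_FETCH_SUCCESS = 33;
  public static final int STATUS_LINKED = 67;

  // Mirrors the quoted condition: a document is indexed only if parsing
  // succeeded AND the chosen fetch datum has status fetch_success.
  public static boolean shouldIndex(boolean parseSuccess, int fetchStatus) {
    if (!parseSuccess || fetchStatus != STATUS_FETCH_SUCCESS) {
      return false; // corresponds to the early "return" in the quoted code
    }
    return true;
  }

  public static void main(String[] args) {
    // The fetch_success datum survived: the document is indexed.
    System.out.println(shouldIndex(true, STATUS_FETCH_SUCCESS));
    // Only a linked datum is left: the document is skipped.
    System.out.println(shouldIndex(true, STATUS_LINKED));
  }
}
```

With these assumptions, a successfully parsed page is still skipped as soon as the datum that reaches the check is anything other than fetch_success.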
Just to clarify that we use the same test set-up:
# start with an empty index
# index (case A) segments in chronological order or (case B) merged segment
# compare both indexes

The CrawlDb was updated with URLs from all segments. The same CrawlDb is used 
for all index runs, right?

I plan to run the test with {{indexer.skip.notmodified == false}}. Otherwise, 
the index will not contain any pages with status notmodified.
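
For the "compare both indexes" step, one possible approach is to export the set of indexed URLs from each index and diff the two sets. The sketch below only shows the diff; {{IndexDiffSketch}}, {{missingFrom}} and the example URLs are hypothetical, and how the URL lists are exported from the index is left open.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class IndexDiffSketch {
  // Returns the URLs present in `reference` but absent from `candidate`.
  public static Set<String> missingFrom(Set<String> reference, Set<String> candidate) {
    Set<String> missing = new TreeSet<>(reference);
    missing.removeAll(candidate);
    return missing;
  }

  public static void main(String[] args) {
    // Hypothetical URL sets; in practice they would be exported from each index.
    Set<String> caseA = new TreeSet<>(Arrays.asList(
        "http://www.example.org/a",
        "http://www.example.org/b",
        "http://www.example.org/c"));
    Set<String> caseB = new TreeSet<>(Arrays.asList(
        "http://www.example.org/a",
        "http://www.example.org/c"));

    // URLs indexed from the individual segments but lost after merging.
    System.out.println("Only in A: " + missingFrom(caseA, caseB));
    // URLs that appear only when indexing the merged segment.
    System.out.println("Only in B: " + missingFrom(caseB, caseA));
  }
}
```

If both differences are empty, case A and case B produced the same set of documents.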


> Merging segments causes URLs to vanish from crawldb/index?
> --
>
> Key: NUTCH-1113
> URL: https://issues.apache.org/jira/browse/NUTCH-1113
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.3
> Reporter: Edward Drapkin
> Priority: Blocker
> Fix For: 1.9
>
> Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
> NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
> NUTCH-1113-junit.patch, NUTCH-1113-trunk.patch, NUTCH-1113-trunk.patch, 
> merged_segment_output.txt, unmerged_segment_output.txt
>
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
> nutch generate crawldb/ segments/ -normalize
> nutch fetch `ls -d segments/* | tail -1`
> nutch parse `ls -d segments/* | tail -1`
> nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/ (I forward ported the lucene 
> indexing code from Nutch 1.1).
> When I crawl with merging segments, I lose about 20% of the URLs that wind up 
> in the index vs. when I crawl without merging the segments.  Somehow the 
> segment merger causes me to lose ~20% of my crawl database!



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Comment Edited] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-01-19 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876036#comment-13876036
 ] 

Sebastian Nagel edited comment on NUTCH-1113 at 1/19/14 11:36 PM:
--

When (re)indexing multiple segments, IndexerMapReduce should also take care of
the ordering of values: newer ones must overwrite older ones. That's not done
explicitly; it's just implicitly assumed that later values are newer. If
indexing is done segment by segment (in chronological order), this should be no
problem. See NUTCH-1617.
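
To make the difference between implicit and explicit ordering concrete, here is a small sketch. It is not IndexerMapReduce code: {{NewestWinsSketch}}, {{Datum}}, {{lastSeen}} and {{newest}} are hypothetical names, and using the fetch time as the measure of "newer" is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class NewestWinsSketch {
  // Hypothetical stand-in for a value that reaches the reducer.
  public static final class Datum {
    public final String segment;
    public final long fetchTime; // assumed measure of "newer"

    public Datum(String segment, long fetchTime) {
      this.segment = segment;
      this.fetchTime = fetchTime;
    }
  }

  // Implicit ordering: whichever value arrives last wins.
  public static Datum lastSeen(List<Datum> values) {
    Datum chosen = null;
    for (Datum d : values) {
      chosen = d;
    }
    return chosen;
  }

  // Explicit ordering: the value with the largest fetch time wins,
  // regardless of the order in which values arrive.
  public static Datum newest(List<Datum> values) {
    Datum chosen = null;
    for (Datum d : values) {
      if (chosen == null || d.fetchTime > chosen.fetchTime) {
        chosen = d;
      }
    }
    return chosen;
  }

  public static void main(String[] args) {
    // The newer value happens to arrive first.
    List<Datum> values = Arrays.asList(
        new Datum("seg-new", 2000L),
        new Datum("seg-old", 1000L));
    System.out.println("last seen: " + lastSeen(values).segment);
    System.out.println("newest:    " + newest(values).segment);
  }
}
```

When values arrive in chronological order both methods agree, which matches the observation that indexing segment by segment is unaffected.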


> Merging segments causes URLs to vanish from crawldb/index?
> --
>
> Key: NUTCH-1113
> URL: https://issues.apache.org/jira/browse/NUTCH-1113
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.3
> Reporter: Edward Drapkin
> Priority: Blocker
> Fix For: 1.9
>
> Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
> NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
> NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt





[jira] [Comment Edited] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-01-13 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869439#comment-13869439
 ] 

Markus Jelsma edited comment on NUTCH-1113 at 1/13/14 11:54 AM:


Sebastian's patch does indeed solve a few problems, but I still see issues. I
have another Solr cluster loaded with merged data; that segment does not
contain LINKED at all. I found a URL that is indexed on that cluster but not on
the cluster I loaded using Sebastian's patch. I've tracked the URL down and
found several segments. These are all the CrawlDatums I have for that URL, in
order:

{code}
Segment: 20131223125654
Segment: Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 23 12:57:33 UTC 2013
Modified time: Thu Jan 01 00:00:00 UTC 1970
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.0
Signature: null
Metadata: 
_ngt_=1387803327158
Content-Type=text/html
_pst_=temp_moved(13), lastModified=0: 
https://shop.example.org/tenue/wedstrijd-shirt-thuis-2014-2015.html
_repr_=http://www.example.org/shirt14-15/

Segment: 20131227011558
Segment: Version: 7
Status: 33 (fetch_success)
Fetch time: Fri Dec 27 01:19:38 UTC 2013
Modified time: Mon Dec 16 13:12:43 UTC 2013
Retries since fetch: 0
Retry interval: 907200 seconds (10 days)
Score: 0.0
Signature: 496376acd5a9d9e26cd4c54078ea1f1b
Metadata: 
_ngt_=1388106860582
Content-Type=application/xhtml+xml
_pst_=success(1), lastModified=0
_repr_=http://www.example.org/shirt14-15/

Segment: 20131230130048
Segment: Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 30 13:02:20 UTC 2013
Modified time: Thu Jan 01 00:00:00 UTC 1970
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.0
Signature: null
Metadata: 
_ngt_=1388408342373
Content-Type=text/html
_pst_=temp_moved(13), lastModified=0: 
https://shop.example.org/tenue/wedstrijd-shirt-thuis-2014-2015.html
_repr_=http://www.example.org/shirt14-15/

Segment: 20140106131930
Segment: Version: 7
Status: 67 (linked)
Fetch time: Mon Jan 06 13:20:21 UTC 2014
Modified time: Thu Jan 01 00:00:00 UTC 1970
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.0
Signature: null
Metadata: 
_ngt_=1389014244859
Content-Type=text/html
_pst_=temp_moved(13), lastModified=0: 
https://shop.example.org/tenue/wedstrijd-shirt-thuis-2014-2015.html
_repr_=http://www.example.org/shirt14-15/
{code}

This URL is indexed correctly with NUTCH-1113, with NUTCH-1616, with the merged
segment without LINKED, and with current trunk. Sebastian's patch does not index
this document, but it does fix the problem for other records.
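
The sketch below replays the four datums listed above (segment names and status codes copied from the listing) and shows how the outcome depends on which datum ends up being treated as the fetch datum. It is only an illustration and not a diagnosis of any of the patches; {{DatumReplaySketch}}, {{lastOfAnyStatus}} and {{lastIgnoringLinked}} are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

public class DatumReplaySketch {
  // Status codes as printed in the listing above.
  public static final int FETCH_SUCCESS = 33;
  public static final int LINKED = 67;

  // Hypothetical stand-in holding only the fields needed here.
  public static final class Datum {
    public final String segment;
    public final int status;

    public Datum(String segment, int status) {
      this.segment = segment;
      this.status = status;
    }
  }

  // The four datums from the listing, in the same order.
  public static List<Datum> listed() {
    return Arrays.asList(
        new Datum("20131223125654", LINKED),
        new Datum("20131227011558", FETCH_SUCCESS),
        new Datum("20131230130048", LINKED),
        new Datum("20140106131930", LINKED));
  }

  // Strategy 1: the last datum wins, whatever its status.
  public static Datum lastOfAnyStatus(List<Datum> in) {
    return in.isEmpty() ? null : in.get(in.size() - 1);
  }

  // Strategy 2: linked datums are ignored; the last remaining datum wins.
  public static Datum lastIgnoringLinked(List<Datum> in) {
    Datum chosen = null;
    for (Datum d : in) {
      if (d.status != LINKED) {
        chosen = d;
      }
    }
    return chosen;
  }

  public static void main(String[] args) {
    Datum a = lastOfAnyStatus(listed());
    Datum b = lastIgnoringLinked(listed());
    System.out.println("any status:      " + a.segment + " status " + a.status);
    System.out.println("ignoring linked: " + b.segment + " status " + b.status);
  }
}
```

Under the first strategy the chosen datum is a linked one, which a fetch_success check would reject; under the second, the fetch_success datum from segment 20131227011558 is kept.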



[jira] [Comment Edited] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-01-09 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866789#comment-13866789
 ] 

Markus Jelsma edited comment on NUTCH-1113 at 1/9/14 4:52 PM:
--

New patch! Previous patch had an error in the checks. With this patch, 
everything passes on trunk and NUTCH-1113.

So, now it seems I haven't been able to reproduce the problem in tests!  :(


> Merging segments causes URLs to vanish from crawldb/index?
> --
>
> Key: NUTCH-1113
> URL: https://issues.apache.org/jira/browse/NUTCH-1113
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.3
> Reporter: Edward Drapkin
> Priority: Blocker
> Fix For: 1.9
>
> Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
> NUTCH-1113-junit.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, 
> unmerged_segment_output.txt


