[jira] Commented: (NUTCH-864) Fetcher generates entries with status 0

2010-10-01 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916840#action_12916840
 ] 

Doğacan Güney commented on NUTCH-864:
-

I don't think that's possible to do without doing a DataStore#get first, as we 
do not want to override the current status on the URL. I guess we could write 
the redirect status somewhere as a temporary status, but that would be too 
complex IMHO.

Julien, any ideas on how to set a redirect status without overwriting the 
current one?
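One way to read the "get first" idea is a read-modify-write against the store: look up the current status and only write the redirect status when nothing meaningful is there yet. The sketch below is hypothetical (it simulates the DataStore with a plain map rather than using the real Gora API, and the status codes are only illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the actual Nutch/Gora API: a fetcher that wants
// to record a redirect without clobbering an existing status does a
// get-first read-modify-write against the store.
public class StatusPreservingWriter {
    // Simulated DataStore: url -> status code (0 means "no row / unset").
    private final Map<String, Integer> store = new HashMap<>();

    public static final int STATUS_UNSET = 0;
    public static final int STATUS_FETCHED = 2;
    public static final int STATUS_REDIR_TEMP = 4;

    public void put(String url, int status) { store.put(url, status); }

    public int get(String url) { return store.getOrDefault(url, STATUS_UNSET); }

    // Write the redirect status only if no meaningful status exists yet;
    // otherwise keep the current one (the DataStore#get-first approach).
    public int recordRedirect(String url) {
        int current = get(url);              // the DataStore#get step
        if (current == STATUS_UNSET) {
            put(url, STATUS_REDIR_TEMP);     // safe to set
        }
        return get(url);
    }
}
```

The extra get per redirect is exactly the cost the comment above is worried about; the sketch only makes the trade-off concrete.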

 Fetcher generates entries with status 0
 ---

 Key: NUTCH-864
 URL: https://issues.apache.org/jira/browse/NUTCH-864
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
 Environment: Gora with SQLBackend
 URL: https://svn.apache.org/repos/asf/nutch/branches/nutchbase
 Last Changed Rev: 980748
 Last Changed Date: 2010-07-30 14:19:52 +0200 (Fri, 30 Jul 2010)
Reporter: Julien Nioche
Assignee: Doğacan Güney
 Fix For: 2.0


 After a round of fetching which got the following protocol status:
 10/07/30 15:11:39 INFO mapred.JobClient: ACCESS_DENIED=2
 10/07/30 15:11:39 INFO mapred.JobClient: SUCCESS=1177
 10/07/30 15:11:39 INFO mapred.JobClient: GONE=3
 10/07/30 15:11:39 INFO mapred.JobClient: TEMP_MOVED=138
 10/07/30 15:11:39 INFO mapred.JobClient: EXCEPTION=93
 10/07/30 15:11:39 INFO mapred.JobClient: MOVED=521
 10/07/30 15:11:39 INFO mapred.JobClient: NOTFOUND=62
 I ran: ./nutch org.apache.nutch.crawl.WebTableReader -stats
 10/07/30 15:12:37 INFO crawl.WebTableReader: Statistics for WebTable: 
 10/07/30 15:12:37 INFO crawl.WebTableReader: TOTAL urls:  2690
 10/07/30 15:12:37 INFO crawl.WebTableReader: retry 0: 2690
 10/07/30 15:12:37 INFO crawl.WebTableReader: min score:   0.0
 10/07/30 15:12:37 INFO crawl.WebTableReader: avg score:   0.7587361
 10/07/30 15:12:37 INFO crawl.WebTableReader: max score:   1.0
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 0 (null): 649
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 2 (status_fetched):   
 1177 (SUCCESS=1177)
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 3 (status_gone):  112 
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 34 (status_retry):
 93 (EXCEPTION=93)
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 4 (status_redir_temp):
 138  (TEMP_MOVED=138)
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 5 (status_redir_perm):
 521 (MOVED=521)
 10/07/30 15:12:37 INFO crawl.WebTableReader: WebTable statistics: done
 There should not be any entries with status 0 (null)
 I will investigate a bit more...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

2010-10-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916870#action_12916870
 ] 

Andrzej Bialecki  commented on NUTCH-907:
-

Hi Sertan,

Thanks for the patch, this looks very good! A few comments:

* I'm not good at naming things either... schemaId is a little bit cryptic, 
though. If we didn't already use crawlId I would vote for that (and then rename 
crawlId to batchId or fetchId). As it is now... I don't know, maybe datasetId?

* since we now create multiple datasets, we need somehow to manage them - i.e. 
list and delete at least (create is implicit). There is no such functionality 
in this patch, but this can be addressed also as a separate issue.

* IndexerMapReduce.createIndexJob: I think it would be useful to pass the 
datasetId as a Job property - this way indexing filter plugins can use this 
property to populate NutchDocument fields if needed. FWIW, this may be a good 
idea to do in other jobs as well...
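The last point above can be sketched concretely: the job sets a property, and indexing filter plugins read it back when building documents. Everything here is illustrative; the property key "nutch.dataset.id" and the field name "dataset" are made up, and a plain string map stands in for a Hadoop Configuration:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of passing the dataset id to plugins via a job
// property. The key name "nutch.dataset.id" is invented for illustration;
// the real name would be whatever the patch settles on.
public class IndexJobConfig {
    public static final String DATASET_ID_KEY = "nutch.dataset.id";

    // Stand-in for a Hadoop Configuration: just a string map here.
    public static Map<String, String> createIndexJobConf(String datasetId) {
        Map<String, String> conf = new HashMap<>();
        conf.put(DATASET_ID_KEY, datasetId);   // plugins read it back later
        return conf;
    }

    // What an indexing filter could do: read the property and stamp it on
    // the document (represented here as a field map).
    public static Map<String, String> filter(Map<String, String> conf,
                                             Map<String, String> doc) {
        String id = conf.get(DATASET_ID_KEY);
        if (id != null) doc.put("dataset", id);
        return doc;
    }
}
```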

 DataStore API doesn't support multiple storage areas for multiple disjoint 
 crawls
 -

 Key: NUTCH-907
 URL: https://issues.apache.org/jira/browse/NUTCH-907
 Project: Nutch
  Issue Type: Bug
Reporter: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-907.patch


 In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, 
 page data, linkdb, etc) by specifying a path where the data was stored. This 
 enabled users to run several disjoint crawls with different configs, but 
 still using the same storage medium, just under different paths.
 This is not possible now because there is a 1:1 mapping between a specific 
 DataStore instance and a set of crawl data.
 In order to support this functionality the Gora API should be extended so 
 that it can create stores (and data tables in the underlying storage) that 
 use arbitrary prefixes to identify the particular crawl dataset. Then the 
 Nutch API should be extended to allow passing this crawlId value to select 
 one of possibly many existing crawl datasets.
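The prefixing idea in the description can be sketched as a tiny naming rule: derive a physical table name from the base schema name plus the crawlId, falling back to the plain name when no crawlId is given. The separator and naming here are illustrative, not what Gora actually does:

```java
// Hypothetical sketch of the prefix scheme: disjoint crawls share one
// storage backend under different table-name prefixes.
public class CrawlDatasetNaming {
    public static String tableName(String crawlId, String baseName) {
        if (crawlId == null || crawlId.isEmpty()) {
            return baseName;                  // default, single-crawl layout
        }
        return crawlId + "_" + baseName;      // e.g. "newsCrawl_webpage"
    }
}
```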




[jira] Commented: (NUTCH-882) Design a Host table in GORA

2010-10-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916874#action_12916874
 ] 

Andrzej Bialecki  commented on NUTCH-882:
-

Doğacan, I missed your previous comment... the issue with partial Bloom filters 
is usually solved by having each task store its own filter. This worked well for 
MapFile-s because they consisted of multiple parts, so a Reader would open 
a part and a corresponding Bloom filter.

Here it's more complicated, I agree... though this reminds me of the situation 
that is handled by DynamicBloomFilter: it's basically a set of Bloom filters 
with a facade that hides this fact from the user. We could construct 
something similar here, i.e. not merge the partial filters after closing the 
output, but instead, when opening a Reader, read all partial filters and present 
them as one.
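The facade idea can be sketched as follows: keep the per-task partial filters untouched and answer membership by asking each part in turn (union semantics, like DynamicBloomFilter). The Bloom filter below is a toy with two hash probes over a BitSet, purely for illustration; it is not Hadoop's implementation:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Sketch of the reader-side facade: several partial Bloom filters (one per
// task) are left as-is, and the wrapper pretends they are a single filter.
public class BloomFilterFacade {
    public static class ToyBloom {
        private final BitSet bits = new BitSet(1 << 16);
        public void add(String key) {
            bits.set(Math.floorMod(key.hashCode(), 1 << 16));
            bits.set(Math.floorMod(key.hashCode() * 31 + 7, 1 << 16));
        }
        public boolean mightContain(String key) {
            return bits.get(Math.floorMod(key.hashCode(), 1 << 16))
                && bits.get(Math.floorMod(key.hashCode() * 31 + 7, 1 << 16));
        }
    }

    private final List<ToyBloom> parts = new ArrayList<>();

    public void addPart(ToyBloom part) { parts.add(part); }

    // Union semantics: present if ANY partial filter may contain the key.
    public boolean mightContain(String key) {
        for (ToyBloom p : parts) {
            if (p.mightContain(key)) return true;
        }
        return false;
    }
}
```

Queries cost one probe per part, which is the price paid for never merging; that matches the trade-off discussed above.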

 Design a Host table in GORA
 ---

 Key: NUTCH-882
 URL: https://issues.apache.org/jira/browse/NUTCH-882
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0

 Attachments: hostdb.patch, NUTCH-882-v1.patch


 Having a separate GORA table for storing information about hosts (and 
 domains?) would be very useful for: 
 * customising the behaviour of the fetching on a host basis e.g. number of 
 threads, min time between threads etc...
 * storing stats
 * keeping metadata and possibly propagate them to the webpages 
 * keeping a copy of the robots.txt and possibly use that later to filter the 
 webtable
 * store sitemaps files and update the webtable accordingly
 I'll try to come up with a GORA schema for such a host table, but any comments 
 are of course already welcome. 
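As a starting point for discussion, the use-cases above suggest roughly the fields below. This is only a rough sketch in plain Java; the field names are illustrative and not a committed GORA schema:

```java
import java.util.HashMap;
import java.util.Map;

// Rough sketch of what a host-table record could carry, derived from the
// use-cases listed above. Names and defaults are invented for illustration.
public class HostRecord {
    public final String hostName;
    public int maxThreads = 1;              // per-host fetch customisation
    public long minCrawlDelayMs = 5000;     // min time between requests
    public String robotsTxt = "";           // cached copy of robots.txt
    public final Map<String, String> metadata = new HashMap<>();
    public long fetchedPages = 0;           // simple stats counter

    public HostRecord(String hostName) { this.hostName = hostName; }
}
```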




[jira] Commented: (NUTCH-864) Fetcher generates entries with status 0

2010-10-01 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916888#action_12916888
 ] 

Doğacan Güney commented on NUTCH-864:
-

OK, let's do it :)

So, should we do a DataStore#get to read the previous status? I am not sure how 
best to implement this.




[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

2010-10-01 Thread Sertan Alkan (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916890#action_12916890
 ] 

Sertan Alkan commented on NUTCH-907:


Hi Andrzej,

Thanks for the review and the feedback.

* Funny thing, I was actually going for {{datasetId}} for the name, but now 
that you mention it, I prefer to use {{crawlId}} for this and rename the old 
{{crawlId}} to {{batchId}}. I am not entirely sure how invasive that's 
going to be, but I don't think it will be much of a hassle to change both 
at once.
* I agree that arguments should override the configuration by actually setting 
it, so that the setting is accessible elsewhere. I'll modify the patch to 
work this way.
* A utility to handle the datasets is a good idea, though considering the 
current GORA architecture I think we may need to add a client interface there 
somewhere. I've opened up an 
[issue|http://github.com/enis/gora/issues/issue/56] for this; we can start 
thinking about the design there. We won't be able to write a generic utility in 
Nutch, though, since this won't be available until we roll out a new version of 
Gora. I'll pitch in the utility once we have that, but as it doesn't affect 
this issue directly, I'd rather go for a separate issue. And until 
that issue is solved, I think it would be safe to leave manipulation of stores 
(listing, removing, truncation, etc.) as the user's responsibility.




[jira] Issue Comment Edited: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

2010-10-01 Thread Sertan Alkan (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916890#action_12916890
 ] 

Sertan Alkan edited comment on NUTCH-907 at 10/1/10 10:10 AM:
--

Hi Andrzej,

Thanks for the review and the feedback.

* Funny thing, I was actually going for {{datasetId}} for the name, but now 
that you mention it, I prefer to use {{crawlId}} for this and rename the old 
{{crawlId}} to {{batchId}}. I am not entirely sure how invasive that's 
going to be, but I don't think it will be much of a hassle to change both 
at once.
* I agree that arguments should override the configuration by actually setting 
it, so that the setting is accessible elsewhere. I'll modify the patch to 
work this way.
* A utility to handle the datasets is a good idea, though considering the 
current GORA architecture I think we may need to add a client interface there 
somewhere. I've opened up an 
[issue|http://github.com/enis/gora/issues/issue/56] for this; we can start 
thinking about the design there. We won't be able to write a generic utility in 
Nutch, though, since this won't be available until we roll out a new version of 
Gora. I'll pitch in the utility once we have that, but as it doesn't affect 
this issue directly, I'd rather go for a separate issue. And until 
that issue is solved, I think it would be safe to leave manipulation of stores 
(listing, removing, truncation, etc.) as the user's responsibility.

I'll modify the patch to reflect those two changes.




[jira] Commented: (NUTCH-894) Move statistical language identification from indexing to parsing step

2010-10-01 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916899#action_12916899
 ] 

Doğacan Güney commented on NUTCH-894:
-

+1 from me. 

If there are no objections in the next couple of days, I would like to 
commit this patch.

 Move statistical language identification from indexing to parsing step
 --

 Key: NUTCH-894
 URL: https://issues.apache.org/jira/browse/NUTCH-894
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 2.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0

 Attachments: NUTCH-894.patch


 The statistical identification of language is currently done in the 
 indexing step, whereas the detection based on the HTTP header and HTML code is 
 done during parsing.
 We could keep the same logic, i.e. do the statistical detection only if 
 nothing has been found by the previous methods, but as part of the parsing. 
 This would be useful for ParseFilters which need the language information, or 
 to use with ScoringFilters, e.g. to focus the crawl on a set of languages.
 Since the statistical models have been ported to Tika we should probably rely 
 on them instead of maintaining our own.
 Any thoughts on this?
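The fallback logic described above can be sketched as a simple chain: trust the HTTP header first, then the HTML metadata, and only run the (more expensive) statistical detector when neither gave an answer. The statistical step is stubbed here with a crude heuristic; in practice it would delegate to Tika's language detection:

```java
// Sketch of the parse-time fallback chain. Only the ordering is the point;
// the statisticalGuess stub is a placeholder, not a real detector.
public class ParseTimeLangId {
    public static String detect(String httpHeaderLang,
                                String htmlLang,
                                String text) {
        if (httpHeaderLang != null && !httpHeaderLang.isEmpty()) {
            return httpHeaderLang;           // cheapest, most explicit
        }
        if (htmlLang != null && !htmlLang.isEmpty()) {
            return htmlLang;                 // e.g. <html lang="fr">
        }
        return statisticalGuess(text);       // last resort, most expensive
    }

    // Stub: a real implementation would call a statistical model (Tika).
    static String statisticalGuess(String text) {
        return text != null && text.contains(" the ") ? "en" : "unknown";
    }
}
```

Running this at parse time means the result is already available to ParseFilters and ScoringFilters, which is the motivation given in the description.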




[jira] Commented: (NUTCH-864) Fetcher generates entries with status 0

2010-10-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916912#action_12916912
 ] 

Andrzej Bialecki  commented on NUTCH-864:
-

I think the difficulty comes from the simplification in 2.x as compared to 1.x, 
in that we keep a single status per page. In 1.x, a side-effect of having two 
locations with two statuses (one db status in crawldb and one fetch status 
in segments) was that we had more information in updatedb to act upon.

Now we should probably keep up to two statuses - one that reflects a temporary 
fetch status, as determined by the fetcher, and a final (reconciled) status as 
determined by updatedb, based on the knowledge of not only the plain fetch 
status and the old status but also possible redirects. If I'm not mistaken, the 
status is currently overwritten immediately by the fetcher, even before we get 
to updatedb, hence the problem.
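The two-status idea can be sketched as follows: the fetcher writes only a provisional fetch status, and updatedb reconciles it with the old db status into the final one. The status codes and the one reconciliation rule shown are purely illustrative, not Nutch's actual logic:

```java
// Hypothetical sketch of the two-status scheme: dbStatus is authoritative
// and only updatedb writes it; fetchStatus is a provisional slot that the
// fetcher fills and updatedb clears.
public class TwoStatusPage {
    public static final int UNFETCHED = 1, FETCHED = 2, GONE = 3,
                            REDIR_TEMP = 4;

    public int dbStatus = UNFETCHED;   // authoritative, set by updatedb
    public int fetchStatus = 0;        // provisional, set by fetcher only

    public void onFetch(int protocolStatus) {
        fetchStatus = protocolStatus;  // never touches dbStatus
    }

    // updatedb: combine the old db status with the provisional fetch status.
    public void reconcile() {
        if (fetchStatus == REDIR_TEMP && dbStatus == FETCHED) {
            // keep FETCHED: a temp redirect should not erase a good page
        } else if (fetchStatus != 0) {
            dbStatus = fetchStatus;
        }
        fetchStatus = 0;               // clear the provisional slot
    }
}
```

Because the fetcher never writes dbStatus, a crash between fetch and updatedb can no longer leave rows with a half-written (null) status, which is the symptom this issue reports.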




[jira] Commented: (NUTCH-894) Move statistical language identification from indexing to parsing step

2010-10-01 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916915#action_12916915
 ] 

Julien Nioche commented on NUTCH-894:
-

Nice one, that's exactly what I had in mind.
+1 for committing
