[jira] Commented: (NUTCH-864) Fetcher generates entries with status 0
[ https://issues.apache.org/jira/browse/NUTCH-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916840#action_12916840 ]

Doğacan Güney commented on NUTCH-864:

I don't think that's possible to do without doing a DataStore#get first, as we do not want to override the current status on the URL. I guess we could write the redirect status as a temporary status somewhere, but that would be too complex IMHO. Julien, any ideas on how to set a redirect status without overwriting the current one?

Fetcher generates entries with status 0
---------------------------------------

Key: NUTCH-864
URL: https://issues.apache.org/jira/browse/NUTCH-864
Project: Nutch
Issue Type: Bug
Components: fetcher
Environment: Gora with SQLBackend
  URL: https://svn.apache.org/repos/asf/nutch/branches/nutchbase
  Last Changed Rev: 980748
  Last Changed Date: 2010-07-30 14:19:52 +0200 (Fri, 30 Jul 2010)
Reporter: Julien Nioche
Assignee: Doğacan Güney
Fix For: 2.0

After a round of fetching which got the following protocol status:

10/07/30 15:11:39 INFO mapred.JobClient: ACCESS_DENIED=2
10/07/30 15:11:39 INFO mapred.JobClient: SUCCESS=1177
10/07/30 15:11:39 INFO mapred.JobClient: GONE=3
10/07/30 15:11:39 INFO mapred.JobClient: TEMP_MOVED=138
10/07/30 15:11:39 INFO mapred.JobClient: EXCEPTION=93
10/07/30 15:11:39 INFO mapred.JobClient: MOVED=521
10/07/30 15:11:39 INFO mapred.JobClient: NOTFOUND=62

I ran: ./nutch org.apache.nutch.crawl.WebTableReader -stats

10/07/30 15:12:37 INFO crawl.WebTableReader: Statistics for WebTable:
10/07/30 15:12:37 INFO crawl.WebTableReader: TOTAL urls: 2690
10/07/30 15:12:37 INFO crawl.WebTableReader: retry 0: 2690
10/07/30 15:12:37 INFO crawl.WebTableReader: min score: 0.0
10/07/30 15:12:37 INFO crawl.WebTableReader: avg score: 0.7587361
10/07/30 15:12:37 INFO crawl.WebTableReader: max score: 1.0
10/07/30 15:12:37 INFO crawl.WebTableReader: status 0 (null): 649
10/07/30 15:12:37 INFO crawl.WebTableReader: status 2 (status_fetched): 1177 (SUCCESS=1177)
10/07/30 15:12:37 INFO crawl.WebTableReader: status 3 (status_gone): 112
10/07/30 15:12:37 INFO crawl.WebTableReader: status 34 (status_retry): 93 (EXCEPTION=93)
10/07/30 15:12:37 INFO crawl.WebTableReader: status 4 (status_redir_temp): 138 (TEMP_MOVED=138)
10/07/30 15:12:37 INFO crawl.WebTableReader: status 5 (status_redir_perm): 521 (MOVED=521)
10/07/30 15:12:37 INFO crawl.WebTableReader: WebTable statistics: done

There should not be any entries with status 0 (null). I will investigate a bit more...

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls
[ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916870#action_12916870 ]

Andrzej Bialecki commented on NUTCH-907:

Hi Sertan,

Thanks for the patch, this looks very good! A few comments:

* I'm not good at naming things either... schemaId is a little bit cryptic, though. If we didn't already use crawlId I would vote for that (and then rename crawlId to batchId or fetchId). As it is now... I don't know, maybe datasetId?
* Since we now create multiple datasets, we need to manage them somehow - i.e. at least list and delete them (create is implicit). There is no such functionality in this patch, but this can also be addressed as a separate issue.
* IndexerMapReduce.createIndexJob: I think it would be useful to pass the datasetId as a Job property - this way indexing filter plugins can use this property to populate NutchDocument fields if needed. FWIW, this may be a good idea to do in other jobs as well...

DataStore API doesn't support multiple storage areas for multiple disjoint crawls
---------------------------------------------------------------------------------

Key: NUTCH-907
URL: https://issues.apache.org/jira/browse/NUTCH-907
Project: Nutch
Issue Type: Bug
Reporter: Andrzej Bialecki
Fix For: 2.0
Attachments: NUTCH-907.patch

In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, page data, linkdb, etc.) by specifying a path where the data was stored. This enabled users to run several disjoint crawls with different configs, but still using the same storage medium, just under different paths. This is not possible now because there is a 1:1 mapping between a specific DataStore instance and a set of crawl data. In order to support this functionality the Gora API should be extended so that it can create stores (and data tables in the underlying storage) that use arbitrary prefixes to identify the particular crawl dataset. Then the Nutch API should be extended to allow passing this crawlId value to select one of possibly many existing crawl datasets.
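The last bullet above - passing the dataset id into the job configuration so indexing filter plugins can read it later - can be sketched as follows. This is a minimal illustration, not Nutch code: java.util.Properties stands in for Hadoop's job Configuration, and the property name nutch.dataset.id is hypothetical.

```java
import java.util.Properties;

// Sketch: job setup writes the dataset id into the job's configuration, and an
// indexing filter later reads it back to populate a document field.
// Properties stands in for Hadoop's Configuration; the key name is hypothetical.
public class DatasetIdSketch {
    static final String DATASET_ID_KEY = "nutch.dataset.id";

    // Analogue of IndexerMapReduce.createIndexJob setting the property.
    static Properties createIndexJob(String datasetId) {
        Properties jobConf = new Properties();
        jobConf.setProperty(DATASET_ID_KEY, datasetId);
        return jobConf;
    }

    // Analogue of an indexing filter plugin reading the property back,
    // with a fallback when no dataset id was set.
    static String datasetField(Properties jobConf) {
        return jobConf.getProperty(DATASET_ID_KEY, "default");
    }

    public static void main(String[] args) {
        Properties jobConf = createIndexJob("crawl-2010-10");
        System.out.println(datasetField(jobConf));
    }
}
```

The point of setting it as a job property rather than passing it as an argument is that any plugin running inside the job can see it without API changes.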
[jira] Commented: (NUTCH-882) Design a Host table in GORA
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916874#action_12916874 ]

Andrzej Bialecki commented on NUTCH-882:

Doğacan, I missed your previous comment... The issue with partial Bloom filters is usually solved by having each task store its own filter - this worked well for MapFile-s because they consisted of multiple parts, so a Reader would open a part and a corresponding Bloom filter. Here it's more complicated, I agree... though this reminds me of the situation that is handled by DynamicBloomFilter: it's basically a set of Bloom filters with a facade that hides this fact from the user. Here we could construct something similar, i.e. don't merge partial filters after closing the output, but instead, when opening a Reader, read all partial filters and pretend they are one.

Design a Host table in GORA
---------------------------

Key: NUTCH-882
URL: https://issues.apache.org/jira/browse/NUTCH-882
Project: Nutch
Issue Type: New Feature
Affects Versions: 2.0
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 2.0
Attachments: hostdb.patch, NUTCH-882-v1.patch

Having a separate GORA table for storing information about hosts (and domains?) would be very useful for:

* customising the behaviour of the fetching on a host basis, e.g. number of threads, min time between threads, etc.
* storing stats
* keeping metadata and possibly propagating them to the webpages
* keeping a copy of the robots.txt and possibly using that later to filter the webtable
* storing sitemap files and updating the webtable accordingly

I'll try to come up with a GORA schema for such a host table but any comments are of course already welcome.
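The facade idea in the comment above can be sketched as below. All names here are hypothetical, and the per-part filter is a toy single-hash filter standing in for a real Bloom filter (it is not Hadoop's DynamicBloomFilter): the Reader loads every partial filter and answers "might contain" if any part does, so the parts never need to be merged.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the facade: partial Bloom filters (one per task output) are kept
// as-is, and a composite pretends they are a single filter. Hypothetical names;
// SimpleBloom is a toy single-hash filter, not a production Bloom filter.
public class BloomFacadeSketch {

    interface Membership {
        boolean mightContain(String key);
    }

    // Toy per-part filter: one hash function over a fixed bit array.
    static class SimpleBloom implements Membership {
        private final boolean[] bits = new boolean[1 << 16];

        void add(String key) {
            bits[index(key)] = true;
        }

        public boolean mightContain(String key) {
            return bits[index(key)];
        }

        private int index(String key) {
            return (key.hashCode() & 0x7fffffff) % bits.length;
        }
    }

    // The facade a Reader would open over all partial filters instead of
    // merging them: membership in any part counts as membership overall.
    static class CompositeBloom implements Membership {
        private final List<Membership> parts = new ArrayList<>();

        void addPart(Membership part) {
            parts.add(part);
        }

        public boolean mightContain(String key) {
            for (Membership part : parts) {
                if (part.mightContain(key)) {
                    return true;
                }
            }
            return false;
        }
    }
}
```

Like DynamicBloomFilter, the composite preserves the Bloom filter contract: no false negatives, and a false-positive rate bounded by the sum of the parts' rates.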
[jira] Commented: (NUTCH-864) Fetcher generates entries with status 0
[ https://issues.apache.org/jira/browse/NUTCH-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916888#action_12916888 ]

Doğacan Güney commented on NUTCH-864:

OK, let's do it :) So, should we do a DataStore#get to read the previous status? I am not sure how best to implement this.
[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls
[ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916890#action_12916890 ]

Sertan Alkan commented on NUTCH-907:

Hi Andrzej,

Thanks for the review and the feedback.

* Funny thing, I was actually going for {{datasetId}} for the name, but now that you mention it, I prefer to use {{crawlId}} for this and rename the old {{crawlId}} to {{batchId}}. I am not entirely sure how invasive that's going to be, but I don't think it will be much of a hassle to change both all at once.
* I agree that arguments should override the configuration by actually setting it, so that the setting is accessible elsewhere. I'll modify the patch to work this way.
* A utility to handle the datasets is a good idea; considering the current GORA architecture, though, I think we may need to add a client interface there somewhere. I've opened up an [issue|http://github.com/enis/gora/issues/issue/56] for this; we can start thinking about the design there. We won't be able to write a generic utility in Nutch, though, since this won't be available till we roll out a new version of Gora. I'll pitch in the utility once we have that, but as it doesn't affect this issue directly, I'd rather go for a separate issue. Until that issue is solved, I think it would be safe to leave manipulation of stores (listing, removing, truncation, etc.) as the user's responsibility.
[jira] Issue Comment Edited: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls
[ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916890#action_12916890 ]

Sertan Alkan edited comment on NUTCH-907 at 10/1/10 10:10 AM:

Hi Andrzej,

Thanks for the review and the feedback.

* Funny thing, I was actually going for {{datasetId}} for the name, but now that you mention it, I prefer to use {{crawlId}} for this and rename the old {{crawlId}} to {{batchId}}. I am not entirely sure how invasive that's going to be, but I don't think it will be much of a hassle to change both all at once.
* I agree that arguments should override the configuration by actually setting it, so that the setting is accessible elsewhere. I'll modify the patch to work this way.
* A utility to handle the datasets is a good idea; considering the current GORA architecture, though, I think we may need to add a client interface there somewhere. I've opened up an [issue|http://github.com/enis/gora/issues/issue/56] for this; we can start thinking about the design there. We won't be able to write a generic utility in Nutch, though, since this won't be available till we roll out a new version of Gora. I'll pitch in the utility once we have that, but as it doesn't affect this issue directly, I'd rather go for a separate issue. Until that issue is solved, I think it would be safe to leave manipulation of stores (listing, removing, truncation, etc.) as the user's responsibility.

I'll modify the patch to reflect those two changes.
[jira] Commented: (NUTCH-894) Move statistical language identification from indexing to parsing step
[ https://issues.apache.org/jira/browse/NUTCH-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916899#action_12916899 ]

Doğacan Güney commented on NUTCH-894:

+1 from me. If there are no objections in the next couple of days or so, I would like to commit this patch.

Move statistical language identification from indexing to parsing step
----------------------------------------------------------------------

Key: NUTCH-894
URL: https://issues.apache.org/jira/browse/NUTCH-894
Project: Nutch
Issue Type: Improvement
Components: parser
Affects Versions: 2.0
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 2.0
Attachments: NUTCH-894.patch

The statistical identification of language is currently done partly in the indexing step, whereas the detection based on the HTTP header and HTML code is done during parsing. We could keep the same logic, i.e. do the statistical detection only if nothing has been found with the previous methods, but as part of the parsing. This would be useful for ParseFilters which need the language information, or to use with ScoringFilters, e.g. to focus the crawl on a set of languages. Since the statistical models have been ported to Tika we should probably rely on them instead of maintaining our own. Any thoughts on this?
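The fallback logic described in the issue - trust the HTTP header or HTML metadata first, and run statistical detection only when neither yields a language - can be sketched as below. The class and method names are hypothetical, and the statistical step is a crude stub standing in for a real model such as Tika's language identifier.

```java
// Sketch of the proposed parse-time fallback: explicit language signals
// (HTTP Content-Language header, HTML meta tag) win; statistical detection
// runs only when both are absent. Hypothetical names; the "statistical"
// detector is a stub, not Tika's real identifier.
public class LanguageFallbackSketch {

    static String identifyLanguage(String httpHeaderLang, String htmlMetaLang,
                                   String pageText) {
        if (httpHeaderLang != null && !httpHeaderLang.isEmpty()) {
            return httpHeaderLang;
        }
        if (htmlMetaLang != null && !htmlMetaLang.isEmpty()) {
            return htmlMetaLang;
        }
        return statisticalDetect(pageText);
    }

    // Stub: a real implementation would delegate to a statistical model
    // trained on character n-grams.
    static String statisticalDetect(String pageText) {
        return pageText.contains(" the ") ? "en" : "unknown";
    }

    public static void main(String[] args) {
        System.out.println(identifyLanguage("fr", null, "..."));
        System.out.println(identifyLanguage(null, "de", "..."));
        System.out.println(identifyLanguage(null, null, "over the hill we go"));
    }
}
```

Running this at parse time, as proposed, means the result is already present in the WebPage row when ParseFilters and ScoringFilters run, instead of appearing only at indexing.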
[jira] Commented: (NUTCH-864) Fetcher generates entries with status 0
[ https://issues.apache.org/jira/browse/NUTCH-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916912#action_12916912 ]

Andrzej Bialecki commented on NUTCH-864:

I think the difficulty comes from the simplification in 2.x as compared to 1.x, in that we keep a single status per page. In 1.x a side-effect of having two locations with two statuses (one db status in crawldb and one fetch status in segments) was that we had more information in updatedb to act upon. Now we should probably keep up to two statuses - one that reflects a temporary fetch status, as determined by the fetcher, and a final (reconciled) status as determined by updatedb, based on the knowledge of not only the plain fetch status and the old status but also possible redirects. If I'm not mistaken, currently the status is immediately overwritten by the fetcher, even before we come to updatedb, hence the problem...
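The two-status scheme proposed above can be sketched as follows. All names are hypothetical (not actual Nutch 2.x fields): the fetcher writes only a temporary fetch status, leaving the db status untouched, and updatedb later reconciles the two, so a redirect result never blindly overwrites a previously established state.

```java
// Sketch of the proposed two-status scheme (hypothetical names, not Nutch
// code): the fetcher records its result in a temporary field; updatedb owns
// the final db status and reconciles the two, including redirects.
public class TwoStatusSketch {

    enum DbStatus { UNFETCHED, FETCHED, GONE, REDIR_TEMP, REDIR_PERM }
    enum FetchStatus { NONE, SUCCESS, GONE, TEMP_MOVED, MOVED }

    static class Page {
        DbStatus dbStatus = DbStatus.UNFETCHED;     // final, owned by updatedb
        FetchStatus fetchStatus = FetchStatus.NONE; // temporary, owned by fetcher
    }

    // Fetcher: touch only the temporary status, never the db status.
    static void fetch(Page page, FetchStatus result) {
        page.fetchStatus = result;
    }

    // Updatedb: fold the temporary fetch status into the final db status.
    static void updatedb(Page page) {
        switch (page.fetchStatus) {
            case SUCCESS:    page.dbStatus = DbStatus.FETCHED;    break;
            case GONE:       page.dbStatus = DbStatus.GONE;       break;
            case TEMP_MOVED: page.dbStatus = DbStatus.REDIR_TEMP; break;
            case MOVED:      page.dbStatus = DbStatus.REDIR_PERM; break;
            case NONE:       /* not fetched this round: keep old status */ break;
        }
        page.fetchStatus = FetchStatus.NONE; // clear the temporary status
    }
}
```

Under this split the "status 0" symptom cannot occur from the fetcher alone, because the db status is only ever written by updatedb, which always picks a defined value.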
[jira] Commented: (NUTCH-894) Move statistical language identification from indexing to parsing step
[ https://issues.apache.org/jira/browse/NUTCH-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916915#action_12916915 ]

Julien Nioche commented on NUTCH-894:

Nice one, that's exactly what I had in mind. +1 for committing.