[jira] Commented: (NUTCH-864) Fetcher generates entries with status 0
[ https://issues.apache.org/jira/browse/NUTCH-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916840#action_12916840 ]

Doğacan Güney commented on NUTCH-864:

I don't think that's possible to do without doing a DataStore#get first, as we do not want to override the current status on the URL. I guess we could write the redirect status as a temporary status somewhere, but that would be too complex IMHO. Julien, any ideas on how to set a redirect status without overwriting the current one?

Fetcher generates entries with status 0
---------------------------------------

Key: NUTCH-864
URL: https://issues.apache.org/jira/browse/NUTCH-864
Project: Nutch
Issue Type: Bug
Components: fetcher
Environment: Gora with SQLBackend
  URL: https://svn.apache.org/repos/asf/nutch/branches/nutchbase
  Last Changed Rev: 980748
  Last Changed Date: 2010-07-30 14:19:52 +0200 (Fri, 30 Jul 2010)
Reporter: Julien Nioche
Assignee: Doğacan Güney
Fix For: 2.0

After a round of fetching which got the following protocol status:

10/07/30 15:11:39 INFO mapred.JobClient: ACCESS_DENIED=2
10/07/30 15:11:39 INFO mapred.JobClient: SUCCESS=1177
10/07/30 15:11:39 INFO mapred.JobClient: GONE=3
10/07/30 15:11:39 INFO mapred.JobClient: TEMP_MOVED=138
10/07/30 15:11:39 INFO mapred.JobClient: EXCEPTION=93
10/07/30 15:11:39 INFO mapred.JobClient: MOVED=521
10/07/30 15:11:39 INFO mapred.JobClient: NOTFOUND=62

I ran: ./nutch org.apache.nutch.crawl.WebTableReader -stats

10/07/30 15:12:37 INFO crawl.WebTableReader: Statistics for WebTable:
10/07/30 15:12:37 INFO crawl.WebTableReader: TOTAL urls: 2690
10/07/30 15:12:37 INFO crawl.WebTableReader: retry 0: 2690
10/07/30 15:12:37 INFO crawl.WebTableReader: min score: 0.0
10/07/30 15:12:37 INFO crawl.WebTableReader: avg score: 0.7587361
10/07/30 15:12:37 INFO crawl.WebTableReader: max score: 1.0
10/07/30 15:12:37 INFO crawl.WebTableReader: status 0 (null): 649
10/07/30 15:12:37 INFO crawl.WebTableReader: status 2 (status_fetched): 1177 (SUCCESS=1177)
10/07/30 15:12:37 INFO crawl.WebTableReader: status 3 (status_gone): 112
10/07/30 15:12:37 INFO crawl.WebTableReader: status 34 (status_retry): 93 (EXCEPTION=93)
10/07/30 15:12:37 INFO crawl.WebTableReader: status 4 (status_redir_temp): 138 (TEMP_MOVED=138)
10/07/30 15:12:37 INFO crawl.WebTableReader: status 5 (status_redir_perm): 521 (MOVED=521)
10/07/30 15:12:37 INFO crawl.WebTableReader: WebTable statistics: done

There should not be any entries with status 0 (null). I will investigate a bit more...

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls
[ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916870#action_12916870 ]

Andrzej Bialecki commented on NUTCH-907:

Hi Sertan,

Thanks for the patch, this looks very good! A few comments:

* I'm not good at naming things either... schemaId is a little bit cryptic, though. If we didn't already use crawlId I would vote for that (and then rename crawlId to batchId or fetchId). As it is now... I don't know, maybe datasetId?
* Since we now create multiple datasets, we need to manage them somehow - i.e. at least list and delete them (create is implicit). There is no such functionality in this patch, but this can also be addressed as a separate issue.
* IndexerMapReduce.createIndexJob: I think it would be useful to pass the datasetId as a Job property - this way indexing filter plugins can use this property to populate NutchDocument fields if needed. FWIW, this may be a good idea to do in other jobs as well...

DataStore API doesn't support multiple storage areas for multiple disjoint crawls
---------------------------------------------------------------------------------

Key: NUTCH-907
URL: https://issues.apache.org/jira/browse/NUTCH-907
Project: Nutch
Issue Type: Bug
Reporter: Andrzej Bialecki
Fix For: 2.0
Attachments: NUTCH-907.patch

In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, page data, linkdb, etc.) by specifying a path where the data was stored. This enabled users to run several disjoint crawls with different configs, but still using the same storage medium, just under different paths. This is not possible now because there is a 1:1 mapping between a specific DataStore instance and a set of crawl data. In order to support this functionality the Gora API should be extended so that it can create stores (and data tables in the underlying storage) that use arbitrary prefixes to identify the particular crawl dataset. Then the Nutch API should be extended to allow passing this crawlId value to select one of possibly many existing crawl datasets.
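The last bullet above - passing the dataset id into the job configuration so indexing filter plugins can read it later - can be sketched as follows. This is a minimal illustration, not Nutch code: java.util.Properties stands in for Hadoop's job Configuration, and the property name nutch.dataset.id is hypothetical.

```java
import java.util.Properties;

// Sketch: job setup writes the dataset id into the job's configuration, and an
// indexing filter later reads it back to populate a document field.
// Properties stands in for Hadoop's Configuration; the key name is hypothetical.
public class DatasetIdSketch {
    static final String DATASET_ID_KEY = "nutch.dataset.id";

    // Analogue of IndexerMapReduce.createIndexJob setting the property.
    static Properties createIndexJob(String datasetId) {
        Properties jobConf = new Properties();
        jobConf.setProperty(DATASET_ID_KEY, datasetId);
        return jobConf;
    }

    // Analogue of an indexing filter plugin reading the property back,
    // with a fallback when no dataset id was set.
    static String datasetField(Properties jobConf) {
        return jobConf.getProperty(DATASET_ID_KEY, "default");
    }

    public static void main(String[] args) {
        Properties jobConf = createIndexJob("crawl-2010-10");
        System.out.println(datasetField(jobConf));
    }
}
```

The point of setting it as a job property rather than passing it as an argument is that any plugin running inside the job can see it without API changes.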
[jira] Commented: (NUTCH-882) Design a Host table in GORA
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916874#action_12916874 ]

Andrzej Bialecki commented on NUTCH-882:

Doğacan, I missed your previous comment... The issue with partial Bloom filters is usually solved by having each task store its own filter - this worked well for MapFile-s because they consisted of multiple parts, so a Reader would open a part and a corresponding Bloom filter. Here it's more complicated, I agree... though this reminds me of the situation that is handled by DynamicBloomFilter: it's basically a set of Bloom filters with a facade that hides this fact from the user. Here we could construct something similar, i.e. don't merge partial filters after closing the output, but instead, when opening a Reader, read all partial filters and pretend they are one.

Design a Host table in GORA
---------------------------

Key: NUTCH-882
URL: https://issues.apache.org/jira/browse/NUTCH-882
Project: Nutch
Issue Type: New Feature
Affects Versions: 2.0
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 2.0
Attachments: hostdb.patch, NUTCH-882-v1.patch

Having a separate GORA table for storing information about hosts (and domains?) would be very useful for:

* customising the behaviour of the fetching on a host basis, e.g. number of threads, min time between threads, etc.
* storing stats
* keeping metadata and possibly propagating them to the webpages
* keeping a copy of the robots.txt and possibly using that later to filter the webtable
* storing sitemap files and updating the webtable accordingly

I'll try to come up with a GORA schema for such a host table but any comments are of course already welcome.
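The facade idea in the comment above can be sketched as below. All names here are hypothetical, and the per-part filter is a toy single-hash filter standing in for a real Bloom filter (it is not Hadoop's DynamicBloomFilter): the Reader loads every partial filter and answers "might contain" if any part does, so the parts never need to be merged.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the facade: partial Bloom filters (one per task output) are kept
// as-is, and a composite pretends they are a single filter. Hypothetical names;
// SimpleBloom is a toy single-hash filter, not a production Bloom filter.
public class BloomFacadeSketch {

    interface Membership {
        boolean mightContain(String key);
    }

    // Toy per-part filter: one hash function over a fixed bit array.
    static class SimpleBloom implements Membership {
        private final boolean[] bits = new boolean[1 << 16];

        void add(String key) {
            bits[index(key)] = true;
        }

        public boolean mightContain(String key) {
            return bits[index(key)];
        }

        private int index(String key) {
            return (key.hashCode() & 0x7fffffff) % bits.length;
        }
    }

    // The facade a Reader would open over all partial filters instead of
    // merging them: membership in any part counts as membership overall.
    static class CompositeBloom implements Membership {
        private final List<Membership> parts = new ArrayList<>();

        void addPart(Membership part) {
            parts.add(part);
        }

        public boolean mightContain(String key) {
            for (Membership part : parts) {
                if (part.mightContain(key)) {
                    return true;
                }
            }
            return false;
        }
    }
}
```

Like DynamicBloomFilter, the composite preserves the Bloom filter contract: no false negatives, and a false-positive rate bounded by the sum of the parts' rates.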
[jira] Commented: (NUTCH-864) Fetcher generates entries with status 0
[ https://issues.apache.org/jira/browse/NUTCH-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916888#action_12916888 ]

Doğacan Güney commented on NUTCH-864:

OK, let's do it :) So, should we do a DataStore#get to read the previous status? I am not sure how best to implement this.
[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls
[ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916890#action_12916890 ]

Sertan Alkan commented on NUTCH-907:

Hi Andrzej,

Thanks for the review and the feedback.

* Funny thing, I was actually going for {{datasetId}} for the name, but now that you mention it, I prefer to use {{crawlId}} for this and rename the old {{crawlId}} to {{batchId}}. I am not entirely sure how invasive that's going to be, but I don't think it will be much of a hassle to change both all at once.
* I agree that arguments should override the configuration by actually setting it, so that the setting is accessible elsewhere. I'll modify the patch to work this way.
* A utility to handle the datasets is a good idea; considering the current GORA architecture, though, I think we may need to add a client interface there somewhere. I've opened up an [issue|http://github.com/enis/gora/issues/issue/56] for this; we can start thinking about the design there. We won't be able to write a generic utility in Nutch, though, since this won't be available till we roll out a new version of Gora. I'll pitch in the utility once we have that, but as it doesn't affect this issue directly, I'd rather go for a separate issue. Until that issue is solved, I think it would be safe to leave manipulation of stores (listing, removing, truncation, etc.) as the user's responsibility.
[jira] Issue Comment Edited: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls
[ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916890#action_12916890 ]

Sertan Alkan edited comment on NUTCH-907 at 10/1/10 10:10 AM:

Hi Andrzej,

Thanks for the review and the feedback.

* Funny thing, I was actually going for {{datasetId}} for the name, but now that you mention it, I prefer to use {{crawlId}} for this and rename the old {{crawlId}} to {{batchId}}. I am not entirely sure how invasive that's going to be, but I don't think it will be much of a hassle to change both all at once.
* I agree that arguments should override the configuration by actually setting it, so that the setting is accessible elsewhere. I'll modify the patch to work this way.
* A utility to handle the datasets is a good idea; considering the current GORA architecture, though, I think we may need to add a client interface there somewhere. I've opened up an [issue|http://github.com/enis/gora/issues/issue/56] for this; we can start thinking about the design there. We won't be able to write a generic utility in Nutch, though, since this won't be available till we roll out a new version of Gora. I'll pitch in the utility once we have that, but as it doesn't affect this issue directly, I'd rather go for a separate issue. Until that issue is solved, I think it would be safe to leave manipulation of stores (listing, removing, truncation, etc.) as the user's responsibility.

I'll modify the patch to reflect those two changes.
[jira] Commented: (NUTCH-894) Move statistical language identification from indexing to parsing step
[ https://issues.apache.org/jira/browse/NUTCH-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916899#action_12916899 ]

Doğacan Güney commented on NUTCH-894:

+1 from me. If there are no objections in the next couple of days or so, I would like to commit this patch.

Move statistical language identification from indexing to parsing step
----------------------------------------------------------------------

Key: NUTCH-894
URL: https://issues.apache.org/jira/browse/NUTCH-894
Project: Nutch
Issue Type: Improvement
Components: parser
Affects Versions: 2.0
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 2.0
Attachments: NUTCH-894.patch

The statistical identification of language is currently done partly in the indexing step, whereas the detection based on the HTTP header and HTML code is done during parsing. We could keep the same logic, i.e. do the statistical detection only if nothing has been found with the previous methods, but as part of the parsing. This would be useful for ParseFilters which need the language information, or to use with ScoringFilters, e.g. to focus the crawl on a set of languages. Since the statistical models have been ported to Tika we should probably rely on them instead of maintaining our own. Any thoughts on this?
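The fallback logic described in the issue - trust the HTTP header or HTML metadata first, and run statistical detection only when neither yields a language - can be sketched as below. The class and method names are hypothetical, and the statistical step is a crude stub standing in for a real model such as Tika's language identifier.

```java
// Sketch of the proposed parse-time fallback: explicit language signals
// (HTTP Content-Language header, HTML meta tag) win; statistical detection
// runs only when both are absent. Hypothetical names; the "statistical"
// detector is a stub, not Tika's real identifier.
public class LanguageFallbackSketch {

    static String identifyLanguage(String httpHeaderLang, String htmlMetaLang,
                                   String pageText) {
        if (httpHeaderLang != null && !httpHeaderLang.isEmpty()) {
            return httpHeaderLang;
        }
        if (htmlMetaLang != null && !htmlMetaLang.isEmpty()) {
            return htmlMetaLang;
        }
        return statisticalDetect(pageText);
    }

    // Stub: a real implementation would delegate to a statistical model
    // trained on character n-grams.
    static String statisticalDetect(String pageText) {
        return pageText.contains(" the ") ? "en" : "unknown";
    }

    public static void main(String[] args) {
        System.out.println(identifyLanguage("fr", null, "..."));
        System.out.println(identifyLanguage(null, "de", "..."));
        System.out.println(identifyLanguage(null, null, "over the hill we go"));
    }
}
```

Running this at parse time, as proposed, means the result is already present in the WebPage row when ParseFilters and ScoringFilters run, instead of appearing only at indexing.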
[jira] Commented: (NUTCH-864) Fetcher generates entries with status 0
[ https://issues.apache.org/jira/browse/NUTCH-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916912#action_12916912 ]

Andrzej Bialecki commented on NUTCH-864:

I think the difficulty comes from the simplification in 2.x as compared to 1.x, in that we keep a single status per page. In 1.x a side-effect of having two locations with two statuses (one db status in crawldb and one fetch status in segments) was that we had more information in updatedb to act upon. Now we should probably keep up to two statuses - one that reflects a temporary fetch status, as determined by the fetcher, and a final (reconciled) status as determined by updatedb, based on the knowledge of not only the plain fetch status and the old status but also possible redirects. If I'm not mistaken, currently the status is immediately overwritten by the fetcher, even before we come to updatedb, hence the problem...
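The two-status scheme proposed above can be sketched as follows. All names are hypothetical (not actual Nutch 2.x fields): the fetcher writes only a temporary fetch status, leaving the db status untouched, and updatedb later reconciles the two, so a redirect result never blindly overwrites a previously established state.

```java
// Sketch of the proposed two-status scheme (hypothetical names, not Nutch
// code): the fetcher records its result in a temporary field; updatedb owns
// the final db status and reconciles the two, including redirects.
public class TwoStatusSketch {

    enum DbStatus { UNFETCHED, FETCHED, GONE, REDIR_TEMP, REDIR_PERM }
    enum FetchStatus { NONE, SUCCESS, GONE, TEMP_MOVED, MOVED }

    static class Page {
        DbStatus dbStatus = DbStatus.UNFETCHED;     // final, owned by updatedb
        FetchStatus fetchStatus = FetchStatus.NONE; // temporary, owned by fetcher
    }

    // Fetcher: touch only the temporary status, never the db status.
    static void fetch(Page page, FetchStatus result) {
        page.fetchStatus = result;
    }

    // Updatedb: fold the temporary fetch status into the final db status.
    static void updatedb(Page page) {
        switch (page.fetchStatus) {
            case SUCCESS:    page.dbStatus = DbStatus.FETCHED;    break;
            case GONE:       page.dbStatus = DbStatus.GONE;       break;
            case TEMP_MOVED: page.dbStatus = DbStatus.REDIR_TEMP; break;
            case MOVED:      page.dbStatus = DbStatus.REDIR_PERM; break;
            case NONE:       /* not fetched this round: keep old status */ break;
        }
        page.fetchStatus = FetchStatus.NONE; // clear the temporary status
    }
}
```

Under this split the "status 0" symptom cannot occur from the fetcher alone, because the db status is only ever written by updatedb, which always picks a defined value.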
[jira] Commented: (NUTCH-894) Move statistical language identification from indexing to parsing step
[ https://issues.apache.org/jira/browse/NUTCH-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916915#action_12916915 ]

Julien Nioche commented on NUTCH-894:

Nice one, that's exactly what I had in mind. +1 for committing.