Re: [VOTE] Move 2.0 out of trunk
On 18/09/2011 02:21, Julien Nioche wrote:
> Hi,
> Following the discussions [1] on the dev-list about the future of Nutch 2.0, I would like to call for a vote on moving Nutch 2.0 from the trunk to a separate branch, promoting 1.4 to trunk, and considering 2.0 as unmaintained. The arguments for/against can be found in the thread I mentioned. The vote is open for the next 72 hours.
>
> [ ] +1 : Shelve 2.0 and move 1.4 to trunk
> [ ] 0 : No opinion
> [ ] -1 : Bad idea. Please give justification.

+1 - at this time it's clear that 2.0 didn't pan out as we expected; we should restart from the 1.x codebase for a usable platform and continue the redesign from there.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
[jira] [Commented] (NUTCH-1052) Multiple deletes of the same URL using SolrClean
[ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108633#comment-13108633 ]

Julien Nioche commented on NUTCH-1052:
--------------------------------------

I like the original idea and agree that having to read/write the whole crawldb once more would be a pain for large crawls. This is a good example of what 2.0 could add (or could have added, if you are pessimistic).

I agree with your suggestion for an alternative to using null as the value, namely encoding the action (add, delete) either as a complex object in the key or as part of the value. The latter makes more sense, as it is unlikely that we'd add AND delete the same document as part of the same batch. Could you include that in your patch?

> Multiple deletes of the same URL using SolrClean
> ------------------------------------------------
>
>                 Key: NUTCH-1052
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1052
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.3, 1.4
>            Reporter: Tim Pease
>            Assignee: Julien Nioche
>             Fix For: 1.4, 2.0
>         Attachments: NUTCH-1052-1.4-1.patch, NUTCH-1052-1.4-2.patch, NUTCH-1052-1.4-3.patch
>
> The SolrClean class does not keep track of purged URLs; it only checks the URL status for db_gone. When run multiple times, the same list of URLs will be deleted from Solr. For small, stable crawl databases this is not a problem, but for larger crawls it could be an issue: SolrClean will become an expensive operation.
>
> One solution is to add a purged flag in the CrawlDatum metadata. SolrClean would then check this flag in addition to the db_gone status before adding the URL to the delete list. Another solution is to add a new state to the status field, db_gone_and_purged. Either way, the crawl DB will need to be updated after the Solr delete has successfully occurred.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1052) Multiple deletes of the same URL using SolrClean
[ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108641#comment-13108641 ]

Markus Jelsma commented on NUTCH-1052:
--------------------------------------

Thanks for your comments! Just to make sure I understand you correctly: you agree we should use a container object that holds the NutchDocument and the action (ADD || DELETE) and pass it to RecordWriter.write(), instead of abusing NULL encoded as a delete action as I do now? Then I'd need to add a class somewhere, such as:

{code}
class NutchIndexAction {
  public static enum Action { ADD, DELETE }

  public NutchDocument doc;
  public Action action;
}
{code}

And pass a NutchIndexAction instance to the record writer from IndexerMapReduce? If so, what would be the appropriate location for such a class?
[jira] [Commented] (NUTCH-1052) Multiple deletes of the same URL using SolrClean
[ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108701#comment-13108701 ]

Julien Nioche commented on NUTCH-1052:
--------------------------------------

Yep, that's the idea. The class will have to be Writable and should live in the same place as NutchDocument, i.e. org.apache.nutch.indexer.
[jira] [Commented] (NUTCH-1052) Multiple deletes of the same URL using SolrClean
[ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108731#comment-13108731 ]

Markus Jelsma commented on NUTCH-1052:
--------------------------------------

I see. I did a quick modification and came up with this (ditched the enum and used static final byte instead):

{code}
package org.apache.nutch.indexer;

class NutchIndexAction {
  public static final byte ADD = 0;
  public static final byte DELETE = 1;

  public NutchDocument doc = null;
  public byte action = 0;

  public NutchIndexAction(NutchDocument doc, byte action) {
    this.doc = doc;
    this.action = action;
  }
}
{code}

All references to NutchDocument in IndexerMapReduce and IndexerOutputFormat have been replaced with the new NutchIndexAction. It compiles and runs as expected when running locally, without implementing Writable. I also moved the config param from SolrConstants to IndexerMapReduce so that IndexerMapReduce doesn't rely on the indexing backend for getting its param.

Julien, will it break on Hadoop without implementing Writable? If, as you say, I have to implement it, can you give a small example? I assume I have to write and read the class's attributes in order. Thanks again!
[jira] [Commented] (NUTCH-1078) Upgrade all instances of commons logging to slf4j (with log4j backend)
[ https://issues.apache.org/jira/browse/NUTCH-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108751#comment-13108751 ]

Lewis John McGibbney commented on NUTCH-1078:
---------------------------------------------

Hi Sami, please see NUTCH-1091, which I opened. I think this is the next move to drop it completely; however, there might be a couple of issues here and there.

> Upgrade all instances of commons logging to slf4j (with log4j backend)
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-1078
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1078
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.4
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 1.4
>         Attachments: NUTCH-1078-branch-1.4-20110816.patch, NUTCH-1078-branch-1.4-20110824-v2.patch, NUTCH-1078-branch-1.4-20110911-v3.patch, NUTCH-1078-branch-1.4-20110916-v4.patch
>
> Whilst working on another issue, I noticed that some classes still import and use commons logging, for example HttpBase.java:
>
> {code}
> import java.util.*;
>
> // Commons Logging imports
> import org.apache.commons.logging.Log;
> import org.apache.commons.logging.LogFactory;
>
> // Nutch imports
> import org.apache.nutch.crawl.CrawlDatum;
> {code}
>
> At this stage I am unsure how many (if any) others still import and rely upon commons logging; however, they should be upgraded to slf4j for branch-1.4.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1052) Multiple deletes of the same URL using SolrClean
[ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108757#comment-13108757 ]

Julien Nioche commented on NUTCH-1052:
--------------------------------------

{quote}
Julien, will it break on Hadoop without implementing Writable? If I have to implement it, can you give a small example? I assume I have to write and read the class's attributes in order.
{quote}

Look at NutchDocument itself - it is a nice example of a Writable object.
[jira] [Commented] (NUTCH-1052) Multiple deletes of the same URL using SolrClean
[ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108763#comment-13108763 ]

Markus Jelsma commented on NUTCH-1052:
--------------------------------------

Thanks, I already did :) I now write the action as a single byte and use doc.write(out) to write the document itself. It seems to work at compile time and when running locally, although I think the write and readFields methods are never called when running locally; at least I don't get a runtime error. I'll try running it on the cluster tomorrow or so.
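The byte-then-document serialization scheme described in the comment above can be sketched as a round-trip. This is a simplified, hypothetical illustration, not Nutch code: a plain String stands in for NutchDocument (whose own write/readFields the real class would delegate to), and the Hadoop Writable interface is only mirrored in method shape, using the java.io DataOutput/DataInput types it builds on.

```java
import java.io.*;

// Hypothetical stand-in for the NutchIndexAction discussed above. The real
// class would implement org.apache.hadoop.io.Writable and hold a
// NutchDocument; here a String replaces the document so the example runs
// without Nutch or Hadoop on the classpath.
class NutchIndexActionSketch {
    public static final byte ADD = 0;
    public static final byte DELETE = 1;

    public String doc;   // stand-in for NutchDocument
    public byte action;

    public NutchIndexActionSketch() {}  // Writables need a no-arg constructor

    public NutchIndexActionSketch(String doc, byte action) {
        this.doc = doc;
        this.action = action;
    }

    // Mirrors Writable.write(DataOutput): action byte first, then the document
    public void write(DataOutput out) throws IOException {
        out.writeByte(action);
        out.writeUTF(doc);   // real code would call doc.write(out)
    }

    // Mirrors Writable.readFields(DataInput): read back in the same order
    public void readFields(DataInput in) throws IOException {
        action = in.readByte();
        doc = in.readUTF();  // real code would call doc.readFields(in)
    }

    public static void main(String[] args) throws IOException {
        // Round-trip: serialize, then deserialize into a fresh instance
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        NutchIndexActionSketch a =
            new NutchIndexActionSketch("http://example.org/", DELETE);
        a.write(new DataOutputStream(bos));

        NutchIndexActionSketch b = new NutchIndexActionSketch();
        b.readFields(new DataInputStream(
            new ByteArrayInputStream(bos.toByteArray())));
        System.out.println(b.action == DELETE && "http://example.org/".equals(b.doc));
    }
}
```

Writing the fixed-size action byte before the variable-size document keeps the read side trivial: the deserializer always knows the first byte is the action, and everything after it belongs to the document.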
[jira] [Commented] (NUTCH-1078) Upgrade all instances of commons logging to slf4j (with log4j backend)
[ https://issues.apache.org/jira/browse/NUTCH-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108975#comment-13108975 ]

Sami Siren commented on NUTCH-1078:
-----------------------------------

Oh, ok. I didn't realize there was another issue open about removing those.