Re: [VOTE] Move 2.0 out of trunk

2011-09-20 Thread Andrzej Bialecki

On 18/09/2011 02:21, Julien Nioche wrote:

Hi,

Following the discussions [1] on the dev list about the future of Nutch
2.0, I would like to call for a vote on moving Nutch 2.0 from the trunk
to a separate branch, promoting 1.4 to trunk, and considering 2.0
unmaintained. The arguments for and against can be found in the thread I
mentioned.

The vote is open for the next 72 hours.

[ ] +1 : Shelve 2.0 and move 1.4 to trunk
[ ] 0 : No opinion
[ ] -1 : Bad idea. Please give justification.


+1 - at this time it's clear that 2.0 didn't pan out as we expected; we 
should restart from the 1.x codebase for a usable platform and continue 
the redesign from there.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] [Commented] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

2011-09-20 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108633#comment-13108633
 ] 

Julien Nioche commented on NUTCH-1052:
--

I like the original idea and agree that having to read/write the whole crawldb 
once more would be a pain for large crawls. This is a good example of what 2.0 
could add (or could have added, if you are pessimistic). 

I agree with your suggestion of an alternative to using null as the value: 
encode the action (add, delete) either as a complex object in the key or as 
part of the value. The latter makes more sense, as it is unlikely that we'd 
add AND delete the same document in the same batch. Could you include that in 
your patch?

 Multiple deletes of the same URL using SolrClean
 

 Key: NUTCH-1052
 URL: https://issues.apache.org/jira/browse/NUTCH-1052
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.3, 1.4
Reporter: Tim Pease
Assignee: Julien Nioche
 Fix For: 1.4, 2.0

 Attachments: NUTCH-1052-1.4-1.patch, NUTCH-1052-1.4-2.patch, 
 NUTCH-1052-1.4-3.patch


 The SolrClean class does not keep track of purged URLs, it only checks the 
 URL status for db_gone. When run multiple times the same list of URLs will 
 be deleted from Solr. For small, stable crawl databases this is not a 
 problem. For larger crawls this could be an issue. SolrClean will become an 
 expensive operation.
 One solution is to add a purged flag in the CrawlDatum metadata. SolrClean 
 would then check this flag in addition to the db_gone status before adding 
 the URL to the delete list.
 Another solution is to add a new state to the status field 
 db_gone_and_purged.
 Either way, the crawl DB will need to be updated after the Solr delete has 
 successfully occurred.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

2011-09-20 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108641#comment-13108641
 ] 

Markus Jelsma commented on NUTCH-1052:
--

Thanks for your comments! Just to make sure I understand you correctly: you 
agree we should use a container object that holds the NutchDocument and the 
action (ADD or DELETE) and pass it to RecordWriter.write(), instead of abusing 
NULL as an encoded delete action, as I do now?

Then I'd need to add a class somewhere, such as:

{code}
class NutchIndexAction {
  // whether the document should be added to or deleted from the index
  public static enum Action { ADD, DELETE }

  public NutchDocument doc;
  public Action action;
}
{code}

And pass a NutchIndexAction instance to the record writer from 
IndexerMapReduce? If so, what would be the appropriate location for such a 
class?





[jira] [Commented] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

2011-09-20 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108701#comment-13108701
 ] 

Julien Nioche commented on NUTCH-1052:
--

Yep, that's the idea.

The class will have to be Writable and should live in the same place as 
NutchDocument, i.e. org.apache.nutch.indexer.







[jira] [Commented] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

2011-09-20 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108731#comment-13108731
 ] 

Markus Jelsma commented on NUTCH-1052:
--

I see. I did a quick modification and came up with this (ditched the enum and 
used static final byte instead):

{code}
package org.apache.nutch.indexer;

class NutchIndexAction {

  // action to perform on the index for this document
  public static final byte ADD = 0;
  public static final byte DELETE = 1;

  public NutchDocument doc = null;
  public byte action = 0;

  public NutchIndexAction(NutchDocument doc, byte action) {
    this.doc = doc;
    this.action = action;
  }
}
{code}

All references to NutchDocument in IndexerMapReduce and IndexerOutputFormat 
have been replaced with the new NutchIndexAction. It compiles and runs as 
expected locally, without implementing Writable. I also moved the config param 
from SolrConstants to IndexerMapReduce so that IndexerMapReduce doesn't rely 
on the indexing backend for its param.
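
A call site in IndexerMapReduce's reducer would then presumably change along 
these lines (a hypothetical sketch, not a quote from the patch):

{code}
// before: the reducer emitted the NutchDocument directly
// output.collect(key, doc);

// after: wrap the document together with the action to perform on the index
output.collect(key, new NutchIndexAction(doc, NutchIndexAction.ADD));
{code}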

Julien, will it break on Hadoop without implementing Writable? As you say I 
have to implement it, can you give a small example? I assume I have to write 
and read the class' attributes in order.

Thanks again!






[jira] [Commented] (NUTCH-1078) Upgrade all instances of commons logging to slf4j (with log4j backend)

2011-09-20 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108751#comment-13108751
 ] 

Lewis John McGibbney commented on NUTCH-1078:
-

Hi Sami, please see NUTCH-1091, which I opened. I think the next move is to 
drop it completely, though there might be a couple of issues here and 
there. 

 Upgrade all instances of commons logging to slf4j (with log4j backend)
 --

 Key: NUTCH-1078
 URL: https://issues.apache.org/jira/browse/NUTCH-1078
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 1.4

 Attachments: NUTCH-1078-branch-1.4-20110816.patch, 
 NUTCH-1078-branch-1.4-20110824-v2.patch, 
 NUTCH-1078-branch-1.4-20110911-v3.patch, 
 NUTCH-1078-branch-1.4-20110916-v4.patch


 Whilst working on another issue, I noticed that some classes still import and 
 use commons logging for example HttpBase.java
 {code}
 import java.util.*;
 // Commons Logging imports
 import org.apache.commons.logging.Log;
 import org.apache.commons.logging.LogFactory;
 // Nutch imports
 import org.apache.nutch.crawl.CrawlDatum;
 {code}
 At this stage I am unsure how many others (if any) still import and rely 
 upon commons logging; however, they should be upgraded to slf4j for branch-1.4.
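 For reference, the slf4j equivalent of those imports would look along these 
 lines (a sketch; the actual logger field in HttpBase may be named differently):
 {code}
 // slf4j imports
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

 public class HttpBase {
   public static final Logger LOG = LoggerFactory.getLogger(HttpBase.class);
 }
 {code}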





[jira] [Commented] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

2011-09-20 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108757#comment-13108757
 ] 

Julien Nioche commented on NUTCH-1052:
--

{quote}
Julien, will it break on Hadoop without implementing Writable? As you say i 
have to implement it, can you give a small example? I assume i have to write 
and read the class' attributes in order.
{quote}

Look at NutchDocument itself - it is a nice example of a Writable object.
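
Following that pattern, the Writable implementation for NutchIndexAction might 
look roughly like this (a sketch under the byte-based variant above: the action 
is serialized as a single byte, and the document delegates to NutchDocument's 
own Writable methods):

{code}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class NutchIndexAction implements Writable {

  // fields and constructor as in the snippet above

  public void write(DataOutput out) throws IOException {
    out.writeByte(action);  // the action as a single byte
    doc.write(out);         // delegate to NutchDocument's own serialization
  }

  public void readFields(DataInput in) throws IOException {
    action = in.readByte();
    doc = new NutchDocument();
    doc.readFields(in);
  }
}
{code}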





[jira] [Commented] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

2011-09-20 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108763#comment-13108763
 ] 

Markus Jelsma commented on NUTCH-1052:
--

Thanks, I already did :) I now write the action as a single byte and use 
doc.write(out) to write the document itself. It seems to work at compile time 
and when running locally, although I think the write and readFields methods 
are never called when running locally; at least I don't get a runtime error.

I'll try running it on the cluster tomorrow or so.





[jira] [Commented] (NUTCH-1078) Upgrade all instances of commons logging to slf4j (with log4j backend)

2011-09-20 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108975#comment-13108975
 ] 

Sami Siren commented on NUTCH-1078:
---

Oh, ok. I didn't realize there was another issue open about removing those.
