[ 
https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108731#comment-13108731
 ] 

Markus Jelsma commented on NUTCH-1052:
--------------------------------------

I see. I did a quick modification and came up with this (I ditched the enum and 
used static final byte constants instead):

{code}
package org.apache.nutch.indexer;

class NutchIndexAction {

  // Action to perform on the indexing backend.
  public static final byte ADD = 0;
  public static final byte DELETE = 1;

  // The document to add, or null when the action is a delete.
  public NutchDocument doc = null;
  public byte action = 0;

  public NutchIndexAction(NutchDocument doc, byte action) {
    this.doc = doc;
    this.action = action;
  }
}
{code}

All references to NutchDocument in IndexerMapReduce and IndexerOutputFormat 
have been replaced with the new NutchIndexAction. It compiles and runs as 
expected when run locally, without implementing Writable. I also moved the 
config param from SolrConstants to IndexerMapReduce so that IndexerMapReduce 
doesn't rely on the indexing backend for its param.

Julien, will it break on Hadoop without implementing Writable? Since you say I 
have to implement it, can you give a small example? I assume I have to write 
and read the class' attributes in the same order.
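Something like this, perhaps? A self-contained sketch of the write/readFields 
pattern I have in mind, with a String standing in for NutchDocument (in the 
real class I assume it would delegate to NutchDocument's own serialization):

```java
import java.io.*;

// Sketch of the Writable pattern for NutchIndexAction: fields are
// serialized and deserialized in the same fixed order. The doc field
// is a String here only so the example compiles on its own.
class NutchIndexActionSketch {

  public static final byte ADD = 0;
  public static final byte DELETE = 1;

  public String doc;   // placeholder for NutchDocument
  public byte action;

  // Hadoop calls this to serialize the record between map and reduce.
  public void write(DataOutput out) throws IOException {
    out.writeByte(action);
    out.writeUTF(doc);
  }

  // Hadoop calls this on a (possibly reused) instance to deserialize,
  // reading the fields back in exactly the order they were written.
  public void readFields(DataInput in) throws IOException {
    action = in.readByte();
    doc = in.readUTF();
  }

  public static void main(String[] args) throws IOException {
    NutchIndexActionSketch a = new NutchIndexActionSketch();
    a.doc = "http://example.org/";
    a.action = DELETE;

    // Round-trip through a byte buffer to check the symmetry.
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    a.write(new DataOutputStream(buf));

    NutchIndexActionSketch b = new NutchIndexActionSketch();
    b.readFields(new DataInputStream(
        new ByteArrayInputStream(buf.toByteArray())));

    System.out.println(b.action == DELETE && a.doc.equals(b.doc)); // prints "true"
  }
}
```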

Thanks again!


> Multiple deletes of the same URL using SolrClean
> ------------------------------------------------
>
>                 Key: NUTCH-1052
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1052
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.3, 1.4
>            Reporter: Tim Pease
>            Assignee: Julien Nioche
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1052-1.4-1.patch, NUTCH-1052-1.4-2.patch, 
> NUTCH-1052-1.4-3.patch
>
>
> The SolrClean class does not keep track of purged URLs, it only checks the 
> URL status for "db_gone". When run multiple times the same list of URLs will 
> be deleted from Solr. For small, stable crawl databases this is not a 
> problem. For larger crawls this could be an issue. SolrClean will become an 
> expensive operation.
>
> One solution is to add a "purged" flag in the CrawlDatum metadata. SolrClean 
> would then check this flag in addition to the "db_gone" status before adding 
> the URL to the delete list.
>
> Another solution is to add a new state to the status field 
> "db_gone_and_purged".
>
> Either way, the crawl DB will need to be updated after the Solr delete has 
> successfully occurred.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
