Roannel Fernández Hernández commented on NUTCH-2616:

Sending deletions to all index writers seems to be the best option. This was 
the behavior before the Exchange component existed, right?

Passing documents with a single field might work, but then only the ID/URL 
field can be used in JEXL expressions (at least for exchange-jexl) if deletions 
are to be routed to the same index writers as the indexed documents, because 
it will be the only field available. For example, with {{<param name="expr" 
value="doc.getFieldValue('host')=='example.org'" />}}, all documents with 
host='example.org' will match when indexed, but delete actions won't match, 
even when id='http://example.org/', because the 'host' field doesn't exist in 
the document passed for deletion.
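
To illustrate the mismatch, here is a minimal Java sketch. It does not use the 
Nutch or JEXL APIs: a plain Map stands in for the document, and the class and 
method names (ExchangeMatchSketch, matchesHost, matchesId, indexedDoc, 
deletionDoc) are hypothetical, chosen only for this example.

```java
import java.util.HashMap;
import java.util.Map;

class ExchangeMatchSketch {

  // Stand-in for the expression doc.getFieldValue('host')=='example.org'.
  static boolean matchesHost(Map<String, String> doc) {
    return "example.org".equals(doc.get("host"));
  }

  // Stand-in for an expression that relies only on the ID/URL field.
  static boolean matchesId(Map<String, String> doc) {
    String id = doc.get("id");
    return id != null && id.startsWith("http://example.org/");
  }

  // A document as seen when indexing: all fields are present.
  static Map<String, String> indexedDoc() {
    Map<String, String> doc = new HashMap<>();
    doc.put("id", "http://example.org/");
    doc.put("host", "example.org");
    return doc;
  }

  // A document as it would be passed for a deletion: only the ID/URL field.
  static Map<String, String> deletionDoc() {
    Map<String, String> doc = new HashMap<>();
    doc.put("id", "http://example.org/");
    return doc;
  }

  public static void main(String[] args) {
    System.out.println("host rule, index:  " + matchesHost(indexedDoc()));
    System.out.println("host rule, delete: " + matchesHost(deletionDoc()));
    System.out.println("id rule, index:    " + matchesId(indexedDoc()));
    System.out.println("id rule, delete:   " + matchesId(deletionDoc()));
  }
}
```

Only the ID-based rule gives the same answer for both documents, so it is the 
only kind of rule that would route a deletion to the writer that received the 
original document.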

Another option could be to pass the documents with a single field and modify 
the exchange component to execute different routines depending on the action 
to execute. The expression to be applied in each case would be in the 
exchanges.xml file as part of the configuration.
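
A rough Java sketch of this idea follows. It is not the existing Exchange 
plugin interface: the class ActionAwareExchangeSketch, the Action enum and the 
match method are hypothetical, and the two predicates stand in for expressions 
that would be read from exchanges.xml.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.Predicate;

class ActionAwareExchangeSketch {

  // Hypothetical action types; not taken from the Nutch code base.
  enum Action { INDEX, DELETE }

  // One rule per action, standing in for per-action configured expressions.
  private final Map<Action, Predicate<Map<String, String>>> rules =
      new EnumMap<>(Action.class);

  ActionAwareExchangeSketch(Predicate<Map<String, String>> indexRule,
                            Predicate<Map<String, String>> deleteRule) {
    rules.put(Action.INDEX, indexRule);
    rules.put(Action.DELETE, deleteRule);
  }

  // Applies the rule configured for the given action to the document.
  boolean match(Action action, Map<String, String> doc) {
    return rules.get(action).test(doc);
  }

  // Example configuration: host-based rule for indexing, ID-based for deletes.
  static ActionAwareExchangeSketch example() {
    return new ActionAwareExchangeSketch(
        doc -> "example.org".equals(doc.get("host")),
        doc -> {
          String id = doc.get("id");
          return id != null && id.startsWith("http://example.org/");
        });
  }

  public static void main(String[] args) {
    ActionAwareExchangeSketch exchange = example();
    Map<String, String> full =
        Map.of("id", "http://example.org/", "host", "example.org");
    Map<String, String> idOnly = Map.of("id", "http://example.org/");
    System.out.println("index:  " + exchange.match(Action.INDEX, full));
    System.out.println("delete: " + exchange.match(Action.DELETE, idOnly));
  }
}
```

With separate rules, the indexing expression can keep using any field while 
the deletion expression is restricted to the ID/URL, at the cost of keeping 
the two expressions consistent by hand.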

> Review routing of deletions by Exchange component
> -------------------------------------------------
>                 Key: NUTCH-2616
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2616
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.15
> If the exchange component (NUTCH-2412) is enabled it must also route 
> deletions (404, etc.) to the configured index writers. Deletions are done 
> using the document ID (URL) alone; there is no NutchDocument (or it's null), 
> which needs to be handled to avoid an NPE in the Exchanges class or the 
> exchange plugins.
> NUTCH-2412 has added a new delete method in the IndexWriters class:
> - {{delete(String, NutchDocument)}} is now called from the indexing job 
> ({{bin/nutch index ... -deleteGone}}). However, the NutchDocument is always 
> null in case of deletions, see IndexerMapReduce.DELETE_ACTION.
> - {{delete(String)}} is now a no-op but is still called from CleaningJob 
> ({{bin/nutch clean ...}})
> We could ([~roannel], are there better options?)
> - send deletions to all index writers. This causes a certain overhead (could 
> be critical if deletion lists are long).
> - pass a document containing only a single field (the document ID / URL) to 
> the exchange component.
