Sebastian Nagel created NUTCH-2616:
--------------------------------------

             Summary: Review routing of deletions by Exchange component
                 Key: NUTCH-2616
                 URL: https://issues.apache.org/jira/browse/NUTCH-2616
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 1.15
            Reporter: Sebastian Nagel
             Fix For: 1.15


If the exchange component (NUTCH-2412) is enabled, it must also route deletions 
(404, etc.) to the configured index writers. Deletions are done using only the 
document ID (URL): there is no NutchDocument (or it is null), which needs to be 
handled to avoid an NPE in the Exchanges class or the exchange plugins.

NUTCH-2412 has added a new delete method in the IndexWriters class:
- {{delete(String, NutchDocument)}} is now called from the indexing job 
({{bin/nutch index ... -deleteGone}}). However, the NutchDocument is always 
null in case of deletions, see IndexerMapReduce.DELETE_ACTION.
- {{delete(String)}} is now a no-op but is still called from CleaningJob 
({{bin/nutch clean ...}})
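
To illustrate the failure mode, here is a minimal, self-contained sketch. It 
does not use the real Nutch classes: SketchDoc and HostRule are hypothetical 
stand-ins for NutchDocument and an exchange rule, and the "host" field is only 
an assumed example. It shows how a rule that reads a document field throws an 
NPE when the document is null, and how a null check avoids it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for NutchDocument: just a bag of named fields.
class SketchDoc {
    final Map<String, String> fields = new HashMap<>();
}

// Hypothetical exchange rule: matches documents whose "host" field has a given value.
class HostRule {
    private final String host;

    HostRule(String host) {
        this.host = host;
    }

    // Unguarded variant: dereferences doc, so a deletion (doc == null) throws an NPE.
    boolean matchUnguarded(SketchDoc doc) {
        return host.equals(doc.fields.get("host"));
    }

    // Guarded variant: a null document never matches instead of throwing.
    boolean matchGuarded(SketchDoc doc) {
        return doc != null && host.equals(doc.fields.get("host"));
    }
}
```

Note that the guard only prevents the crash. With the guarded variant a 
deletion matches no rule, so it still has to be decided where the deletion is 
sent.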

We could ([~roannel], are there better options?)
- send deletions to all index writers. This causes a certain overhead (could be 
critical if deletion lists are long).
- pass a document containing only a single field (the document ID / URL) to the 
exchange component.
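
The two options could be sketched roughly as follows. This is not the Nutch 
API: DeletionRoutingSketch, its methods, the writer IDs and the use of a plain 
field map in place of NutchDocument are all assumptions made for illustration, 
as is the field name "id" for the document ID.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical router: maps writer IDs to rules over a document's fields.
class DeletionRoutingSketch {
    private final Map<String, Predicate<Map<String, String>>> rules = new LinkedHashMap<>();

    void addWriter(String writerId, Predicate<Map<String, String>> rule) {
        rules.put(writerId, rule);
    }

    // Option 1: ignore the rules and send the deletion to every index writer.
    List<String> routeToAll() {
        return new ArrayList<>(rules.keySet());
    }

    // Option 2: wrap the URL in a document with a single "id" field (assumed name)
    // and let the rules decide, exactly as they would for a regular document.
    List<String> routeById(String url) {
        Map<String, String> idOnlyDoc = new HashMap<>();
        idOnlyDoc.put("id", url);
        List<String> targets = new ArrayList<>();
        for (Map.Entry<String, Predicate<Map<String, String>>> e : rules.entrySet()) {
            if (e.getValue().test(idOnlyDoc)) {
                targets.add(e.getKey());
            }
        }
        return targets;
    }
}
```

In this sketch, a rule that looks at any field other than the ID cannot match 
the ID-only document, so the second option only reaches writers whose rules 
are based on the ID (URL). The first option reaches every writer at the cost 
of the overhead mentioned above.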



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
