Markus Jelsma created NUTCH-2214:
------------------------------------
Summary: Index clean to be flexible on what it deletes
Key: NUTCH-2214
URL: https://issues.apache.org/jira/browse/NUTCH-2214
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.11
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.13
Nutch clean removes all useless records from the index. But if Nutch is
configured correctly (-deleteGone etc.), the only such records left in the
index should be duplicates, if any exist. On a large index, this could result
in Nutch sending millions of getById's to Solr for records that don't exist in
the first place.
This issue will make what gets deleted configurable, e.g. useless records
(404, 30x) or duplicates.
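
A minimal sketch of what such a configurable filter could look like, not the
actual patch. Everything in it is an assumption made for illustration: the
CleanPolicy class, the Status enum (a simplified stand-in for the record states
named above) and the three property names are hypothetical and do not exist in
Nutch. The defaults are assumed to be "true" so that an unconfigured clean
keeps deleting everything, as it does today.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names come from Nutch itself.
final class CleanPolicy {

    // Simplified stand-ins for the record states mentioned in the issue.
    enum Status { FETCHED, GONE, REDIRECT, DUPLICATE }

    // Assumed property names, invented for this example.
    static final String DELETE_GONE = "index.clean.delete.gone";
    static final String DELETE_REDIRECT = "index.clean.delete.redirect";
    static final String DELETE_DUPLICATE = "index.clean.delete.duplicate";

    // Builds the set of states the clean job should delete.
    // A missing property counts as "true" (assumed default).
    static Set<Status> fromConfig(Map<String, String> conf) {
        Set<Status> targets = EnumSet.noneOf(Status.class);
        if (enabled(conf, DELETE_GONE)) targets.add(Status.GONE);
        if (enabled(conf, DELETE_REDIRECT)) targets.add(Status.REDIRECT);
        if (enabled(conf, DELETE_DUPLICATE)) targets.add(Status.DUPLICATE);
        return targets;
    }

    private static boolean enabled(Map<String, String> conf, String key) {
        return Boolean.parseBoolean(conf.getOrDefault(key, "true"));
    }

    // True if a record in this state should trigger a delete request.
    static boolean shouldDelete(Status status, Set<Status> targets) {
        return targets.contains(status);
    }

    public static void main(String[] args) {
        // Index already kept clean by -deleteGone: only remove duplicates.
        Map<String, String> conf = Map.of(
            DELETE_GONE, "false",
            DELETE_REDIRECT, "false");
        Set<Status> targets = fromConfig(conf);
        for (Status s : Status.values()) {
            System.out.println(s + " -> delete: " + shouldDelete(s, targets));
        }
    }
}
```

With a setup like this, an installation that already runs with -deleteGone
could switch off the 404 and 30x deletes and send requests to Solr for
duplicates only.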