Sebastian Nagel created NUTCH-1732:
--------------------------------------

             Summary: IndexerMapReduce to delete explicitly not indexable documents
                 Key: NUTCH-1732
                 URL: https://issues.apache.org/jira/browse/NUTCH-1732
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 1.8
            Reporter: Sebastian Nagel
             Fix For: 1.9

In a continuous crawl, a previously successfully indexed document (identified by its URL) can become "not indexable" for a couple of reasons and must then be explicitly deleted from the index. Some cases are handled in IndexerMapReduce (duplicates, gone documents or redirects, cf. NUTCH-1139) but others are not:
* failed to parse (but previously successfully parsed): e.g., the document became larger and is now truncated
* rejected by indexing filter (but previously accepted)

In both cases (maybe there are more) the document should be explicitly deleted (if {{-deleteGone}} is set). Note that this cannot be done in CleaningJob because data from segments is required.

We should also update/add a description for {{-deleteGone}}: it triggers deletion not only of gone documents but also of redirects and duplicates (and unparseable and skipped docs).
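
To make the proposed behaviour concrete, here is a minimal, standalone sketch of the decision described above. It is not Nutch code: the class {{DeleteDecisionSketch}}, the enums {{Outcome}} and {{Action}}, and the method {{decide}} are hypothetical names chosen for illustration, and the real IndexerMapReduce works on segment and CrawlDb data rather than on a single enum value. The sketch only assumes what the description states: with {{-deleteGone}} set, gone documents, redirects and duplicates are deleted already, and documents that fail to parse or are rejected by an indexing filter should be deleted as well.

```java
import java.util.EnumSet;
import java.util.Set;

public class DeleteDecisionSketch {

    /** Hypothetical per-URL outcomes; not part of the Nutch API. */
    enum Outcome { INDEXABLE, GONE, REDIRECT, DUPLICATE, PARSE_FAILED, REJECTED_BY_FILTER }

    /** What the indexer would do with the URL. */
    enum Action { INDEX, DELETE, SKIP }

    // Cases the description says are already handled (cf. NUTCH-1139).
    static final Set<Outcome> ALREADY_HANDLED =
            EnumSet.of(Outcome.GONE, Outcome.REDIRECT, Outcome.DUPLICATE);

    // Cases this issue proposes to handle in addition.
    static final Set<Outcome> PROPOSED =
            EnumSet.of(Outcome.PARSE_FAILED, Outcome.REJECTED_BY_FILTER);

    /**
     * Decide the action for one URL. The deleteGone parameter stands in
     * for the -deleteGone option: without it, non-indexable documents are
     * only skipped, so a stale copy may remain in the index.
     */
    static Action decide(Outcome outcome, boolean deleteGone) {
        if (outcome == Outcome.INDEXABLE) {
            return Action.INDEX;
        }
        boolean deletable = ALREADY_HANDLED.contains(outcome) || PROPOSED.contains(outcome);
        return (deleteGone && deletable) ? Action.DELETE : Action.SKIP;
    }

    public static void main(String[] args) {
        // Print the decision table for both settings of the flag.
        for (Outcome o : Outcome.values()) {
            System.out.println(o + ": deleteGone=true -> " + decide(o, true)
                    + ", deleteGone=false -> " + decide(o, false));
        }
    }
}
```

The two sets are kept separate only to show which cases are existing behaviour and which are proposed here; an actual patch would more likely emit the delete action at the points where IndexerMapReduce currently drops the document.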