[
https://issues.apache.org/jira/browse/NUTCH-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18059360#comment-18059360
]
ASF GitHub Bot commented on NUTCH-1732:
---------------------------------------
igiguere commented on PR #891:
URL: https://github.com/apache/nutch/pull/891#issuecomment-3921493519
> Hi @igiguere I had a think about this over the weekend and will share my
thoughts on two issues
> (...)
> ## What Needs to Be Fixed
>
> 1. The reducer must read `parser.delete.failed` in `setup()` and use
it as an independent
> guard for the parse-failure deletion block, instead of reusing the
`delete` variable
> (which is tied to `indexer.delete`).
> (...)
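The independent guard described in point 1 could be sketched roughly as follows. This is a minimal illustration with a hand-rolled stand-in for Hadoop's `Configuration`; the field name `deleteFailedParse` is illustrative, not the actual Nutch code:

```java
import java.util.HashMap;
import java.util.Map;

// Hand-rolled stand-in for Hadoop's Configuration, just enough for the sketch.
class Conf {
  private final Map<String, String> props = new HashMap<>();
  void set(String key, String value) { props.put(key, value); }
  boolean getBoolean(String key, boolean dflt) {
    String v = props.get(key);
    return v == null ? dflt : Boolean.parseBoolean(v);
  }
}

// Sketch of the reducer's setup(): parser.delete.failed is read into its own
// flag instead of reusing the flag that is tied to indexer.delete.
class ReducerSketch {
  boolean delete;             // driven by indexer.delete
  boolean deleteFailedParse;  // independent guard (illustrative field name)

  void setup(Conf conf) {
    delete = conf.getBoolean("indexer.delete", false);
    deleteFailedParse = conf.getBoolean("parser.delete.failed", false);
  }
}
```

With this split, enabling `parser.delete.failed` no longer depends on `indexer.delete` also being set.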
>
> 2. `STATUS_PARSE_FAILED` (0x45) needs to be recognized in the
`CrawlDatum` classification
> loop (lines 318–333).
> (...)
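The missing branch from point 2 might look like this. The constant value `0x45` is quoted from the review; the surrounding dispatch is illustrative, not the actual `IndexerMapReduce` classification loop:

```java
// Sketch of the classification step over CrawlDatum statuses: a parse-failed
// datum should be routed to the deletion path alongside gone/redirect cases.
class ClassifySketch {
  static final byte STATUS_PARSE_FAILED = 0x45; // value quoted in the review

  static boolean isParseFailure(byte status) {
    // The branch the review says is missing from the classification loop.
    return status == STATUS_PARSE_FAILED;
  }
}
```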
>
> 3. The test should pass a `CrawlDatum` with `STATUS_PARSE_FAILED` or
> `STATUS_DB_PARSE_FAILED` instead of (or in addition to) relying on
`ParseData` with
> `STATUS_FAILURE`.
> (...)
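The test-input change in point 3 could be sketched as below; `Datum` is a stand-in for `CrawlDatum`, reduced to the one field the test cares about:

```java
// Stand-in for CrawlDatum, reduced to the status byte the test cares about.
class Datum {
  static final byte STATUS_PARSE_FAILED = 0x45;
  final byte status;
  Datum(byte status) { this.status = status; }
}

// The test would feed the reducer a datum like this, instead of relying
// only on ParseData carrying STATUS_FAILURE.
class TestInputSketch {
  static Datum parseFailedInput() {
    return new Datum(Datum.STATUS_PARSE_FAILED);
  }
}
```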
>
> 4. The test should verify that a DELETE action was actually emitted,
not just that `doc` is
> null.
> (...)
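Point 4 asks for a stronger assertion; one way to sketch it is to collect the actions the reducer emits and check that a DELETE is among them, rather than only checking `doc == null`. `Action` here is a stand-in for the add/delete distinction in `NutchIndexAction`, not the real API:

```java
import java.util.ArrayList;
import java.util.List;

// Collects emitted actions so the test can assert a DELETE actually happened,
// instead of inferring it from a null document.
class ActionCollector {
  enum Action { ADD, DELETE }
  final List<Action> emitted = new ArrayList<>();

  void collect(Action a) { emitted.add(a); }

  boolean sawDelete() { return emitted.contains(Action.DELETE); }
}
```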
>
>
> # Additional comments
>
> 1. I think the new configuration property should be changed from
`parser.delete.failed` to `parser.delete.failed.parse` but this is minor in
comparison to the above.
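For reference, a `nutch-default.xml` entry for the property (shown with the suggested name; the default value and description text are assumptions, not taken from the PR) might look like:

```xml
<property>
  <name>parser.delete.failed.parse</name>
  <value>false</value>
  <description>If true and deletions are enabled, documents whose parse
  failed are removed from the index by the indexing job.</description>
</property>
```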
>
> 2. A total side observation (unrelated to your PR): I noticed that the
`indexer.delete` configuration property is absent from `nutch-default.xml`...
is this intentional?
Thanks for the review, @lewismc
And sorry.
Really. I feel like an idiot. This PR should not have been published in
its current state.
I have to think back 10 years to remember any review on this scale. I don't
know what happened. I should have seen at least 90% of all that myself.
I'm fixing this. I hope my brain activates this time.
Thanks for the tip about `CrawlDBTestUtil.getServer()`. I'll use that.
> IndexerMapReduce to delete explicitly not indexable documents
> -------------------------------------------------------------
>
> Key: NUTCH-1732
> URL: https://issues.apache.org/jira/browse/NUTCH-1732
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 1.8
> Reporter: Sebastian Nagel
> Priority: Critical
> Fix For: 1.23
>
>
> In a continuous crawl a previously successfully indexed document (identified
> by a URL) can become "not indexable" for a couple of reasons and must then be
> explicitly deleted from the index. Some cases are handled in IndexerMapReduce
> (duplicates, gone documents or redirects, cf. NUTCH-1139) but others are not:
> * failed to parse (but previously successfully parsed): e.g., the document
> became larger and is now truncated
> * rejected by indexing filter (but previously accepted)
> In both cases (maybe there are more) the document should be explicitly
> deleted (if {{-deleteGone}} is set). Note that this cannot be done in
> CleaningJob because data from segments is required.
> We should also update/add a description for {{-deleteGone}}: it does not only
> trigger deletion of gone documents but also of redirects and duplicates (and
> unparseable and skipped docs).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)