igiguere commented on PR #891: URL: https://github.com/apache/nutch/pull/891#issuecomment-3921493519
> Hi @igiguere I had a think about this over the weekend and will share my thoughts on two issues > (...) > ## What Needs to Be Fixed > > 1. The reducer must read `parser.delete.failed` in `setup()` and use it as an independent > guard for the parse-failure deletion block, instead of reusing the `delete` variable > (which is tied to `indexer.delete`). > (...) > > 2. `STATUS_PARSE_FAILED` (0x45) needs to be recognized in the `CrawlDatum` classification > loop (lines 318–333). > (...) > > 3. The test should pass a `CrawlDatum` with `STATUS_PARSE_FAILED` or > `STATUS_DB_PARSE_FAILED` instead of (or in addition to) relying on `ParseData` with > `STATUS_FAILURE`. > (...) > > 4. The test should verify that a DELETE action was actually emitted, not just that `doc` is > null. > (...) > > > # Additional comments > > 1. I think the new configuration property should be changed from `parser.delete.failed` to `parser.delete.failed.parse` but this is minor in comparison to the above. > > 2. Total side observation (unrelated to your PR) I noticed that the `indexer.delete` configuration property is absent from `nutch-default.xml`... is this intentional? Thanks for the review, @lewismc And sorry. Really. I feel like an idiot. This PR should not have been published in it's current state. I have to think back 10 years to remember any review on this scale. I don't know what happened. I should have seen at least 90% of all that myself. I'm fixing this. I hope my brain activates this time. Thanks for the tip about `CrawlDBTestUtil.getServer()`. I'll use that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]

