[
https://issues.apache.org/jira/browse/NUTCH-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18060872#comment-18060872
]
ASF GitHub Bot commented on NUTCH-1732:
---------------------------------------
igiguere commented on PR #891:
URL: https://github.com/apache/nutch/pull/891#issuecomment-3954928830
> This was a really good refresher for me as well. It touches parts of the
> codebase I hadn't looked at for a while... but I must admit, it took me a
> while. I was scratching my head in the debugger for ages. @igiguere also,
> apologies my comment above was a bit short. No need to apologize at all. Solid
> peer review is the beauty of open source and definitely the part that's had the
> most impact on me personally. Please ping me if and when you want more peer
> review. Thank you.
@lewismc
I fixed `IndexerMapReduce`, so it is now properly configured.
I improved the tests for failed parsing in `TestCrawlDbStates` and
`TestIndexerMapReduce`.
The new test `TestFetchWithParseFailure` "should" test the fetcher, but,
strangely, it does not run to completion most of the time. It is disabled
while I think of a solution. Feel free to pitch in with ideas.
> IndexerMapReduce to delete explicitly not indexable documents
> -------------------------------------------------------------
>
> Key: NUTCH-1732
> URL: https://issues.apache.org/jira/browse/NUTCH-1732
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 1.8
> Reporter: Sebastian Nagel
> Priority: Critical
> Fix For: 1.23
>
>
> In a continuous crawl a previously successfully indexed document (identified
> by a URL) can become "not indexable" for a couple of reasons and must then
> be explicitly deleted from the index. Some cases are handled in IndexerMapReduce
> (duplicates, gone documents or redirects, cf. NUTCH-1139) but others are not:
> * failed to parse (but previously successfully parsed): e.g., the document
> became larger and is now truncated
> * rejected by indexing filter (but previously accepted)
> In both cases (maybe there are more) the document should be explicitly
> deleted (if {{-deleteGone}} is set). Note that this cannot be done in
> CleaningJob because data from segments is required.
> We should also update/add a description for {{-deleteGone}}: it does not only
> trigger deletion of gone documents but also of redirects and duplicates (and
> unparseable and skipped docs).
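
To make the intended behaviour concrete, here is a minimal, self-contained
sketch of the decision the issue describes. It is not the code from the PR and
does not use Nutch's actual classes or status constants: the `Outcome` and
`Action` enums, the `decide` method and the class name are hypothetical names
chosen for illustration only. The one assumption taken from the issue text is
that deletion is only triggered when `-deleteGone` is set.

```java
import java.util.EnumSet;
import java.util.Set;

/** Illustrative sketch only; all names here are hypothetical, not Nutch APIs. */
final class DeleteGoneSketch {

    /** Hypothetical outcomes for a URL seen again in a continuous crawl. */
    enum Outcome { INDEXABLE, GONE, REDIRECT, DUPLICATE, PARSE_FAILED, FILTER_REJECTED }

    /** What the indexer would do with the document. */
    enum Action { INDEX, DELETE, SKIP }

    /** Cases named in the issue as "not indexable" (assumed to be complete here). */
    private static final Set<Outcome> NOT_INDEXABLE = EnumSet.of(
            Outcome.GONE, Outcome.REDIRECT, Outcome.DUPLICATE,
            Outcome.PARSE_FAILED, Outcome.FILTER_REJECTED);

    /**
     * Decide what to do with a document.
     *
     * @param outcome    result of the latest fetch/parse/filter cycle
     * @param deleteGone whether the (real) -deleteGone option is set
     */
    static Action decide(Outcome outcome, boolean deleteGone) {
        if (outcome == Outcome.INDEXABLE) {
            return Action.INDEX;
        }
        // Without -deleteGone, a stale copy is left in the index untouched.
        if (deleteGone && NOT_INDEXABLE.contains(outcome)) {
            return Action.DELETE;
        }
        return Action.SKIP;
    }

    public static void main(String[] args) {
        for (Outcome o : Outcome.values()) {
            System.out.println(o + " -> " + decide(o, true));
        }
    }
}
```

In this sketch, `PARSE_FAILED` and `FILTER_REJECTED` are treated exactly like
the cases that are already handled (gone, redirect, duplicate), which is the
change the issue asks for.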