[ https://issues.apache.org/jira/browse/NUTCH-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391647#comment-14391647 ]
Sebastian Nagel commented on NUTCH-1771:
----------------------------------------
Hi [~chongli], the patch looks clean and extensible, just great. Thanks! What
about moving the code to a new class in o.a.n.segments? It will be useful (in a
more generic form) for other tools as well. The log message in case of a
skipped segment could be a warning.

Instead of deleting invalid segments, it's possible to ignore them. That's the
case if bin/crawl is repeatedly scheduled to run an incremental/continuous
crawl. If some job fails, bin/crawl exits. A potentially incomplete/corrupted
segment is never looked at again, so there's no problem for later runs of
bin/crawl. That's because only CrawlDb (and LinkDb/WebGraph) are used for
persistence in this work-flow; content persists only in Solr/ElasticSearch. It
would even be possible to delete a segment immediately at the end of each
cycle. If segments are kept and used later (reparsed, reindexed, mined for
data, etc.), it's necessary to delete or skip invalid ones. And yes, a tool
which automatically detects invalid segments would definitely be useful!
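
As a rough illustration of such a detection step, here is a minimal sketch
that lists the segments below a directory and skips incomplete ones with a
warning. It is not the attached patch and not existing Nutch code: the class
SegmentCheckSketch and its methods are hypothetical, it uses java.nio on a
local file system instead of the Hadoop FileSystem API, and the list of
subdirs treated as required for indexing is an assumption.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.logging.Logger;

/** Hypothetical helper (not part of Nutch): filters out incomplete segments. */
final class SegmentCheckSketch {

  private static final Logger LOG =
      Logger.getLogger(SegmentCheckSketch.class.getName());

  /** Assumption: the segment subdirs an indexing job needs to read. */
  static final List<String> REQUIRED =
      Arrays.asList("crawl_fetch", "crawl_parse", "parse_data", "parse_text");

  /** A segment counts as complete if every required subdir exists. */
  static boolean isComplete(Path segment) {
    for (String sub : REQUIRED) {
      if (!Files.isDirectory(segment.resolve(sub))) {
        return false;
      }
    }
    return true;
  }

  /** Returns the complete segments below segmentsDir, sorted by path. */
  static List<Path> completeSegments(Path segmentsDir) throws IOException {
    List<Path> result = new ArrayList<>();
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(segmentsDir)) {
      for (Path segment : stream) {
        if (!Files.isDirectory(segment)) {
          continue; // plain files are not segments
        }
        if (isComplete(segment)) {
          result.add(segment);
        } else {
          // Skipped segments are logged as a warning, not silently dropped.
          LOG.warning("Skipping incomplete segment: " + segment);
        }
      }
    }
    Collections.sort(result);
    return result;
  }

  public static void main(String[] args) throws IOException {
    for (Path segment : completeSegments(Paths.get(args[0]))) {
      System.out.println(segment);
    }
  }
}
```

Called with a segments directory as its only argument, the sketch prints the
segments that pass the check. A tool could hand that list to the indexer, or
delete the segments that were skipped.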

Making tools more robust by ignoring some segments does no harm. It's the
easier way: making the work-flow detect and delete invalid segments is a bigger
effort. Btw., updatedb and web graph already silently skip segments not
containing required subdirs. LinkDb/invertlinks exits with an exception, same as
IndexingJob. SegmentMerger is special in that it performs only a partial merge,
excluding a subdir from all segments if this subdir is missing in a single
segment.
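
The partial-merge behaviour described for SegmentMerger amounts to
intersecting the sets of subdirs present in each segment. The sketch below
shows only that idea and is not SegmentMerger's actual code: the class
CommonSubdirsSketch is hypothetical, and it uses java.nio on a local file
system rather than the Hadoop FileSystem API.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration (not Nutch code) of a partial merge's input set. */
final class CommonSubdirsSketch {

  /** The subdirs of a Nutch segment. */
  static final List<String> ALL = Arrays.asList(
      "content", "crawl_generate", "crawl_fetch",
      "crawl_parse", "parse_data", "parse_text");

  /** Returns the subdirs that exist in every one of the given segments. */
  static Set<String> commonSubdirs(List<Path> segments) {
    Set<String> common = new TreeSet<>(ALL);
    for (Path segment : segments) {
      // A subdir missing in a single segment is excluded for all segments.
      common.removeIf(sub -> !Files.isDirectory(segment.resolve(sub)));
    }
    return common;
  }

  public static void main(String[] args) {
    List<Path> segments = new ArrayList<>();
    for (String arg : args) {
      segments.add(Paths.get(arg));
    }
    System.out.println("Subdirs present in all segments: "
        + commonSubdirs(segments));
  }
}
```

Given several segment paths as arguments, it prints the subdirs that a merge
of this kind would still include.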
> Solrindex fails if a segment is corrupted or incomplete
> -------------------------------------------------------
>
> Key: NUTCH-1771
> URL: https://issues.apache.org/jira/browse/NUTCH-1771
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 1.8, 1.10
> Reporter: Diaa
> Priority: Minor
> Fix For: 1.11
>
>
> When using solrindex to index multiple segments via -dir segment,
> the indexing fails if one or more segments are corrupted/incomplete
> (generated but not fetched, for example).
> The failure is simply a java.io exception.
> Deleting the segment fixes the issue.
> The expected behavior should be one of the following:
> * skipping the segment and proceeding with others (while logging)
> * stopping the indexing and logging the failed segment