[ 
https://issues.apache.org/jira/browse/NUTCH-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391647#comment-14391647
 ] 

Sebastian Nagel commented on NUTCH-1771:
----------------------------------------

Hi [~chongli], the patch looks clean and extensible, just great. Thanks! What 
about moving the code to a new class in o.a.n.segments? It would be useful (in 
a more generic form) for other tools as well. The message for a skipped 
segment could be logged at warning level.

Instead of deleting invalid segments, it's possible to ignore them. That's the 
case if bin/crawl is repeatedly scheduled to run an incremental/continuous 
crawl. If some job fails, bin/crawl exits. A potentially incomplete/corrupted 
segment is never looked at again, so there's no problem for later runs of 
bin/crawl. That's because only CrawlDb (and LinkDb/WebGraph) are used for 
persistence in this workflow; content persists only in Solr/ElasticSearch. It 
would even be possible to delete a segment immediately at the end of each 
cycle. If segments are kept and used later (reparsed, reindexed, mined for 
data, etc.), it's necessary to delete or skip invalid ones. And yes, a tool 
which automatically detects invalid segments would definitely be useful!

Making tools more robust by ignoring some segments does no harm. It's the 
easier way: making the workflow detect and delete invalid segments is a bigger 
effort. Btw., updatedb and web graph already silently skip segments not 
containing the required subdirs. LinkDb/invertlinks exits with an exception, 
as does IndexingJob. SegmentMerger is special: it performs only a partial 
merge, excluding a subdir from all segments if that subdir is missing in a 
single segment.
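
To illustrate the skip-and-warn behaviour discussed above, here is a minimal 
sketch. It is not the attached patch: the class name SegmentCheckSketch and 
its methods are hypothetical, it uses java.nio on a local directory instead of 
the Hadoop FileSystem API that Nutch tools actually work with, and the list of 
subdirs required for indexing is an assumption.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Nutch): keeps only complete segments. */
class SegmentCheckSketch {

  // Assumption: the subdirs an indexing job needs in every segment.
  static final String[] REQUIRED_FOR_INDEXING = {
      "crawl_fetch", "crawl_parse", "parse_data", "parse_text"};

  /** Returns true if the segment contains every required subdir. */
  static boolean isComplete(Path segment, String... requiredSubdirs) {
    for (String subdir : requiredSubdirs) {
      if (!Files.isDirectory(segment.resolve(subdir))) {
        return false;
      }
    }
    return true;
  }

  /** Lists complete segments under segmentsDir and warns about the rest. */
  static List<Path> usableSegments(Path segmentsDir, String... requiredSubdirs)
      throws IOException {
    List<Path> usable = new ArrayList<>();
    try (DirectoryStream<Path> segments = Files.newDirectoryStream(segmentsDir)) {
      for (Path segment : segments) {
        if (!Files.isDirectory(segment)) {
          continue; // plain files are not segments
        }
        if (isComplete(segment, requiredSubdirs)) {
          usable.add(segment);
        } else {
          // Skipped segments are reported as a warning instead of failing the job.
          System.err.println("WARN: skipping incomplete segment " + segment);
        }
      }
    }
    return usable;
  }
}
```

A tool would call usableSegments(dir, SegmentCheckSketch.REQUIRED_FOR_INDEXING) 
and pass only the returned paths on to the job. Taking the required subdirs as 
a parameter is what would let other tools (invertlinks, updatedb, ...) reuse 
the same check with their own lists.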

> Solrindex fails if a segment is corrupted or incomplete
> -------------------------------------------------------
>
>                 Key: NUTCH-1771
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1771
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.8, 1.10
>            Reporter: Diaa
>            Priority: Minor
>             Fix For: 1.11
>
>
> When using solrindex to index multiple segments via -dir segment,
> the indexing fails if one or more segments are corrupted/incomplete 
> (generated but not fetched for example)
> The failure is simply java.io exception.
> Deleting the segment fixes the issue.
> The expected behavior should be one of the following:
> * skipping the segment and proceeding with others (while logging)
> * stopping the indexing and logging the failed segment



