[
https://issues.apache.org/jira/browse/NUTCH-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389351#comment-14389351
]
Sebastian Nagel commented on NUTCH-1771:
----------------------------------------
From [~chongli] in NUTCH-1978:
{quote}
So my initial idea is to check whether the segment folder is valid before putting
the segment into the Hadoop job. If the segment is not valid, we can simply
skip it. We can check whether the segment folder contains exactly the six
subdirectories that should be there. The other approach would be to check all
six subdirectories and verify that they are exactly the six that should appear.
{quote}
Ok, this would be possible.
* We should check only the four directories required for indexing: crawl_fetch,
crawl_parse, parse_data, and parse_text. The content directory may legitimately be
missing if fetcher.store.content == false.
* If segments are really corrupted we need a more sophisticated check, but
integrity checks should be part of HDFS. Only in local mode can a crashed
generate/fetch/parse leave corrupted segments behind. However, any such check needs
to read the segment to checksum/validate it. That may cost a lot of IO and should
not be done by default.
Unfetched segments (containing only crawl_generate) should be no problem after
NUTCH-1829 (will be in 1.10) if the exit value of the generate job is properly
checked (done by bin/crawl). But agreed: filtering out those segments would
increase usability. This should be done only for the -dir option; calling index
with a single corrupted/incomplete segment should definitely cause an error.
Alternatively, it could be done by an extra SegmentFilter tool, which could then
also check for corrupted segments.
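The four-directory check described above could be sketched as follows. This is a hypothetical {{SegmentChecker}} helper (the class and method names are assumptions, not existing Nutch code) using java.nio.file for a local-mode segment; the real implementation would go through Hadoop's FileSystem API so it also works on HDFS:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SegmentChecker {

    // The four subdirectories the indexer actually reads. "content" is
    // deliberately not listed: it is only written when
    // fetcher.store.content == true and may legitimately be absent.
    static final String[] REQUIRED = {
        "crawl_fetch", "crawl_parse", "parse_data", "parse_text"
    };

    /**
     * Returns true if the segment directory contains all parts required
     * for indexing. Hypothetical helper for local-mode segments only;
     * it does not detect corrupted (as opposed to missing) parts.
     */
    public static boolean isIndexable(Path segment) {
        for (String part : REQUIRED) {
            if (!Files.isDirectory(segment.resolve(part))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path seg = Files.createTempDirectory("segment");
        // A freshly generated but unfetched segment has only crawl_generate.
        Files.createDirectory(seg.resolve("crawl_generate"));
        System.out.println(isIndexable(seg));   // false: skip this segment
        for (String part : REQUIRED) {
            Files.createDirectory(seg.resolve(part));
        }
        System.out.println(isIndexable(seg));   // true: ready for indexing
    }
}
```

With -dir, the caller would apply this filter to each segment and log the skipped ones; with an explicitly named single segment, a false result would instead raise an error, as argued above.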
> Solrindex fails if a segment is corrupted or incomplete
> -------------------------------------------------------
>
> Key: NUTCH-1771
> URL: https://issues.apache.org/jira/browse/NUTCH-1771
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 1.8
> Reporter: Diaa
> Priority: Minor
> Fix For: 1.11
>
>
> When using solrindex to index multiple segments via -dir segment,
> the indexing fails if one or more segments are corrupted/incomplete
> (for example, generated but not fetched).
> The failure is simply a java.io exception.
> Deleting the offending segment fixes the issue.
> The expected behavior should be one of the following:
> * skipping the segment and proceeding with others (while logging)
> * stopping the indexing and logging the failed segment
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)