Chong Li created NUTCH-1978:
-------------------------------

             Summary: solrindex will fail when indexing corrupted segments
                 Key: NUTCH-1978
                 URL: https://issues.apache.org/jira/browse/NUTCH-1978
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 1.10
            Reporter: Chong Li
            Priority: Minor
             Fix For: 1.10


This is the same issue as NUTCH-1771, but it seems the bug will appear in most 
versions, since none of them have code to handle corrupted 
segments.

In NUTCH-1771, people pointed out that it would be very hard to handle this in 
the Hadoop layer, and that the program should skip corrupted segments instead of 
terminating. By corrupted segments I mean segments that may have only just been 
generated and do not have the content yet.

So my initial idea is to check whether the segment folder is valid before passing 
the segment to the Hadoop job. If the segment is not valid, we can simply 
skip it. One check is that the segment folder contains exactly 6 
subdirectories, as it should. The other approach would be to look at all 
six subdirectories and verify that they are exactly the six directories that 
should appear.
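
A rough sketch of the second check, to illustrate the idea only. It uses 
java.nio.file on a local filesystem so that it is self-contained; a real patch 
would presumably go through the Hadoop FileSystem API instead. The class and 
method names (SegmentChecker, isValidSegment, filterValid) are made up for this 
example and are not existing Nutch code. The six names are the standard 
subdirectories of a Nutch 1.x segment.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Nutch.
class SegmentChecker {

    // The six subdirectories of a fully processed Nutch 1.x segment.
    static final Set<String> EXPECTED = Set.of(
        "content", "crawl_fetch", "crawl_generate",
        "crawl_parse", "parse_data", "parse_text");

    // True only if the segment's subdirectories are exactly the expected six.
    static boolean isValidSegment(Path segment) throws IOException {
        if (!Files.isDirectory(segment)) {
            return false;
        }
        try (Stream<Path> children = Files.list(segment)) {
            Set<String> dirs = children
                .filter(Files::isDirectory)
                .map(p -> p.getFileName().toString())
                .collect(Collectors.toSet());
            return dirs.equals(EXPECTED);
        }
    }

    // Keeps the valid segments and skips the rest instead of failing.
    static List<Path> filterValid(List<Path> segments) throws IOException {
        List<Path> valid = new ArrayList<>();
        for (Path segment : segments) {
            if (isValidSegment(segment)) {
                valid.add(segment);
            } else {
                System.err.println("Skipping invalid segment: " + segment);
            }
        }
        return valid;
    }
}
```

Comparing the set of names covers both approaches at once: a segment that has 
only been generated (just crawl_generate) fails, and so does one with six 
subdirectories that carry the wrong names.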



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
