[ 
https://issues.apache.org/jira/browse/NUTCH-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633370#action_12633370
 ] 

Andrzej Bialecki  commented on NUTCH-451:
-----------------------------------------

I'm closing this issue, as the tool is not general enough to be included in 
Nutch. The code stays here, so anyone can still use it.

> Tool to recover partial fetcher output
> --------------------------------------
>
>                 Key: NUTCH-451
>                 URL: https://issues.apache.org/jira/browse/NUTCH-451
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: LocalFetchRecover-0.8.1.java, LocalFetchRecover.java
>
>
> This class may help you to recover partial data from a failed Fetcher run. 
> NOTE 1: this works ONLY if you ran Fetcher using the "local" file system, i.e. 
> you didn't use DFS - partial output to DFS is permanently lost if a process 
> fails to properly close the output streams.
> NOTE 2: if Fetcher was stopped abruptly (killed or crashed), then partial 
> SequenceFile-s will be corrupted at the end. This means that it won't be 
> possible to recover all data from them - most likely only the data up to the 
> last sync marker can be recovered.
> The recovery process requires some preparation: 
> * determine the map directories corresponding to the map task outputs of the 
> failed job. These map directories contain SequenceFile-s consisting of pairs 
> of <Text, FetcherOutput>, named e.g. part-0.out, or file.out, or spill0.out.
> * create a new input directory, let's say input/. Copy all SequenceFile-s 
> into this directory, renaming them sequentially like this: 
>   input/part-00000
>   input/part-00001
>   input/part-00002
>   input/part-00003
>   ...
>   
> * specify the "input" directory as the input to this tool. 
> If all goes well, a new segment will be created as a subdirectory of the 
> output dir.
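
The preparation steps above (collect the map output files, copy them into
input/, rename them sequentially) could be scripted roughly as follows. This
is only a sketch and is not part of the attached tool: the function names
and the glob patterns are assumptions based on the example file names given
in the description (part-0.out, file.out, spill0.out), and you still have to
locate the map directories of the failed job yourself.

```python
import shutil
from pathlib import Path

# Assumed patterns, derived from the example names in the issue description.
# Adjust them to whatever your failed job actually left behind.
CANDIDATE_PATTERNS = ("part-*.out", "file.out", "spill*.out")


def collect_map_outputs(map_dirs):
    """Return the map output files found in map_dirs, in a stable order."""
    found = []
    for map_dir in map_dirs:
        map_dir = Path(map_dir)
        for pattern in CANDIDATE_PATTERNS:
            # sorted() keeps the numbering reproducible between runs
            found.extend(sorted(map_dir.glob(pattern)))
    return found


def stage_inputs(map_dirs, input_dir):
    """Copy map outputs into input_dir as part-00000, part-00001, ..."""
    input_dir = Path(input_dir)
    input_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    for index, src in enumerate(collect_map_outputs(map_dirs)):
        dst = input_dir / f"part-{index:05d}"
        # Copy rather than move, so the original partial output stays intact
        shutil.copyfile(src, dst)
        staged.append(dst)
    return staged
```

Calling stage_inputs() with the list of map directories and "input" as the
target leaves a directory that can be passed to the tool as its input. The
files are copied rather than moved so that a failed recovery attempt does
not damage the only copy of the partial output.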

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
