[ 
https://issues.apache.org/jira/browse/NUTCH-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12480098
 ] 

Mathijs Homminga commented on NUTCH-451:
----------------------------------------

While fetching a segment with 4M documents, we ran out of disk space. 
We managed to recover most of our data using the LocalFetchRecover tool.

* First, the tool needed some modifications to work with Nutch 0.8.1 
(see the attached LocalFetchRecover-0.8.1.java)

* We copied the map task output file of the failed job to the tool's input 
directory (we had only one input file)
cp /tmp/hadoop/mapred/local/map_p7xlb2/part-0.out /tmp/recovery/input

* Next, we ran the LocalFetchRecover tool. After a few hours we got an 
EOFException because our input file had not been closed properly. 
LocalFetchRecover uses an IdentityMapper, so the output from its map tasks is 
exactly the same as the input, only split into more parts. Knowing this, we 
ran the tool again using the newly created map task output files as input. 

* Before we ran the tool again, we had to remove the last map output file, 
because it would otherwise have caused another EOFException. 

* Done! Our segment was created successfully in 
/tmp/recovery/output/20070228225939/
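
The staging between passes can be sketched as a small shell function. This 
is only a rough sketch and not part of the tool: stage_inputs and its 
arguments are hypothetical names, the part-NNNNN naming follows the issue 
description quoted below, and it assumes that the sorted file order matches 
the order in which the map outputs were written, so that the last file is 
the truncated one. Check that assumption before using the skip option.

```shell
#!/bin/sh
# Sketch: stage map output files as part-00000, part-00001, ... so they can
# be used as input for a LocalFetchRecover pass.
# The function name and its arguments are hypothetical.

# stage_inputs SRC_DIR DEST_DIR [SKIP_LAST]
# Copies every regular file in SRC_DIR into DEST_DIR under sequential
# part-NNNNN names and prints the number of files copied.
# If SKIP_LAST is "yes", the last file in sorted order is left out
# (assumption: that is the file that was not closed properly).
stage_inputs() {
    src=$1
    dest=$2
    skip_last=${3:-no}

    mkdir -p "$dest" || return 1

    # Count the regular files first so we know which one is last.
    total=0
    for f in "$src"/*; do
        if [ -f "$f" ]; then
            total=$((total + 1))
        fi
    done

    i=0
    for f in "$src"/*; do
        if [ ! -f "$f" ]; then
            continue
        fi
        # Stop before the final file when asked to skip it.
        if [ "$skip_last" = "yes" ] && [ "$i" -eq $((total - 1)) ]; then
            break
        fi
        cp "$f" "$dest/$(printf 'part-%05d' "$i")" || return 1
        i=$((i + 1))
    done

    echo "$i"
}
```

For a second pass one would call it as 
stage_inputs <map output dir of first pass> <new input dir> yes 
with the directories from your own setup. Copying instead of moving keeps 
the original map output untouched in case another pass is needed.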




> Tool to recover partial fetcher output
> --------------------------------------
>
>                 Key: NUTCH-451
>                 URL: https://issues.apache.org/jira/browse/NUTCH-451
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>             Fix For: 0.9.0
>
>         Attachments: LocalFetchRecover-0.8.1.java, LocalFetchRecover.java
>
>
> This class may help you to recover partial data from a failed Fetcher run. 
> NOTE 1: this works ONLY if you ran Fetcher using "local" file system, i.e. 
> you didn't use DFS - partial output to DFS is permanently lost if a process 
> fails to properly close the output streams.
> NOTE 2: if Fetcher was stopped abruptly (killed or crashed), then partial 
> SequenceFile-s will be corrupted at the end. This means that it won't be 
> possible to recover all data from them - most likely only the data up to the 
> last sync marker can be recovered.
> The recovery process requires some preparation: 
> * determine the map directories corresponding to the map task outputs of the 
> failed job. These map directories contain SequenceFile-s consisting of pairs 
> of <Text, FetcherOutput>, named e.g. part-0.out, or file.out, or spill0.out.
> * create the new input directory, let's say input/. Copy all SequenceFile-s 
> into this directory, renaming them sequentially like this: 
>   input/part-00000
>   input/part-00001
>   input/part-00002
>   input/part-00003
>   ...
>   
> * specify the "input" directory as the input to this tool. 
> If all goes well, a new segment will be created as a subdirectory of the 
> output dir.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
