Tool to recover partial fetcher output
--------------------------------------

                 Key: NUTCH-451
                 URL: https://issues.apache.org/jira/browse/NUTCH-451
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
    Affects Versions: 0.9.0
            Reporter: Andrzej Bialecki 
         Assigned To: Andrzej Bialecki 
             Fix For: 0.9.0
         Attachments: LocalFetchRecover.java

This class may help you to recover partial data from a failed Fetcher run. 

NOTE 1: this works ONLY if you ran Fetcher using the "local" file system, i.e. 
you didn't use DFS. Partial output to DFS is permanently lost if a process 
fails to properly close the output streams.

NOTE 2: if Fetcher was stopped abruptly (killed or crashed), then partial 
SequenceFile-s will be corrupted at the end. This means that it won't be 
possible to recover all data from them - most likely only the data up to the 
last sync marker can be recovered.

The recovery process requires some preparation: 

* determine the map directories corresponding to the map task outputs of the 
failed job. These map directories contain SequenceFile-s consisting of pairs of 
<Text, FetcherOutput>, named e.g. part-0.out, or file.out, or spill0.out.

* create a new input directory, let's say input/. Copy all SequenceFile-s 
into this directory, renaming them sequentially like this: 
  input/part-00000
  input/part-00001
  input/part-00002
  input/part-00003
  ...
  
* specify the "input" directory as the input to this tool. 

If all goes well, a new segment will be created as a subdirectory of the output 
dir.
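
The preparation steps above can be sketched in Java roughly as follows. This is 
only an illustration, not part of the attached LocalFetchRecover.java: the class 
name StageRecoveryInput and its methods are hypothetical, and it assumes that 
every regular file ending in ".out" under the given map directories is one of 
the partial SequenceFile-s (as in the part-0.out, file.out and spill0.out 
examples). The order in which files are numbered is simply the sorted path 
order.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: stages partial map outputs as input/part-NNNNN.
public class StageRecoveryInput {

    // Collect regular files ending in ".out" under the given map directories.
    // Assumption: the ".out" suffix identifies the partial SequenceFile-s.
    public static List<Path> findMapOutputs(List<Path> mapDirs) throws IOException {
        List<Path> found = new ArrayList<>();
        for (Path dir : mapDirs) {
            try (Stream<Path> walk = Files.walk(dir)) {
                found.addAll(walk
                        .filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString().endsWith(".out"))
                        .collect(Collectors.toList()));
            }
        }
        Collections.sort(found); // deterministic numbering
        return found;
    }

    // Copy each source file to inputDir/part-00000, part-00001, ...
    // Originals are left untouched; an existing target causes an exception.
    public static List<Path> stage(List<Path> sources, Path inputDir) throws IOException {
        Files.createDirectories(inputDir);
        List<Path> staged = new ArrayList<>();
        for (int i = 0; i < sources.size(); i++) {
            Path target = inputDir.resolve(String.format(Locale.ROOT, "part-%05d", i));
            Files.copy(sources.get(i), target);
            staged.add(target);
        }
        return staged;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: StageRecoveryInput <inputDir> <mapDir>...");
            return;
        }
        List<Path> mapDirs = new ArrayList<>();
        for (int i = 1; i < args.length; i++) {
            mapDirs.add(Paths.get(args[i]));
        }
        for (Path p : stage(findMapOutputs(mapDirs), Paths.get(args[0]))) {
            System.out.println(p);
        }
    }
}
```

Copying rather than moving keeps the original map outputs intact in case the 
recovery has to be repeated. The resulting input/ directory is what you would 
then pass to the recovery tool.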



