On 9/18/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
Tim Gautier wrote:
I ran a fetch on a fetch list of around 3 million URLs and it failed
on a single reduce task.  Is there any way to recover the data that's
been pulled down already?  It's my understanding that the pages have
all been pulled down to disk at this point, and since it takes 3 days
to pull them down, I'd really like to avoid doing it again.


Did you use DFS, or did you run this on a single machine?

Tim Gautier wrote:
> I used DFS.
>

Then your data is probably lost - sorry. The only possible recovery path would have been to shut down the jobtracker and tasktrackers immediately and move the map outputs to a different place. If you kept them running, then by now they have already deleted the map outputs.

If you had the map outputs, then you would have to process them by applying just the reduce part of the fetcher (which requires writing some code).
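
For the record, a sketch of such a reduce-only job is below. It is untested, and it assumes the rescued map outputs (which normally live under each tasktracker's mapred.local.dir directories) are still readable as plain SequenceFiles of <Text, FetcherOutput> pairs - that's what the fetcher emits, if I recall the current code correctly. The reduce side is then just an identity pass feeding FetcherOutputFormat, which writes the crawl_fetch and content parts of the segment. The class name and paths are placeholders; adjust them to whatever you actually managed to save.

// Untested sketch: reduce-only pass over rescued fetcher map outputs.
// Assumes the saved files are plain SequenceFiles of <Text, FetcherOutput>.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.nutch.fetcher.FetcherOutput;
import org.apache.nutch.fetcher.FetcherOutputFormat;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.util.NutchJob;

public class ReduceSavedFetcherOutput {
  public static void main(String[] args) throws Exception {
    JobConf job = new NutchJob(NutchConfiguration.create());
    job.setJobName("reduce rescued fetcher output");
    job.setInputPath(new Path(args[0]));       // dir holding the saved map outputs
    job.setInputFormat(SequenceFileInputFormat.class);
    job.setMapperClass(IdentityMapper.class);  // pass records through unchanged
    job.setReducerClass(IdentityReducer.class);
    job.setOutputPath(new Path(args[1]));      // segment dir to (re)create
    job.setOutputFormat(FetcherOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(FetcherOutput.class);
    JobClient.runJob(job);
  }
}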

All in all, I think it's easier to re-run the fetch job. By the way, are you running the fetcher in parsing mode? This is by far the most common reason for failed reduce tasks. I strongly recommend running fetcher with -noParsing flag, and running a parse job separately.
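
I.e. something like this (segment paths depend on your layout, of course):

  bin/nutch fetch <segment_dir> -noParsing
  bin/nutch parse <segment_dir>

If the parse job then fails, you only re-run the parsing - the fetched content stays in the segment.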


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
