On 9/18/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Tim Gautier wrote:
>> I ran a fetch on a fetch list of around 3 million URLs and it has
>> failed on a single reduce task. Is there any way to recover the data
>> that's been pulled down already? It's my understanding that the pages
>> have all been pulled down to disk at this point, and since it takes 3
>> days to pull them down, I'd really like to avoid doing it again.
> Did you use DFS, or did you run this on a single machine?

Tim Gautier wrote:
> I used DFS.
Then your data is probably lost, sorry. The only possible recovery path
requires shutting down the jobtracker and tasktrackers immediately and
then moving the map outputs to a different location. If you kept the
jobtracker and tasktrackers running, though, by now they have already
deleted the map outputs.
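For reference, if you did catch it in time, preserving the intermediate
data would look roughly like this, run on every tasktracker node. The
local directory depends on your hadoop.tmp.dir / mapred.local.dir
settings and on the Hadoop version, so the paths below are only
illustrative:

  # stop the MapReduce daemons before they clean up the job directories
  bin/stop-mapred.sh

  # copy the node's local map output area somewhere safe (example path;
  # the default is ${hadoop.tmp.dir}/mapred/local)
  cp -r /tmp/hadoop-${USER}/mapred/local /some/backup/location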
If you did have the map outputs, you would then have to process them by
applying just the reduce part of the fetcher to them, which requires
writing some custom code.
All in all, I think it's easier to re-run the fetch job. By the way, are
you running the fetcher in parsing mode? That is by far the most common
cause of failed reduce tasks. I strongly recommend running the fetcher
with the -noParsing flag and running a parse job separately.
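For example (the segment path and thread count below are only
placeholders; check the usage output of bin/nutch fetch for your
version):

  # fetch without parsing
  bin/nutch fetch crawl/segments/20070918123456 -threads 10 -noParsing

  # then parse the fetched content in a separate job
  bin/nutch parse crawl/segments/20070918123456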
--
Best regards,
Andrzej Bialecki <><
  ___. ___ ___ ___ _ _   __________________________________
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  || |   Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com