Yes, I'm running the fetch job in parsing mode. I've tried running it with the -noParsing option on a single machine, but I always ran into problems at the parse step. When I parse separately, every page throws an error because the content type is empty. I'm not sure what that's about. It obviously works for some people, but I've never had the parser correctly identify the content type of even a single page when I run it separately. I'll have to try it on DFS to see if it's any different.

On 9/18/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>
> > On 9/18/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> >> Tim Gautier wrote:
> >>> I ran a fetch on a fetch list of around 3 million urls and it has
> >>> failed on a single reduce task. Is there any way to recover the data
> >>> that's been pulled down already? It's my understanding that the pages
> >>> have all been pulled down to disk at this point and since it takes 3
> >>> days to pull them down, I'd really like to avoid doing it again.
> >>>
> >> Did you use DFS, or did you run this on a single machine?
> > Tim Gautier wrote:
> > I used DFS.
> >
>
> Then your data is probably lost - sorry. The only possible recovery path
> requires that you immediately shut down the jobtracker and tasktrackers,
> and then move map outputs to a different place. However, if you kept the
> jobtracker and tasktrackers running, then by now they've already deleted
> map outputs.
>
> If you had the map outputs, then you would have to process them by
> applying just the reduce part of the fetcher (which requires writing
> some code).
>
> All in all, I think it's easier to re-run the fetch job. By the way, are
> you running the fetcher in parsing mode? This is by far the most common
> reason for failed reduce tasks. I strongly recommend running fetcher
> with -noParsing flag, and running a parse job separately.
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
