"The first method of recovering that is mentioned there works only (sometimes) for crawls performed using LocalJobTracker and local file system. It does not work in any other case."
If I stop the crawling process, copy the crawled content from the DFS to my local disk, do the fix and then put it back into HDFS, would it work? Or would there be a problem with DFS replication of the new files?

"Yes, but if the parsing fails you still have the downloaded content, which you can re-parse again after you fixed the config or the code..."

Interesting... I did not see any option like -noParsing in the "bin/nutch crawl" command. That means I will have to write my own .sh for crawling, one that uses the -noParsing option of the fetcher, right?

2010/5/3 Andrzej Bialecki <a...@getopt.org>

> On 2010-05-03 19:59, Emmanuel de Castro Santana wrote:
> > "Unfortunately, no. You should at least crawl without parsing, so that
> > when you download the content you can run the parsing separately, and
> > repeat it if it fails."
> >
> > I've just found this in the FAQ, can it be done?
> >
> > http://wiki.apache.org/nutch/FAQ#How_can_I_recover_an_aborted_fetch_process.3F
>
> The first method of recovering that is mentioned there works only
> (sometimes) for crawls performed using LocalJobTracker and local file
> system. It does not work in any other case.
>
> > By the way, about not parsing, isn't it necessary to parse the content
> > anyway in order to generate links for the next segment? If this is true,
> > one would have to run the parse separately, which would give the same
> > result.
>
> Yes, but if the parsing fails you still have the downloaded content,
> which you can re-parse again after you fixed the config or the code...
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com

--
Emmanuel de Castro Santana
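
P.S. For reference, a rough sketch of what the fetch part of such a .sh could look like, with fetching and parsing run as separate steps. This assumes the Nutch 1.x command syntax (fetch <segment> -noParsing, parse <segment>, updatedb <crawldb> <segment>); the function name fetch_then_parse, the NUTCH variable and the example paths are only illustrative and do not come from the thread.

```shell
#!/bin/sh
# Sketch only: fetch a segment without parsing, then parse it as a
# separate step, so a failed parse does not lose the downloaded content.
# Assumption: NUTCH points at the Nutch launcher script (default bin/nutch).
NUTCH="${NUTCH:-bin/nutch}"

# fetch_then_parse CRAWLDB SEGMENT  (hypothetical helper name)
fetch_then_parse() {
    crawldb="$1"
    segment="$2"

    # Download the content only; stop here if the fetch itself fails.
    $NUTCH fetch "$segment" -noParsing || return 1

    # Parse the already downloaded content as its own step.
    $NUTCH parse "$segment" || return 1

    # Feed the parsed outlinks back into the crawldb for the next round.
    $NUTCH updatedb "$crawldb" "$segment"
}

# Example call (paths are placeholders):
#   fetch_then_parse crawl/crawldb crawl/segments/20100503120000
```

If the parse step fails, the function returns before updatedb, and the fetched content stays in the segment, so only the parse step would need to be repeated after fixing the config or the code.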