"Again, this procedure does NOT work when using HDFS - you won't even see the partial output (without some serious hacking)"
Got it ! "You can simply set the fetcher.parsing config option to false." Found it ! Thanks for the help 2010/5/3 Andrzej Bialecki <a...@getopt.org> > On 2010-05-03 22:58, Emmanuel de Castro Santana wrote: > > "The first method of recovering that is mentioned there works only > > (sometimes) for crawls performed using LocalJobTracker and local file > > system. It does not work in any other case." > > > > if I stop the crawling process, take the crawled content from the dfs > into > > my local disk, do the fix and then put it back into hdfs, would it work ? > Or > > would there be a problem about dfs replication of the new files ? > > Again, this procedure does NOT work when using HDFS - you won't even see > the partial output (without some serious hacking). > > > > > "Yes, but if the parsing fails you still have the downloaded content, > > which you can re-parse again after you fixed the config or the code..." > > > > Interesting ... I did not see any option like a -noParsing in the > "bin/nutch > > crawl" command, that means I will have to code my own .sh for crawling, > one > > that uses the -noparsing option of the fetcher right ? > > You can simply set the fetcher.parsing config option to false. > > > -- > Best regards, > Andrzej Bialecki <>< > ___. ___ ___ ___ _ _ __________________________________ > [__ || __|__/|__||\/| Information Retrieval, Semantic Web > ___|||__|| \| || | Embedded Unix, System Integration > http://www.sigram.com Contact: info at sigram dot com > > -- Emmanuel de Castro Santana
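
[Editor's note: the config change discussed above would typically go into conf/nutch-site.xml, which overrides nutch-default.xml. A minimal sketch follows, using the property name exactly as quoted in the thread; note that depending on the Nutch version the property may instead be named fetcher.parse, so check the nutch-default.xml shipped with your release.]

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: site-specific overrides of nutch-default.xml.
     Property name taken from the thread above; some Nutch releases use
     "fetcher.parse" instead - verify against your nutch-default.xml. -->
<configuration>
  <property>
    <name>fetcher.parsing</name>
    <value>false</value>
    <description>If false, the fetcher only downloads content and does not
    parse it; parsing can then be run later as a separate step, so a parser
    failure does not cost you the downloaded data.</description>
  </property>
</configuration>
```

With parsing disabled at fetch time, the segment can be parsed afterwards in a separate pass (e.g. with the parse job), and re-parsed again if the parser config or code had to be fixed.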