"The first method of recovering that is mentioned there works only
(sometimes) for crawls performed using LocalJobTracker and local file
system. It does not work in any other case."

If I stop the crawling process, copy the crawled content from DFS to my
local disk, apply the fix, and then put it back into HDFS, would it work? Or
would there be a problem with DFS replication of the new files?

"Yes, but if the parsing fails you still have the downloaded content,
which you can re-parse again after you fixed the config or the code..."

Interesting... I did not see any option like -noParsing in the "bin/nutch
crawl" command. Does that mean I will have to write my own .sh for crawling,
one that uses the -noParsing option of the fetcher?
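
A minimal sketch of what one round of such a script might look like, assuming
the Nutch 1.x command line (fetch with -noParsing, then parse and updatedb as
separate steps). The names NUTCH, CRAWL_DIR, DRY_RUN, run and crawl_round are
made up for this example, and the segment is expected to have been created by
"bin/nutch generate" beforehand. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only: one crawl round with fetching and parsing kept separate.
# NUTCH, CRAWL_DIR and DRY_RUN are hypothetical names for this example.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise run it and stop on failure.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@" || exit 1
    fi
}

# Fetch a generated segment without parsing, then parse it, then update
# the crawldb with the outcome of the round.
crawl_round() {
    segment="$1"
    run "$NUTCH" fetch "$segment" -noParsing
    run "$NUTCH" parse "$segment"
    run "$NUTCH" updatedb "$CRAWL_DIR/crawldb" "$segment"
}

# Run one round when a segment path is given on the command line.
if [ "$#" -gt 0 ]; then
    crawl_round "$1"
fi
```

Since the fetch step does no parsing here, a failure in the parse step should
leave the downloaded content in the segment, so the parse could be repeated
after the config or code is fixed.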


2010/5/3 Andrzej Bialecki <a...@getopt.org>

> On 2010-05-03 19:59, Emmanuel de Castro Santana wrote:
> > "Unfortunately, no. You should at least crawl without parsing, so that
> > when you download the content you can run the parsing separately, and
> > repeat it if it fails."
> >
> > I've just found this in the FAQ, can it be done ?
> >
> > http://wiki.apache.org/nutch/FAQ#How_can_I_recover_an_aborted_fetch_process.3F
>
> The first method of recovering that is mentioned there works only
> (sometimes) for crawls performed using LocalJobTracker and local file
> system. It does not work in any other case.
>
> >
> > By the way, about not parsing, isn't it necessary to parse the content
> > anyway in order to generate links for the next segment? If this is true,
> > one would have to run the parse separately, which would give the same
> > result.
>
> Yes, but if the parsing fails you still have the downloaded content,
> which you can re-parse again after you fixed the config or the code...
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


-- 
Emmanuel de Castro Santana
