BELLINI ADAM wrote:

Me again,

I forgot to tell you the easiest way...

Once the crawl is finished you can dump the whole db (it contains all the links
to your HTML pages) into a text file:

./bin/nutch readdb crawl_folder/crawldb/ -dump DBtextFile

and you can perform wget on this dump and archive the files.
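
In concrete terms, that suggestion amounts to something like the following (a rough sketch only; the part-00000 file name and the grep pattern are assumptions about how the readdb dump is laid out):

# pull the URLs out of the crawldb dump, then fetch and archive them with wget
grep -oE '^https?://[^[:space:]]+' DBtextFile/part-00000 > urls.txt
wget -x -P archived_pages -i urls.txt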

I'd argue against this advice. The goal here is to obtain the HTML pages, and if you have already crawled them, why fetch them again? You already have their content locally.

However, page content is NOT stored in the crawldb; it's stored in the segments. So you need to dump the content from the segments, not from the crawldb.

The command 'bin/nutch readseg -dump <segmentName> <output>' should do the trick.
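
For example, to dump every segment in one go you could loop over the segments directory. This is just a sketch: the crawl_folder/segments path and the -no* filter flags (which keep only the content directory) are assumptions and may vary between Nutch versions.

# dump the fetched content of each segment into its own output directory
for segment in crawl_folder/segments/*; do
  bin/nutch readseg -dump "$segment" "dump_$(basename "$segment")" \
    -nofetch -nogenerate -noparse -noparsedata -noparsetext
done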


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
