BELLINI ADAM wrote:
me again, I forgot to tell you the easiest way... once the crawl is finished you can dump the whole db (it contains all the links to your html pages) into a text file: ./bin/nutch readdb crawl_folder/crawldb/ -dump DBtextFile and you can perform the wget on this db and archive the files
I'd argue against this advice. The goal here is to obtain the HTML pages. If you have already crawled them, why fetch them again? You already have their content locally.
However, page content is NOT stored in the crawldb; it's stored in segments. So you need to dump the content from the segments, not the content of the crawldb.
The command 'bin/nutch readseg -dump <segmentName> <output>' should do the trick.
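To cover a whole crawl, you can run that command over every segment directory in a loop. A minimal sketch, assuming your crawl lives in a directory named "crawl" and that "dump" is a writable output location (both names are illustrative, not fixed by Nutch):

```shell
# Dump the fetched page content from every segment of the crawl.
# "crawl" and "dump" are assumed directory names; adjust to your setup.
for seg in crawl/segments/*; do
  # Skip if the glob matched nothing (no segments yet).
  [ -d "$seg" ] || continue
  # readseg -dump writes a plain-text dump of the segment
  # into a separate output directory per segment.
  bin/nutch readseg -dump "$seg" "dump/$(basename "$seg")"
done
```

Each segment is typically named by its fetch timestamp, so the per-segment output directories stay distinct.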
--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com