BELLINI ADAM wrote:
me again, I forgot to tell you the easiest way... once the crawl is finished you can dump the whole db (it contains all the links to your html pages) into a text file: ./bin/nutch readdb crawl_folder/crawldb/ -dump DBtextFile and you can perform the wget on this db and archive the files
I'd argue against this advice. The goal here is to obtain the HTML pages. If you have already crawled them, why fetch them again? You already have their content locally.
However, page content is NOT stored in the crawldb; it's stored in segments. So you need to dump the content from the segments, not the content of the crawldb.
The command 'bin/nutch readseg -dump <segmentName> <output>' should do the trick.
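To cover a whole crawl, you can run that command over every segment directory in a loop. A minimal sketch, assuming your crawl lives in a directory named "crawl" and that "dump" is a writable output location (both names are illustrative, not fixed by Nutch):

```shell
# Dump the fetched page content from every segment of the crawl.
# "crawl" and "dump" are assumed directory names; adjust to your setup.
for seg in crawl/segments/*; do
  # Skip if the glob matched nothing (no segments yet).
  [ -d "$seg" ] || continue
  # readseg -dump writes a plain-text dump of the segment
  # into a separate output directory per segment.
  bin/nutch readseg -dump "$seg" "dump/$(basename "$seg")"
done
```

Each segment is typically named by its fetch timestamp, so the per-segment output directories stay distinct.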
--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com