Re: How to use Nutch to parse Web-pages!

Tomislav Poljak Wed, 16 Jan 2008 02:16:03 -0800

Hi,
I think the simplest way to get the parsed text out of a segment and
into a text file is the dump option of the segment reader. Nutch stores
the parse text inside the segment, for example in
crawl/segments/20080107120936/parse_text, so you can run:


bin/nutch readseg -dump crawl/segments/20080107120936 dump -nocontent \
  -nofetch -nogenerate -noparse -noparsedata

This stores only the parsed text of the web pages (record number, URL
and parse text for each page), all in a single file. If you need more
control, look at the source of the segment reader:
org.apache.nutch.segment.SegmentReader
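
If one big file is not what you want, a small script can split the dump
into one .txt file per page. This is only a rough sketch: it assumes
each record in the dump starts with a "Recno:: N" line, followed by a
"URL:: ..." line and a "ParseText::" line with the page text below it,
so check your own dump file before relying on it. The function names and
the output naming scheme are made up for illustration and are not part
of Nutch.

```python
import re
import sys
from pathlib import Path


def split_dump(dump_text):
    """Return a list of (url, parse_text) pairs from a readseg text dump.

    Assumption: every record starts with a 'Recno:: N' line and contains
    a 'URL:: ...' line and a 'ParseText::' marker followed by the text.
    """
    records = []
    # Split on the 'Recno::' lines; drop whatever precedes the first record.
    chunks = re.split(r"^Recno:: \d+[ \t]*$", dump_text, flags=re.MULTILINE)[1:]
    for chunk in chunks:
        url_match = re.search(r"^URL:: (.+)$", chunk, flags=re.MULTILINE)
        if url_match is None:
            continue  # not a record in the assumed layout, skip it
        # Everything after the first 'ParseText::' marker is the page text.
        _, marker, text = chunk.partition("ParseText::")
        if marker:
            records.append((url_match.group(1).strip(), text.strip()))
    return records


def write_records(records, out_dir):
    """Write each record to out_dir/<index>.txt with the URL on the first line."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, (url, text) in enumerate(records):
        (out / f"{i}.txt").write_text(f"{url}\n\n{text}\n", encoding="utf-8")


# Usage: python split_dump.py <path-to-dump-file> <output-directory>
if __name__ == "__main__" and len(sys.argv) == 3:
    dump_text = Path(sys.argv[1]).read_text(encoding="utf-8", errors="replace")
    write_records(split_dump(dump_text), sys.argv[2])
```

The first argument is the text file that readseg wrote under the "dump"
output directory given in the command above. Files are numbered in the
order the records appear, so the URL is kept on the first line of each
file rather than in the file name.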

Hope this helps,

Tomislav


On Tue, 2008-01-15 at 11:46 -0800, Morrowwind wrote:
> Hi,
> 
> My project is about web page processing and I need to parse the web-pages to
> get all the plain text first. 
> 
> Now I have finished the crawling part using nutch, and I'm in trouble with
> the parsing part. I have my data in crawldb folder. How can I parse the
> plain text out of the web pages and store them in a .txt file? 
> 
> Could anyone give me a hint please. 
> 
> Thanks a lot.
> 
>
