Thanks, Tomislav! Your reply is a big help!


Tomislav Poljak wrote:
> 
> Hi,
> I think the simplest way to get the parsed text out of a segment and
> into a text file is the dump option of the segment reader (Nutch stores
> the parse text inside the segment, for example in
> crawl/segments/20080107120936/parse_text):
> 
> bin/nutch readseg -dump crawl/segments/20080107120936 dump -nocontent -nofetch -nogenerate -noparse -noparsedata
> 
> This will store only the parsed text (recno/url/parsetext) of the web
> pages, but all in one file. If you need more control, look at the source
> of the segment reader: org.apache.nutch.segment.SegmentReader
> 
> Hope this helps,
> 
> Tomislav
> 
> 
> On Tue, 2008-01-15 at 11:46 -0800, Morrowwind wrote:
>> Hi,
>> 
>> My project is about web page processing, and I need to parse the web
>> pages to get all the plain text first.
>> 
>> I have finished the crawling part using Nutch, and now I'm having
>> trouble with the parsing part. I have my data in the crawldb folder.
>> How can I parse the plain text out of the web pages and store it in a
>> .txt file?
>> 
>> Could anyone give me a hint, please?
>> 
>> Thanks a lot.
>> 
>> 
> 
> 
> 
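
Since the readseg dump above puts every page into one file, a small
post-processing step can split it into one .txt file per page. The
following is only a minimal sketch in Python, not part of Nutch. It assumes
each record in the dump starts with a "Recno:: N" line, followed by a
"URL:: ..." line and a "ParseText::" line that introduces the page text;
check this against your own dump file before relying on it. The function
names split_dump and write_pages and the page_NNNNN.txt naming are
hypothetical, chosen for illustration.

```python
import re
from pathlib import Path

# ASSUMPTION: the dump uses this record layout (verify on your own output):
#   Recno:: 0
#   URL:: http://example.com/
#
#   ParseText::
#   ...page text...
RECORD_START = re.compile(r"^Recno:: (\d+)\s*$")


def split_dump(dump_text):
    """Parse dump text into a list of (recno, url, text) tuples."""
    records = []
    current = None
    for line in dump_text.splitlines():
        match = RECORD_START.match(line)
        if match:
            # A new record begins; everything until the next one belongs to it.
            current = {"recno": int(match.group(1)), "url": "",
                       "lines": [], "in_text": False}
            records.append(current)
        elif current is None:
            continue  # skip anything before the first record
        elif not current["in_text"] and line.startswith("URL:: "):
            current["url"] = line[len("URL:: "):].strip()
        elif not current["in_text"] and line.strip() == "ParseText::":
            current["in_text"] = True
        elif current["in_text"]:
            current["lines"].append(line)
    return [(r["recno"], r["url"], "\n".join(r["lines"]).strip())
            for r in records]


def write_pages(records, out_dir):
    """Write one text file per record; return the list of created paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for recno, url, text in records:
        path = out / f"page_{recno:05d}.txt"
        # The URL goes on the first line so each file stays traceable.
        path.write_text(f"{url}\n\n{text}\n", encoding="utf-8")
        paths.append(path)
    return paths


# Example use (the path "dump/dump" is an assumption about where readseg
# writes its output; adjust it to your setup):
#   text = Path("dump/dump").read_text(encoding="utf-8", errors="replace")
#   write_pages(split_dump(text), "pages")
```

One limitation of this sketch: a page whose own text contains a line of the
form "Recno:: N" would be split incorrectly. For anything more robust,
reading the segment through org.apache.nutch.segment.SegmentReader, as
suggested above, is the safer route.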

