Thanks Tomislav! Your reply is a big help!
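
For anyone who needs one text file per page rather than the single dump file that the readseg command quoted below produces, here is a minimal Python sketch. It assumes each record in the dump file starts with a "Recno:: N" line, followed by a "URL:: ..." line and a "ParseText::" marker before the page text. That layout is an assumption, not something confirmed in this thread, so check it against your own dump first. The names split_dump and write_pages are made up for this illustration.

```python
import re
from pathlib import Path

# Assumed record layout of the dump file (verify against your own output):
#   Recno:: 0
#   URL:: http://example.com/
#
#   ParseText::
#   ...plain text of the page...
RECORD_START = re.compile(r"^Recno:: \d+[ \t]*$", re.MULTILINE)
URL_LINE = re.compile(r"^URL:: (.*)$", re.MULTILINE)


def split_dump(dump_text):
    """Return a list of (url, parse_text) pairs found in the dump text."""
    records = []
    # Everything before the first "Recno::" line is skipped by the [1:] slice.
    for chunk in RECORD_START.split(dump_text)[1:]:
        url_match = URL_LINE.search(chunk)
        if url_match is None:
            continue  # not a complete record
        _, marker, text = chunk.partition("ParseText::")
        if not marker:
            continue  # record carries no parsed text
        records.append((url_match.group(1).strip(), text.strip()))
    return records


def write_pages(dump_path, out_dir):
    """Write one numbered .txt file per record and return the record count."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    dump_text = Path(dump_path).read_text(encoding="utf-8", errors="replace")
    records = split_dump(dump_text)
    for i, (url, body) in enumerate(records):
        # The first line of each file is the source URL, then the text.
        (out / f"{i:05d}.txt").write_text(f"{url}\n\n{body}\n", encoding="utf-8")
    return len(records)
```

With the command from the quoted reply, the call would look something like write_pages("dump/dump", "pages"). The location of the dump file inside the output directory is also an assumption here.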

Tomislav Poljak wrote:
>
> Hi,
> I think the simplest way to get parsed text from segment (Nutch stores
> parse text in segment, for example :
> crawl/segments/20080107120936/parse_text) to text file is dump option of
> segment reader:
>
> bin/nutch readseg -dump crawl/segments/20080107120936 dump -nocontent
> -nofetch -nogenerate -noparse -noparsedata
>
> This will store only parsed text (recno/url/parsetext) from web pages
> (but all in one file). If you need more control look at the source of
> segment reader: org.apache.nutch.segment.SegmentReader
>
> Hope this helps,
>
> Tomislav
>
>
> On Tue, 2008-01-15 at 11:46 -0800, Morrowwind wrote:
>> Hi,
>>
>> My project is about web page processing and I need to parse the web-pages to
>> get all the plain text first.
>>
>> Now I have finished the crawling part using nutch, and I'm in trouble with
>> the parsing part. I have my data in crawldb folder. How can I parse the
>> plain text out of the web pages and store them in a .txt file?
>>
>> Could anyone give me a hint please.
>>
>> Thanks a lot.
>>
>

--
View this message in context: http://www.nabble.com/How-to-use-Nutch-to-parse-Web-pages%21-tp14845212p14929821.html
Sent from the Nutch - User mailing list archive at Nabble.com.
