Hi, I think the simplest way to get the parsed text out of a segment and into a text file (Nutch stores the parse text in the segment, for example: crawl/segments/20080107120936/parse_text) is the dump option of the segment reader:

    bin/nutch readseg -dump crawl/segments/20080107120936 dump -nocontent -nofetch -nogenerate -noparse -noparsedata

This will store only the parsed text (recno/url/parsetext) from the web pages, but all in one file. If you need more control, look at the source of the segment reader: org.apache.nutch.segment.SegmentReader

Hope this helps,
Tomislav

On Tue, 2008-01-15 at 11:46 -0800, Morrowwind wrote:
> Hi,
>
> My project is about web page processing and I need to parse the web-pages to
> get all the plain text first.
>
> Now I have finished the crawling part using nutch, and I'm in trouble with
> the parsing part. I have my data in crawldb folder. How can I parse the
> plain text out of the web pages and store them in a .txt file?
>
> Could anyone give me a hint please.
>
> Thanks a lot.
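
P.S. Since the dump puts every page into a single file, here is a rough sketch of how it could be split into one .txt file per page. It is not part of Nutch. It assumes each record in the dump starts with a "Recno:: N" line, followed by a "URL:: ..." line and a "ParseText::" marker before the text; check your own dump file first, because the layout may differ between versions. The function names split_dump and write_records are made up for this example.

```python
import re
from pathlib import Path

# Assumed record header in the readseg dump, e.g. "Recno:: 0".
RECORD_START = re.compile(r"^Recno:: (\d+)\s*$")


def split_dump(dump_text):
    """Return a list of (recno, url, parse_text) tuples from dump text.

    Limitation: a page whose own text contains a line that looks like
    "Recno:: N" would be split incorrectly.
    """
    records = []
    recno, url, lines, in_text = None, None, [], False
    for line in dump_text.splitlines():
        match = RECORD_START.match(line)
        if match:
            # A new record begins; store the previous one, if any.
            if recno is not None:
                records.append((recno, url, "\n".join(lines).strip()))
            recno, url, lines, in_text = int(match.group(1)), None, [], False
        elif recno is None:
            continue  # skip anything before the first record
        elif not in_text and line.startswith("URL:: "):
            url = line[len("URL:: "):].strip()
        elif not in_text and line.strip() == "ParseText::":
            in_text = True  # assumed marker; the page text follows
        elif in_text:
            lines.append(line)
    if recno is not None:
        records.append((recno, url, "\n".join(lines).strip()))
    return records


def write_records(records, out_dir):
    """Write each record to <out_dir>/<recno>.txt, with the URL on the first line."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for recno, url, text in records:
        target = out / f"{recno:06d}.txt"
        target.write_text(f"{url}\n\n{text}\n", encoding="utf-8")
```

To use it, read the dump file into a string (with the command above it should end up inside the "dump" output directory, but check the exact file name on your system), then call write_records(split_dump(text), "pages"), where "pages" is whatever output folder you like.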
