Re: html parse text

Ned Rockson Thu, 13 Dec 2007 10:56:44 -0800

In any segment there will be a folder called parse_data, which holds {Text, ParseData} records. You can write a simple sequence file printer class to print all of the data for you using the Hadoop API.
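
A printer like that yields one (URL, value) record at a time; the plain text itself is normally kept in the sibling parse_text folder as {Text, ParseText} records. Once such records are in hand, splitting them into one file per URL could look roughly like the sketch below. It uses only the Java standard library and does not touch the segment. The class ParseTextSplitter, its methods, and the sample URLs are hypothetical names made up for illustration, not part of Nutch or Hadoop, and reading the records (for example with Hadoop's SequenceFile.Reader) is assumed to happen elsewhere.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of Nutch or Hadoop): writes one text file per URL. */
final class ParseTextSplitter {

    /**
     * Derives a file name from a URL by replacing every character outside
     * [A-Za-z0-9._-] with '_'. Different URLs can map to the same name, and
     * very long URLs may exceed file system limits; a real tool would need
     * to handle both cases.
     */
    static String fileNameFor(String url) {
        return url.replaceAll("[^A-Za-z0-9._-]", "_") + ".txt";
    }

    /** Writes each (url, text) pair to its own UTF-8 file under outDir. */
    static void writeAll(Map<String, String> textByUrl, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        for (Map.Entry<String, String> entry : textByUrl.entrySet()) {
            Path target = outDir.resolve(fileNameFor(entry.getKey()));
            Files.write(target, entry.getValue().getBytes(StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample records; in a real run these would be the URL keys
        // and parsed text values read out of the segment.
        Map<String, String> sample = new LinkedHashMap<>();
        sample.put("http://example.com/", "Example page text");
        sample.put("http://example.com/about?id=1", "About page text");

        Path outDir = Files.createTempDirectory("parse_text_out");
        writeAll(sample, outDir);
        System.out.println("Wrote " + sample.size() + " files to " + outDir);
    }
}
```

Each output file can then be passed to a tagger or named entity recognizer on its own. Keeping the file-writing step separate from the segment-reading step means the same code works whether the records come from a custom sequence file printer or from some other source.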

--n


On Dec 13, 2007, at 8:53 AM, qa_nutch wrote:

I am using Nutch 0.9.



qa_nutch wrote:
Hello. I am new to Nutch and have read the basics. I want to access the parsed HTML text of each URL (separately) from the segment. I can then use each of those parsed text files for other NLP tasks such as tagging and named entity recognition. Using segment dump gave me a lot of information together: parsed text, links, HTML, etc. So I wish to obtain the parsed text of the HTML corresponding to each URL in the linkdb separately. Is this possible?
--
View this message in context: http://www.nabble.com/html-parse-text-tp14319904p14319916.html
Sent from the Nutch - User mailing list archive at Nabble.com.
