In any segment there will be a folder called parse_data, which holds {Text, ParseData} records. You can write a simple sequence file printer class that prints all of the data for you using the Hadoop API.
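Something along these lines might do as a starting point. It is only a rough sketch, not tested against a real 0.9 segment. The class name SequenceFilePrinter and the helper formatRecord are made up for illustration. It assumes the Hadoop and Nutch jars are on the classpath (the Nutch jar is needed so the ParseData value class can be instantiated), and that you pass it the sequence file itself, which in a typical segment would be something like parse_data/part-00000/data rather than the parse_data folder. As far as I know the same class should also work on the parse_text folder, which holds the plain parsed text as {Text, ParseText}.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Hypothetical helper class: prints every key/value record of one sequence file.
public class SequenceFilePrinter {

    // Illustrative formatting only: a header line with the key (the URL),
    // followed by the value's toString() output.
    static String formatRecord(Object key, Object value) {
        return "=== " + key + " ===\n" + value;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: SequenceFilePrinter <path to sequence file>");
            System.exit(1);
        }

        Configuration conf = new Configuration();
        Path file = new Path(args[0]);
        FileSystem fs = file.getFileSystem(conf);

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
        try {
            // The file header records the key and value classes, so the
            // printer does not need to hard-code Text or ParseData.
            Writable key =
                (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value =
                (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

            // next() refills the same key and value objects for each record
            // and returns false at end of file.
            while (reader.next(key, value)) {
                System.out.println(formatRecord(key, value));
            }
        } finally {
            reader.close();
        }
    }
}
```

Since the key of each record is the URL, you could write each value to its own file inside the loop instead of printing it, which would give you one parsed-text file per URL.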

--n

On Dec 13, 2007, at 8:53 AM, qa_nutch wrote:


I am using nutch 0.9



qa_nutch wrote:

Hello. I am new to Nutch and have read the basics. I want to access the parsed HTML text of each URL separately from the segment, so that I can use each of those parsed text files for other NLP tasks such as tagging and named entity recognition. Using a segment dump gave me a lot of information together: parsed text, links, HTML, etc. So I wish to obtain the parsed text of the HTML corresponding to each URL in the linkdb separately. Is this possible?


--
View this message in context: http://www.nabble.com/html-parse-text-tp14319904p14319916.html
Sent from the Nutch - User mailing list archive at Nabble.com.

