In any segment there will be a folder called parse_data, which holds
{Text, ParseData} records. You can write a simple sequence file printer
class to print all of the data for you using the Hadoop API.
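
A minimal sketch of such a printer follows. It is not taken from the Nutch
source. The class name SequenceFilePrinter and the helper formatRecord are
made up for illustration, and it assumes the classic
SequenceFile.Reader(FileSystem, Path, Configuration) constructor plus the
Nutch jar on the classpath so that ParseData can be deserialized.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Hypothetical helper class: dumps every key/value pair of one sequence file.
public class SequenceFilePrinter {

  // Formats one record for display; plain string handling only.
  static String formatRecord(String key, String value) {
    return "key: " + key + "\nvalue:\n" + value + "\n";
  }

  public static void main(String[] args) throws Exception {
    if (args.length != 1) {
      System.err.println("Usage: SequenceFilePrinter <sequence-file>");
      System.exit(1);
    }

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path(args[0]);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
    try {
      // The file header records the key and value classes, so nothing
      // Nutch-specific has to be hard-coded here.
      Writable key =
          (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value =
          (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

      // next() refills the same two objects on every call.
      while (reader.next(key, value)) {
        System.out.println(formatRecord(key.toString(), value.toString()));
      }
    } finally {
      reader.close();
    }
  }
}
```

The argument would be the data file inside the folder, most likely something
like <segment>/parse_data/part-00000/data (that layout is an assumption, so
check your own segment). The keys are the URLs, so you can write each value
to its own file instead of printing it. If it is the extracted text itself
you are after, the sibling parse_text folder ({Text, ParseText}) can
presumably be read the same way.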
--n
On Dec 13, 2007, at 8:53 AM, qa_nutch wrote:
I am using Nutch 0.9.
qa_nutch wrote:
Hello. I am new to Nutch and have read the basics. I want to access the
parsed HTML text of each URL (separately) from the segment. I can then
use each of those parsed text files for other NLP tasks such as tagging
and named entity recognition. Using the segment dump gave me a lot of
information together: parsed text, links, HTML, etc. So I wish to obtain
the parsed text of the HTML corresponding to each URL in the linkdb
separately. Is this possible?
--
View this message in context: http://www.nabble.com/html-parse-text-tp14319904p14319916.html
Sent from the Nutch - User mailing list archive at Nabble.com.