Converting HTML text in org.apache.nutch.protocol.Content to String

byte array Mon, 12 Aug 2013 10:00:19 -0700

Hello!

I would like to convert the HTML held in an org.apache.nutch.protocol.Content object to a String inside the org.apache.nutch.segment.SegmentReader.reduce() method. This is what I am doing at the moment:


String htmlContent = new String(((Content)value).getContent(), "UTF-8");

Although the original HTML pages state that their encoding is UTF-8, the resulting HTML inside the string seems to be malformed, as I fail to build a DOM out of it. What is the proper way of converting the byte[] contained inside the Content class to a String?


Thanks,
Regards
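
For reference, here is a minimal sketch of one way to choose the charset before decoding, instead of hard-coding "UTF-8". It uses only the JDK and makes no claim about how Nutch itself detects encodings. The class name HtmlDecoder, its methods, and the contentTypeHeader parameter are hypothetical names introduced for illustration. It checks, in order, for a byte order mark, a charset in the HTTP Content-Type header (if one is available), and a charset declared in a meta tag near the start of the page, then falls back to UTF-8. It also drops a leading BOM character from the result, since a stray U+FEFF at the start of the string is a common reason XML-style DOM parsers reject otherwise valid markup.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: picks a charset for raw HTML bytes.
public class HtmlDecoder {

    // Matches charset=... in a Content-Type header or in a <meta> tag.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // contentTypeHeader may be null when no header value is available.
    static Charset detect(byte[] raw, String contentTypeHeader) {
        // 1. A byte order mark takes precedence over any declaration.
        if (raw.length >= 3 && (raw[0] & 0xFF) == 0xEF
                && (raw[1] & 0xFF) == 0xBB && (raw[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (raw.length >= 2 && (raw[0] & 0xFF) == 0xFE && (raw[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (raw.length >= 2 && (raw[0] & 0xFF) == 0xFF && (raw[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        // 2. Charset named in the HTTP header, if any.
        Charset cs = fromLabel(contentTypeHeader);
        if (cs != null) {
            return cs;
        }
        // 3. Charset declared in the first 1024 bytes of the page. ISO-8859-1
        //    maps every byte to one char, so this pre-scan cannot fail.
        String head = new String(raw, 0, Math.min(raw.length, 1024),
                StandardCharsets.ISO_8859_1);
        cs = fromLabel(head);
        // 4. Fall back to UTF-8 when nothing was declared.
        return cs != null ? cs : StandardCharsets.UTF_8;
    }

    // Returns the charset named in the text, or null if none is usable.
    private static Charset fromLabel(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            return null; // illegal or unsupported charset name
        }
    }

    static String decode(byte[] raw, String contentTypeHeader) {
        String html = new String(raw, detect(raw, contentTypeHeader));
        // Drop a leading BOM character so it does not reach the DOM parser.
        return html.startsWith("\uFEFF") ? html.substring(1) : html;
    }
}
```

With the snippet from the question, this would be called as HtmlDecoder.decode(((Content) value).getContent(), null), passing the header value instead of null if it can be obtained. If the DOM still cannot be built after that, the cause may be the markup itself rather than the encoding: real-world HTML is often not well-formed, so a lenient HTML parser may be needed in place of a strict XML one.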
