Hello!

I would like to convert the HTML held in an org.apache.nutch.protocol.Content object to a String inside the org.apache.nutch.segment.SegmentReader.reduce() method. This is what I have so far:

String htmlContent = new String(((Content)value).getContent(), "UTF-8");

Although the original HTML pages declare their encoding as UTF-8, the resulting string seems to be malformed, because I fail to build a DOM from it. What is the proper way to convert the byte[] held by the Content class to a String?
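
For reference, here is a minimal sketch of one possible approach, not a confirmed fix: instead of hard-coding "UTF-8", pick the charset from a byte-order mark, then from the Content-Type value, then from a charset declaration near the top of the page, and only then fall back to UTF-8. The class name ContentDecoder and its methods are hypothetical and use only the JDK. Whether the content type stored by Nutch actually carries a charset parameter is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
public class ContentDecoder {

    // Matches charset=... in an HTTP header value or an HTML meta tag.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Picks a charset: UTF-8 BOM, then header, then in-page declaration, then UTF-8.
    public static Charset detectCharset(String contentType, byte[] raw) {
        if (raw.length >= 3 && (raw[0] & 0xFF) == 0xEF
                && (raw[1] & 0xFF) == 0xBB && (raw[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        Charset cs = lookup(contentType);
        if (cs != null) {
            return cs;
        }
        // ISO-8859-1 maps every byte to one char, so sniffing the head is lossless.
        int headLen = Math.min(raw.length, 1024);
        String head = new String(raw, 0, headLen, StandardCharsets.ISO_8859_1);
        cs = lookup(head);
        return cs != null ? cs : StandardCharsets.UTF_8;
    }

    // Returns the first declared charset in text, or null if absent or unsupported.
    private static Charset lookup(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            return null; // illegal or unsupported charset name
        }
    }

    // Decodes the bytes and drops a leading BOM character, which can upset parsers.
    public static String decode(String contentType, byte[] raw) {
        String text = new String(raw, detectCharset(contentType, raw));
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            text = text.substring(1);
        }
        return text;
    }
}
```

Inside reduce() this would be called roughly as ContentDecoder.decode(content.getContentType(), content.getContent()), where content is the value cast to Content. If the DOM builder still fails on the decoded string, the problem may be malformed markup rather than encoding, in which case a lenient HTML parser would be needed instead of a strict XML one.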

Thanks,
Regards
