Hello!

I would like to convert the crawled HTML held in an org.apache.nutch.protocol.Content object to a String, inside the org.apache.nutch.segment.SegmentReader.reduce() method:

String htmlContent = new String(((Content)value).getContent(), "UTF-8");

The resulting HTML seems to be malformed, as I fail to build a DOM out of it. The same happens with other encodings. What is the proper way to convert the byte[] stored inside the Content class to a String? Would it be practical to modify the Fetcher class so that it stores the content as UTF-8, and if so, where exactly?
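
To make the question concrete, here is a minimal plain-JDK sketch of what I suspect the conversion has to do: pick the charset from the Content-Type value if it names one, otherwise look for a charset declaration in the first bytes of the page, and only then fall back to UTF-8. ContentDecoder, detect() and decode() are hypothetical names of mine, not part of Nutch, and the contentType argument is assumed to be whatever Content-Type was recorded for the page. This is only an illustration of the idea, not a tested fix.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
final class ContentDecoder {
    // Matches "charset=xyz" in a Content-Type value or in an HTML meta tag.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset, or UTF-8 if none is declared or supported.
    static Charset detect(byte[] raw, String contentType) {
        Charset cs = lookup(contentType);
        if (cs != null) {
            return cs;
        }
        // ISO-8859-1 maps every byte to one char, so it is safe for
        // scanning the ASCII markup at the start of the page.
        int n = Math.min(raw.length, 2048);
        String head = new String(raw, 0, n, StandardCharsets.ISO_8859_1);
        cs = lookup(head);
        return cs != null ? cs : StandardCharsets.UTF_8;
    }

    // Extracts the first charset name found in text, or null.
    private static Charset lookup(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: treat as undeclared.
            }
        }
        return null;
    }

    // Decodes the raw page bytes with the detected charset.
    static String decode(byte[] raw, String contentType) {
        return new String(raw, detect(raw, contentType));
    }
}
```

In reduce() the call would then be something like ContentDecoder.decode(bytes, contentType) in place of the hard-coded "UTF-8" above. Pages that declare no charset at all would still need real encoding detection, which this sketch does not attempt.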

Thanks,
Regards
