Hello!
I would like to convert the crawled HTML held in an
org.apache.nutch.protocol.Content object to a String inside the
org.apache.nutch.segment.SegmentReader.reduce() method. This is what I tried:
String htmlContent = new String(((Content)value).getContent(), "UTF-8");
The resulting HTML seems to be malformed, as I fail to build a DOM out of it.
The same happens with other encodings. What would be the proper way of
converting the byte[] contained in the Content class to a String? Would
it be practical to modify the Fetcher class to store the content as UTF-8
(and if so, where exactly)?
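
To make the question concrete, here is a rough sketch of the kind of
conversion I have in mind: pick the charset the page declares instead of
hard-coding one. ContentDecoder and its methods are hypothetical names, not
part of Nutch, and I am only assuming that a Content-Type value (with a
charset parameter) can be obtained from the Content object or its metadata.
It looks at the Content-Type value first, then at a charset declaration in
the first 2048 bytes, and only then falls back to UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: decodes raw page bytes using the
// charset the page declares rather than a fixed one.
final class ContentDecoder {

    // Matches "charset=..." in a Content-Type value or in a <meta> tag.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Order of preference: Content-Type value, declaration near the start
    // of the document, then UTF-8 as a last resort (an assumption).
    static Charset detect(byte[] raw, String contentType) {
        Charset cs = fromDeclaration(contentType);
        if (cs != null) {
            return cs;
        }
        // ISO-8859-1 maps every byte to a char, so this sniffing step
        // cannot fail on any input.
        int n = Math.min(raw.length, 2048);
        String head = new String(raw, 0, n, StandardCharsets.ISO_8859_1);
        cs = fromDeclaration(head);
        return cs != null ? cs : StandardCharsets.UTF_8;
    }

    // Returns the first declared charset, or null if none is declared or
    // the declared name is unknown to this JVM.
    private static Charset fromDeclaration(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                return null; // illegal or unsupported charset name
            }
        }
        return null;
    }

    static String decode(byte[] raw, String contentType) {
        return new String(raw, detect(raw, contentType));
    }
}
```

In reduce() this would replace the fixed "UTF-8" argument, for example
ContentDecoder.decode(((Content) value).getContent(), contentTypeValue),
where contentTypeValue stands for wherever the Content-Type value actually
comes from.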
Thanks,
Regards