Fwd: Converting HTML text in org.apache.nutch.protocol.Content to String

byte array Mon, 12 Aug 2013 10:45:10 -0700

Hello!

I would like to convert the crawled HTML contained in the org.apache.nutch.protocol.Content class to a String in the org.apache.nutch.segment.SegmentReader.reduce() method.


String htmlContent = new String(((Content)value).getContent(), "UTF-8");

The resulting HTML seems to be malformed, as I fail to build a DOM out of it. The same happens with other encodings. What would be the proper way of converting the byte[] contained inside the Content class to a String? Would it be practical to modify the Fetcher class to store the content as UTF-8 (and if so, where exactly)?
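
A minimal sketch of charset-aware decoding, to make the question concrete. It assumes the page's charset is declared in a Content-Type header value and falls back to UTF-8 otherwise. The class ContentDecoder and its methods are hypothetical names used for illustration; they are not part of Nutch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Nutch class.
final class ContentDecoder {

    // Matches e.g. "text/html; charset=ISO-8859-1" (optionally quoted).
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the header value, or UTF-8 if none is
    // declared or the declared name is unknown to this JVM (assumed fallback).
    static Charset charsetFrom(String contentType) {
        if (contentType != null) {
            Matcher m = CHARSET.matcher(contentType);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Decodes the raw bytes with the charset derived from the header value.
    static String decode(byte[] raw, String contentType) {
        return new String(raw, charsetFrom(contentType));
    }
}
```

Inside reduce() this might be called as ContentDecoder.decode(content.getContent(), content.getMetadata().get("Content-Type")), assuming the original header is still present in the metadata at that point. Pages that declare their charset only in a meta tag would not be covered by this sketch.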


Thanks,
Regards
