Yves Petinot wrote:
Hi,

I'm trying to build a small Perl (could be any scripting language) utility that takes the output of nutch readseg -dump as its input, decodes the content field to UTF-8 (regardless of what encoding the raw page was in), and outputs the decoded content. After a bit of experimentation, I find myself unable to decode the content field, even when I use the charset hints available in the content metadata or in the raw content itself.

I was wondering whether someone on the list has already succeeded in building this type of functionality, or whether the content returned by readseg uses a specific encoding that I don't know of?

The dump functionality is not intended to provide a bit-by-bit copy of the segment; it's mostly for debugging purposes. It uses System.out, which in turn uses the default platform encoding, so any characters outside this encoding will be replaced by question marks.
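
To illustrate why the dump cannot be decoded afterwards, here is a minimal sketch in Python (not Nutch code). It assumes that encoding with errors="replace" is a fair analogue of what a Java PrintStream does with unmappable characters; the helper name dump_through is made up for this example.

```python
# Python analogue of writing text through a stream with a limited charset.
# Encoding with errors="replace" substitutes '?' for every character the
# target charset cannot represent, much like the dump described above.

def dump_through(text: str, platform_encoding: str) -> bytes:
    """Encode text the way a lossy console stream would (hypothetical helper)."""
    return text.encode(platform_encoding, errors="replace")

# "cafe" with an accented e, followed by two CJK characters.
original = "caf\u00e9 \u65e5\u672c"

# With an ASCII platform encoding, every non-ASCII character becomes '?'.
dumped = dump_through(original, "ascii")
print(dumped)

# The original characters are gone: the bytes now contain literal '?'
# and no charset hint can bring the text back.
recovered = dumped.decode("ascii")
print(recovered == original)

# With a platform encoding that covers the text, nothing is lost.
print(dump_through(original, "utf-8").decode("utf-8") == original)
```

Once a character has been replaced, the output holds a real question mark byte, so no choice of decoder on the reading side can restore it.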

If you want to get an exact copy of the raw binary content then please use the SegmentReader API.
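
Once you have the raw bytes, the decoding step you describe could look roughly like the sketch below. It is only an illustration under assumptions: the bytes and the Content-Type value are assumed to have been obtained already (this code does not call Nutch), and the names decode_content, charset_from_content_type and charset_from_meta are hypothetical. The order of the hints (header, then in-page declaration, then UTF-8, then Latin-1) is one reasonable choice, not a prescribed one.

```python
import re

# Matches "charset=..." in a Content-Type value or an HTML meta tag.
CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)

def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    match = CHARSET_RE.search(content_type)
    return match.group(1) if match else None

def charset_from_meta(raw):
    """Return a charset declared in the first 4 KB of the page, or None."""
    # Charset declarations are plain ASCII, so non-ASCII bytes can be dropped
    # for the purpose of this search.
    head = raw[:4096].decode("ascii", errors="ignore")
    match = CHARSET_RE.search(head)
    return match.group(1) if match else None

def decode_content(raw, content_type=None):
    """Decode raw page bytes to str, trying the available charset hints in order."""
    candidates = [
        charset_from_content_type(content_type),
        charset_from_meta(raw),
        "utf-8",
    ]
    for name in candidates:
        if not name:
            continue
        try:
            return raw.decode(name)
        except (LookupError, UnicodeDecodeError):
            # Unknown charset name, or bytes that are invalid in that charset.
            continue
    # Latin-1 maps every byte to a character, so this last resort never fails.
    return raw.decode("latin-1")

# Example: a Latin-1 page whose charset is given in the HTTP header.
page = b"<html><body>caf\xe9</body></html>"
text = decode_content(page, "text/html; charset=ISO-8859-1")

# Re-encode as UTF-8 for output.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)
```

A hint can be wrong or missing, which is why each candidate is tried in turn instead of trusting the first one.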

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
