Re: decoding nutch readseg -dump 's output

Andrzej Bialecki Mon, 16 Nov 2009 11:53:25 -0800

Yves Petinot wrote:

Hi,
I'm trying to build a small Perl (could be any scripting language) utility that takes nutch readseg -dump's output as its input, decodes the content field to UTF-8 (independent of what encoding the raw page was in) and outputs that decoded content. After a little bit of experimentation, I find myself unable to decode the content field, even when I try using the various charset hints that are available either in the content metadata or in the raw content itself.
I was wondering if someone on the list has already succeeded in building this type of functionality, or whether the content returned by readseg uses a specific encoding that I don't know of?

The dump functionality is not intended to provide a bit-by-bit copy of the segment; it's mostly for debugging purposes. It uses System.out, which in turn uses the default platform encoding, so any characters outside this encoding will be replaced by question marks.
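
To illustrate the effect described above, here is a minimal sketch in plain Java (not Nutch code). The class name EncodingLossDemo and the method printThrough are hypothetical names made up for this example. It prints a string through a PrintStream configured with a chosen charset, which stands in for System.out and the platform default encoding, and then decodes the resulting bytes with that same charset.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class EncodingLossDemo {

    // Hypothetical helper: writes text through a PrintStream that encodes
    // with the named charset (standing in for System.out and the platform
    // default), then decodes the captured bytes with the same charset.
    public static String printThrough(String text, String charsetName)
            throws UnsupportedEncodingException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true, charsetName);
        out.print(text);
        out.flush();
        return new String(sink.toByteArray(), charsetName);
    }

    public static void main(String[] args) throws Exception {
        // "cafe" with an accented e, followed by two CJK characters.
        String text = "caf\u00e9 \u65e5\u672c";

        // Characters that US-ASCII cannot represent are lost at encode time.
        System.out.println(printThrough(text, "US-ASCII"));

        // UTF-8 can represent every character, so nothing is lost here.
        System.out.println(printThrough(text, "UTF-8").equals(text));
    }
}
```

With US-ASCII the helper returns "caf? ??": the substitution happens when the characters are encoded, so the original bytes cannot be recovered from the dump afterwards, whatever charset hints the page metadata carries.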

If you want to get an exact copy of the raw binary content, then please use the SegmentReader API.


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
