Hi all,

I'm porting an application to MapReduce (currently Hadoop 0.21.0) and have run into the following problem: my application implements a cache that holds data instances and implements the Writable interface. When Hadoop calls the write method to serialize the data that is passed to the mappers, the cache serializes each instance and writes it as a String to the DataOutput. When the cache's fields are read back, these strings are parsed and the instances are recreated.

This works as long as the strings contain only plain ASCII characters. If they contain characters from another code page, I encode them as UTF-8 before transmitting. My problem now is that the target cache does end up with all the instances, but they can no longer be compared correctly because of the UTF-8-encoded characters: similarities show up that are not present in the original data.

Does anyone have a suggestion for solving this? Ideally there would be a way to transmit any string without changing its encoding.

Thanks a lot,
Stanley
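P.S. To make the write/read path concrete, here is a minimal sketch of length-prefixed byte-level serialization over plain java.io DataOutput/DataInput (class and method names are hypothetical, and it assumes UTF-8 round-trips the data losslessly; it is not the actual cache code):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class RawStringIO {

    // Write the string's bytes with an explicit length prefix, so the
    // reader can reconstruct an identical String without guessing where
    // it ends or which encoding was used.
    static void writeString(DataOutput out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8); // assumption: UTF-8 encoding
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Read the length prefix, then exactly that many bytes, and decode
    // with the same charset used for writing.
    static String readString(DataInput in) throws IOException {
        int len = in.readInt();
        byte[] bytes = new byte[len];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String original = "h\u00e9llo \u4e16\u754c"; // non-ASCII sample data
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        writeString(new DataOutputStream(bos), original);

        String back = readString(new DataInputStream(
                new ByteArrayInputStream(bos.toByteArray())));
        System.out.println(back.equals(original)); // prints true
    }
}
```

As long as both sides use the same charset, the decoded String is character-for-character identical to the original, so comparisons behave the same as on the source side.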
I'm porting an application to MapReduce (currently hadoop v0.21.0) and encountered the following problem: My application implements a cache which contains data instances and implements the Writable interface. When hadoop calls the write-method to write the data which shall be passed to the mappers, the cache serializes each instance and writes them as Strings to the DataOutput. When reading the fields of the cache again, these strings are parsed and instances are created. This works, as long as the strings only contain utf-characters. If they contain characters from another codepage, I encode them as UTF-8 and transmit them. My problem now is, that the target cache contains all the instances but they cannot be compared properly because of the UTF-encoded characters -> there are similarities that are not present in the original data. So, does anyone have any suggestions on how to solve this problem? The best would be a method to transmit any string without changing its encoding. Thanks a lot, Stanley