Hi all,

I'm porting an application to MapReduce (currently Hadoop 0.21.0) and have run into the following problem: my application implements a cache that holds data instances and implements the Writable interface. When Hadoop calls the write method to serialize the data that is passed to the mappers, the cache serializes each instance and writes it as a String to the DataOutput. When the cache's fields are read back, these strings are parsed and the instances are recreated.

This works as long as the strings contain only plain ASCII characters. If they contain characters from another code page, I encode them as UTF-8 before transmitting. My problem now is that the target cache does end up with all the instances, but they can no longer be compared correctly because of the UTF-8-encoded characters: similarities show up that are not present in the original data.

Does anyone have a suggestion for solving this? Ideally there would be a way to transmit any string without changing its encoding.

Thanks a lot,
Stanley
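P.S. To make the write/read path concrete, here is a minimal sketch of length-prefixed byte-level serialization over plain java.io DataOutput/DataInput (class and method names are hypothetical, and it assumes UTF-8 round-trips the data losslessly; it is not the actual cache code):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class RawStringIO {

    // Write the string's bytes with an explicit length prefix, so the
    // reader can reconstruct an identical String without guessing where
    // it ends or which encoding was used.
    static void writeString(DataOutput out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8); // assumption: UTF-8 encoding
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Read the length prefix, then exactly that many bytes, and decode
    // with the same charset used for writing.
    static String readString(DataInput in) throws IOException {
        int len = in.readInt();
        byte[] bytes = new byte[len];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String original = "h\u00e9llo \u4e16\u754c"; // non-ASCII sample data
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        writeString(new DataOutputStream(bos), original);

        String back = readString(new DataInputStream(
                new ByteArrayInputStream(bos.toByteArray())));
        System.out.println(back.equals(original)); // prints true
    }
}
```

As long as both sides use the same charset, the decoded String is character-for-character identical to the original, so comparisons behave the same as on the source side.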
I'm porting an application to MapReduce (currently hadoop v0.21.0) and encountered the following problem: My application implements a cache which contains data instances and implements the Writable interface. When hadoop calls the write-method to write the data which shall be passed to the mappers, the cache serializes each instance and writes them as Strings to the DataOutput. When reading the fields of the cache again, these strings are parsed and instances are created. This works, as long as the strings only contain utf-characters. If they contain characters from another codepage, I encode them as UTF-8 and transmit them. My problem now is, that the target cache contains all the instances but they cannot be compared properly because of the UTF-encoded characters -> there are similarities that are not present in the original data. So, does anyone have any suggestions on how to solve this problem? The best would be a method to transmit any string without changing its encoding. Thanks a lot, Stanley