On Tue, Jan 25, 2011 at 8:57 PM, Hitesh Jethwani <hjethwa...@gmail.com> wrote:
> > if on the stream writer, I add something like:
> > writer.write(new String(msg.getBytes(), "UTF8").getBytes()) instead of
> > simply writer.write(msg.getBytes()), I see the characters as expected
> > on the C++ client. However this I believe messes up with the protobuf
> > headers, so on C++ I receive only a partial file upto the entry that
> > contains one such character.
> Still not sure on the above though.
The reason this appears to work is that String.getBytes() with no arguments
encodes using the platform's default charset, which on your system is
evidently ISO-8859-1. That encoding represents each character as exactly one
byte, and it can only represent character codes U+0000 through U+00FF. Since
you are decoding the bytes as UTF-8 and then re-encoding them as ISO-8859-1,
and since the character 'É' happens to fall within the ISO-8859-1 range, you
effectively collapsed this character into a single byte. On the
C++ side, the protobuf library does not verify that the parsed bytes are
actually valid UTF-8 (except in debug mode); it just passes them through.
So the string you see there includes the 'É' character as one byte.
However, you end up getting a parser error because the length of the string
(in bytes) ends up being different from the length given in the encoded
message. The length was originally computed with 'É' represented as two
bytes, but now it is only one byte, so the length is wrong.
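To make the length mismatch concrete, here is a minimal standalone Java sketch (not from your code, just an illustration) showing that 'É' occupies two bytes in UTF-8 but only one in ISO-8859-1:

```java
import java.nio.charset.StandardCharsets;

public class EncodingLengths {
    public static void main(String[] args) {
        String s = "É";  // U+00C9

        // UTF-8 encodes U+00C9 as two bytes: 0xC3 0x89.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);

        // ISO-8859-1 encodes it as a single byte: 0xC9.
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(utf8.length);    // prints 2
        System.out.println(latin1.length);  // prints 1
    }
}
```

The encoded message's length prefix was computed from the two-byte UTF-8 form, so shipping the one-byte form makes every subsequent field start one byte early.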
In general, decoding arbitrary bytes (like an encoded protobuf) as if they
were UTF-8 is lossy: any byte sequence that is not valid UTF-8 gets mangled
during decoding, so converting bytes -> UTF-8 string -> bytes will corrupt
the data.
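As an illustration (again, not from your code), here is a small Java sketch showing that round-tripping arbitrary bytes through a UTF-8 String is not reversible; invalid sequences are replaced with U+FFFD and the original bytes are gone:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyRoundTrip {
    public static void main(String[] args) {
        // 0x89 is a bare continuation byte and 0xC3 is a lead byte with
        // no continuation -- neither is valid UTF-8 on its own, so the
        // decoder substitutes U+FFFD for each.
        byte[] original = {(byte) 0x89, (byte) 0xC3};

        byte[] roundTripped =
            new String(original, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);

        // The round-tripped bytes are the UTF-8 encoding of two
        // replacement characters, not the original two bytes.
        System.out.println(Arrays.equals(original, roundTripped)); // prints false
    }
}
```

This is why binary protobuf output must be written to the stream as raw bytes, never passed through a String.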
You received this message because you are subscribed to the Google Groups
"Protocol Buffers" group.