Hi,

Here is a link to an updated test case that simplifies the string being tested to just the problem character, and fixes a bug in determining the length of the array returned by GetStringUTFChars.
https://s3.amazonaws.com/com.voltdb.aweisberg/utf8_encoding_bug2.tgz

Thanks,
Ariel

On Tue, Jun 5, 2012, at 11:38 AM, Ariel Weisberg wrote:
> Hi all,
>
> Not sure what list this should go to.
>
> I found an issue with JNI's GetStringUTFChars which is supposed to
> return a Java string in UTF-8 encoding. There is an attached test case.
> I tested on Ubuntu 12.04 (Linux aweisberg-desktop 2.6.32-41-generic
> #89-Ubuntu SMP Fri Apr 27 22:18:56 UTC 2012 x86_64 GNU/Linux) and CentOS
> 5 (Linux volt3b 2.6.18-308.4.1.el5 #1 SMP Tue Apr 17 17:08:00 EDT 2012
> x86_64 x86_64 x86_64 GNU/Linux) with JDK 6 update 32 and JDK 7 update 4.
>
> For the following string "â🀲x一xxéyyԱ" I find that the first character is
> encoded correctly, but the second character
> (http://www.fileformat.info/info/unicode/char/1f032/index.htm) comes out
> with an invalid code point.
>
> The result of String.getBytes("UTF-8") is
> c3a2f09f80b278e4b8807878c3a97979d4b1 and this matches the output I get
> from defining the string as a constant in C++.
>
> The result of GetStringUTFChars is c3a2eda0bcedb0b278e4b8.
>
> See this test case
> (https://s3.amazonaws.com/com.voltdb.aweisberg/utf8_encoding_bug.tgz)
> for a reproducer and how I displayed the values.
>
> Thanks,
> Ariel
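
A possible explanation for the two byte sequences above: the bytes eda0bc edb0b2 are what one gets when the UTF-16 surrogate pair for U+1F032 (D83C, DC32) is encoded one surrogate at a time, three bytes each. That is how "modified UTF-8" handles supplementary characters, and the JNI specification describes GetStringUTFChars as returning modified UTF-8 rather than standard UTF-8. The sketch below reproduces both encodings in pure Java. It is only an illustration: DataOutputStream.writeUTF is used as a stand-in for the JNI call, on the assumption that both emit the same bytes for this character, and the class and method names are made up for this example. StandardCharsets requires Java 7 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the linked test case.
public class ModifiedUtf8Demo {

    // Hex-encode bytes[from..end) as lowercase hex.
    static String hex(byte[] bytes, int from) {
        StringBuilder sb = new StringBuilder();
        for (int i = from; i < bytes.length; i++) {
            sb.append(String.format("%02x", bytes[i] & 0xff));
        }
        return sb.toString();
    }

    // Standard UTF-8: one 4-byte sequence per supplementary code point.
    static String standardUtf8Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_8), 0);
    }

    // Modified UTF-8 as written by DataOutputStream.writeUTF:
    // each UTF-16 surrogate is encoded separately as 3 bytes.
    static String modifiedUtf8Hex(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            // writeUTF prefixes the data with a 2-byte length; skip it.
            return hex(buf.toByteArray(), 2);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // U+1F032, the problem character from the report.
        String s = new String(Character.toChars(0x1F032));
        System.out.println("standard UTF-8: " + standardUtf8Hex(s));
        System.out.println("modified UTF-8: " + modifiedUtf8Hex(s));
    }
}
```

Run on the single character U+1F032, this prints f09f80b2 for standard UTF-8 and eda0bcedb0b2 for modified UTF-8, which line up with the corresponding bytes in the String.getBytes("UTF-8") and GetStringUTFChars results quoted above. If that is indeed the cause, native code that needs standard UTF-8 would have to obtain it another way, for example by calling String.getBytes("UTF-8") on the Java side and passing the byte array through JNI.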