The reason to change is that returning UTF-8 is useful and returning “modified UTF-8” is apparently not (as no one has explained why it is useful).
Why not deprecate it? It would be nice to get a warning.

Alan

> On Jan 25, 2019, at 6:40 PM, David Holmes <david.hol...@oracle.com> wrote:
>
> On 26/01/2019 3:29 am, Alan Snyder wrote:
>> My question was not about why it does what it does, but why it still does
>> that. Is there a valid use of this primitive that depends upon it returning
>> something other than true UTF-8?
>
> It still does what it does because that was how it was specified 20+ years
> ago and there's been no reason to change.
>
>> It may not have been an issue to you, but it was to me when I discovered my
>> program could not handle certain file names. I’ll bet I’m not the last
>> person to assume that a primitive named GetStringUTFChars returns UTF.
>
> It does return chars in a UTF (Unicode transformation format) - that format
> is a modified UTF-8 format. It isn't named GetStringUTF8Chars.
>
> The documentation is quite clear:
>
> GetStringUTFChars
>
> const char * GetStringUTFChars(JNIEnv *env, jstring string, jboolean *isCopy);
>
> Returns a pointer to an array of bytes representing the string in modified
> UTF-8 encoding.
>
> ---
>
> David
> -----
>
>> I have fixed my code, so it's not an issue for me any more, but it seems like
>> an unnecessary tarpit awaiting the unwary.
>> Just my 2c.
>> Alan
>>> On Jan 24, 2019, at 10:04 PM, David Holmes <david.hol...@oracle.com> wrote:
>>>
>>> On 25/01/2019 4:39 am, Alan Snyder wrote:
>>>> Thank you. That post does explain what is happening, but leaves open the
>>>> question of whether GetStringUTFChars should be changed.
>>>> What is the value of the current implementation of GetStringUTFChars
>>>> versus one that returns true UTF-8?
>>>
>>> Well that's really a Hotspot question as it concerns JNI, but this is
>>> ancient history. There's little point musing over the "why" of decisions
>>> made back in the late 1990s. But I suspect the main reason is the
>>> avoidance of embedded NUL characters.
>>>
>>> The only bug report I can see on this (basically the same issue you are
>>> reporting) was back in 2004:
>>>
>>> https://bugs.openjdk.java.net/browse/JDK-5030776
>>>
>>> so it simply has not been an issue. As per the SO article that Claes
>>> referenced, anyone needing true UTF-8 has a couple of paths to achieve that.
>>>
>>> Cheers,
>>> David
>>> -----
>>>
>>>> Alan
>>>>> On Jan 24, 2019, at 10:32 AM, Claes Redestad <claes.redes...@oracle.com>
>>>>> wrote:
>>>>>
>>>>> Hi Alan,
>>>>>
>>>>> GetStringUTFChars unfortunately doesn't give you true UTF-8, but a
>>>>> modified UTF-8 sequence as used by the VM internally for historical
>>>>> reasons.
>>>>>
>>>>> See answers to this related question on SO (which contains links to
>>>>> official docs):
>>>>> https://stackoverflow.com/questions/32205446/getting-true-utf-8-characters-in-java-jni
>>>>>
>>>>> HTH
>>>>>
>>>>> /Claes
>>>>>
>>>>> On 2019-01-24 19:23, Alan Snyder wrote:
>>>>>> I am having a problem with file names that contain emojis when passed to
>>>>>> a macOS system call.
>>>>>>
>>>>>> Things work when I convert the path to bytes in Java, but fail (file not
>>>>>> found) when I convert the path to bytes in native code using
>>>>>> GetStringUTFChars.
>>>>>>
>>>>>> For example, where String.getBytes() returns
>>>>>>
>>>>>> -16 -97 -115 -69
>>>>>>
>>>>>> GetStringUTFChars returns:
>>>>>>
>>>>>> -19 -96 -68 -19 -67 -69
>>>>>>
>>>>>> I’m not a UTF expert, so can someone say whether I should file a bug
>>>>>> report?
>>>>>>
>>>>>> (Tested in JDK 9, 11, and a fairly recent 12)
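
For anyone who wants to see the two byte sequences from the original report side by side, they can be reproduced from plain Java without any native code. This is only a sketch and not something posted in the thread: the class ModifiedUtf8Demo and its method names are made up for illustration, and it uses DataOutputStream.writeUTF (which is documented to write modified UTF-8 after a two-byte length prefix) as a stand-in for what GetStringUTFChars hands back. The emoji is assumed to be U+1F37B, which is what the four bytes -16 -97 -115 -69 decode to.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of the JDK or of the thread above.
final class ModifiedUtf8Demo {

    // Standard UTF-8: one 4-byte sequence per supplementary code point.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 as written by DataOutputStream.writeUTF: each surrogate
    // is encoded separately as 3 bytes, and U+0000 becomes 0xC0 0x80.
    // writeUTF only accepts strings that encode to at most 65535 bytes.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] withLength = buf.toByteArray();
        // Drop the 2-byte length prefix that writeUTF puts in front.
        return Arrays.copyOfRange(withLength, 2, withLength.length);
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F37B));
        System.out.println(Arrays.toString(standardUtf8(emoji)));
        // prints [-16, -97, -115, -69]
        System.out.println(Arrays.toString(modifiedUtf8(emoji)));
        // prints [-19, -96, -68, -19, -67, -69]
        System.out.println(Arrays.toString(modifiedUtf8("\0")));
        // prints [-64, -128]
    }
}
```

The last line shows the embedded-NUL point David mentions: in modified UTF-8 the character U+0000 never produces a zero byte, so the result can be treated as a NUL-terminated C string. A native caller that needs true UTF-8 (for a file system call, say) would have to get the bytes some other way, for example by converting on the Java side as in standardUtf8 and passing a byte[] down.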