My usage of GetStringUTFChars was to pass a String as a parameter to a system call that takes a NUL-terminated UTF-8 string (a file path). Obviously, the system call does not accept strings containing NUL. I suspect this is a common case.
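The behavior described here (true UTF-8, with an error if the string cannot be a valid C string) can be approximated today on the Java side before crossing JNI. A minimal sketch, with a hypothetical helper name of my own choosing:

```java
import java.nio.charset.StandardCharsets;

public class Utf8CPath {
    // Hypothetical helper: encode a String as true UTF-8 suitable for a
    // NUL-terminated C string, failing fast on an embedded U+0000.
    static byte[] toCString(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        for (byte b : utf8) {
            if (b == 0) {
                throw new IllegalArgumentException("string contains NUL");
            }
        }
        // Append the terminating NUL that C APIs expect.
        byte[] cstr = new byte[utf8.length + 1];
        System.arraycopy(utf8, 0, cstr, 0, utf8.length);
        return cstr;
    }

    public static void main(String[] args) {
        // "/tmp/" (5 bytes) + U+1F37B (4 bytes) + ".txt" (4 bytes) + NUL
        byte[] path = toCString("/tmp/\uD83C\uDF7B.txt");
        System.out.println(path.length); // prints 14

        try {
            toCString("bad\u0000name");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The resulting bytes can then be passed to native code as a byte[], sidestepping GetStringUTFChars entirely.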
Therefore, my needs would be met by a (new) primitive that returns UTF-8 and fails if the String contains NUL.

In addition, I would suggest either of these options:

(1) Document GetStringUTFChars as deprecated, introduce a new primitive GetStringCharsInternalRepresentationModifiedUTF, and use C support for deprecated members, where available, to provide compile-time warnings when GetStringUTFChars is used.

(2) Rename GetStringUTFChars to GetStringCharsInternalRepresentationModifiedUTF. I believe this is a binary-compatible change, but new builds will fail, forcing developers to choose which behavior they really want.

Alan

> On Jan 26, 2019, at 7:30 AM, Claes Redestad <claes.redes...@oracle.com> wrote:
>
> Modified UTF-8 goes way back in terms of internal use in Java and its
> JVMs. It's the format used to store strings in class files, and it is
> used as an internal representation in the HotSpot VM: various internal
> string tables, constant pools, etc.
>
> So any Java code that interacts with the VM needs to know how to
> convert back and forth between Java Strings and the VM's flavor of
> modified UTF-8. As long as the JVM speaks modified UTF-8 internally,
> we'll need the utilities to convert back and forth. Changing this
> fundamental design is likely to be way more trouble than it's ever
> worth.
>
> As to "why does the VM do this!?", I'm too young to really know for
> sure, but it's fun to speculate. [1]
>
> I think we all welcome constructive suggestions on how to help
> developers notice that the "UTF" JNI methods aren't what your
> intuition might tell you. I've been there myself and learned about
> modified UTF-8 the hard way.
>
> /Claes
>
> [1]
>
> It turns out there are a few obvious technical difficulties with
> UTF-8, especially dealing with strings that encode '\0' characters
> (a.k.a. null) in the context of C/C++ code.
> C-strings (char*) are null-terminated, and there's a lot of code and
> utilities that'd break or behave weirdly if you give them char*s with
> embedded nulls in them...
>
> But UTF-8 is still mostly an attractive, compact encoding for the kind
> of strings JVMs care about: most of them are ASCII String literals for
> methods and fields encoded into class files, and UTF-8 encodes ASCII
> without any overhead!
>
> But it allows null chars, and to support that you need to encode the
> length... Ugh, overhead! Can't have that! What to do?!
>
> The designers likely thought it'd be less trouble modifying this new,
> shiny UTF-8 encoding to get something similar to it that disallows
> embedded nulls. And why not: it's only for Java/JVM-internal stuff
> that no one on the outside needs to know about, right? And it's
> *mostly* compatible. And no one uses real UTF-8, anyhow!
>
> The context here is that Unicode and UTF-8 were still relatively new
> (RFCs filed in 1993 and 1996, respectively). The fact that UTF-8 would
> eventually become the de facto encoding standard was not something
> anyone could have known back then.
>
> As it happens, "modified UTF-8" took root in the emerging world of
> JVMs, and spread to a number of surprising places throughout the Java
> SE libraries, like java.io.DataInput/Output:
> https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/DataInput.html#modified-utf-8
>
> Today, in C++14, std::string supports embedding '\0' values, and is
> thus much more UTF-8 friendly than good old C-strings. I find it
> unlikely that "modified UTF-8" would be a thing if the JVM were
> designed from scratch today ("C++ in 2019!?").
>
> On 2019-01-26 05:24, Alan Snyder wrote:
>> The reason to change is that returning UTF-8 is useful and returning
>> "modified UTF-8" is apparently not (as no one has explained why it is
>> useful).
>>
>> Why not deprecate it?
>>
>> It would be nice to get a warning.
>> Alan
>>
>>> On Jan 25, 2019, at 6:40 PM, David Holmes <david.hol...@oracle.com> wrote:
>>>
>>> On 26/01/2019 3:29 am, Alan Snyder wrote:
>>>> My question was not about why it does what it does, but why it
>>>> still does that. Is there a valid use of this primitive that
>>>> depends upon it returning something other than true UTF-8?
>>>
>>> It still does what it does because that was how it was specified 20+
>>> years ago and there's been no reason to change.
>>>
>>>> It may not have been an issue to you, but it was to me when I
>>>> discovered my program could not handle certain file names. I'll bet
>>>> I'm not the last person to assume that a primitive named
>>>> GetStringUTFChars returns UTF.
>>>
>>> It does return chars in a UTF (Unicode transformation format) - that
>>> format is a modified UTF-8 format. It isn't named GetStringUTF8Chars.
>>>
>>> The documentation is quite clear:
>>>
>>> GetStringUTFChars
>>>
>>> const char * GetStringUTFChars(JNIEnv *env, jstring string,
>>>                                jboolean *isCopy);
>>>
>>> Returns a pointer to an array of bytes representing the string in
>>> modified UTF-8 encoding.
>>>
>>> ---
>>>
>>> David
>>> -----
>>>
>>>> I have fixed my code, so it's not an issue for me any more, but it
>>>> seems like an unnecessary tarpit awaiting the unwary.
>>>>
>>>> Just my 2c.
>>>>
>>>> Alan
>>>>
>>>>> On Jan 24, 2019, at 10:04 PM, David Holmes <david.hol...@oracle.com> wrote:
>>>>>
>>>>> On 25/01/2019 4:39 am, Alan Snyder wrote:
>>>>>> Thank you. That post does explain what is happening, but leaves
>>>>>> open the question of whether GetStringUTFChars should be changed.
>>>>>>
>>>>>> What is the value of the current implementation of
>>>>>> GetStringUTFChars versus one that returns true UTF-8?
>>>>>
>>>>> Well, that's really a HotSpot question as it concerns JNI, but this
>>>>> is ancient history. There's little point musing over the "why" of
>>>>> decisions made back in the late 1990s. But I suspect the main
>>>>> reason is the avoidance of embedded NUL characters.
>>>>>
>>>>> The only bug report I can see on this (basically the same issue you
>>>>> are reporting) was back in 2004:
>>>>>
>>>>> https://bugs.openjdk.java.net/browse/JDK-5030776
>>>>>
>>>>> so it simply has not been an issue. As per the SO article that
>>>>> Claes referenced, anyone needing true UTF-8 has a couple of paths
>>>>> to achieve that.
>>>>>
>>>>> Cheers,
>>>>> David
>>>>> -----
>>>>>
>>>>>> Alan
>>>>>>
>>>>>>> On Jan 24, 2019, at 10:32 AM, Claes Redestad <claes.redes...@oracle.com> wrote:
>>>>>>>
>>>>>>> Hi Alan,
>>>>>>>
>>>>>>> GetStringUTFChars unfortunately doesn't give you true UTF-8, but
>>>>>>> a modified UTF-8 sequence as used by the VM internally for
>>>>>>> historical reasons.
>>>>>>>
>>>>>>> See answers to this related question on SO (which contains links
>>>>>>> to official docs):
>>>>>>> https://stackoverflow.com/questions/32205446/getting-true-utf-8-characters-in-java-jni
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> /Claes
>>>>>>>
>>>>>>> On 2019-01-24 19:23, Alan Snyder wrote:
>>>>>>>> I am having a problem with file names that contain emojis when
>>>>>>>> passed to a macOS system call.
>>>>>>>>
>>>>>>>> Things work when I convert the path to bytes in Java, but fail
>>>>>>>> (file not found) when I convert the path to bytes in native code
>>>>>>>> using GetStringUTFChars.
>>>>>>>>
>>>>>>>> For example, where String.getBytes() returns
>>>>>>>>
>>>>>>>> -16 -97 -115 -69
>>>>>>>>
>>>>>>>> GetStringUTFChars returns:
>>>>>>>>
>>>>>>>> -19 -96 -68 -19 -67 -69
>>>>>>>>
>>>>>>>> I'm not a UTF expert, so can someone say whether I should file a
>>>>>>>> bug report?
>>>>>>>>
>>>>>>>> (Tested in JDK 9, 11, and a fairly recent 12)
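For reference, the two byte sequences reported above can be reproduced in pure Java, without JNI: String.getBytes(UTF_8) yields standard UTF-8, while DataOutputStream.writeUTF is documented (see the DataInput "modified-utf-8" link earlier in the thread) to emit the same modified UTF-8 that GetStringUTFChars returns, prefixed with a two-byte length. The bytes correspond to U+1F37B, encoded either as one four-byte UTF-8 sequence or as two three-byte encoded UTF-16 surrogates. A small sketch:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {
    // Returns the modified UTF-8 bytes that writeUTF produces,
    // with its two-byte length prefix stripped off.
    static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bos)) {
            out.writeUTF(s);
        }
        byte[] withLength = bos.toByteArray();
        return Arrays.copyOfRange(withLength, 2, withLength.length);
    }

    public static void main(String[] args) throws IOException {
        // U+1F37B, a supplementary character (surrogate pair in UTF-16).
        String s = new String(Character.toChars(0x1F37B));

        // Standard UTF-8: a single four-byte sequence.
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));
        // prints [-16, -97, -115, -69]

        // Modified UTF-8: each UTF-16 surrogate encoded separately, six bytes.
        System.out.println(Arrays.toString(modifiedUtf8(s)));
        // prints [-19, -96, -68, -19, -67, -69]

        // Modified UTF-8 also encodes U+0000 as two bytes (0xC0 0x80),
        // which is how it avoids embedded NULs in char* strings.
        System.out.println(Arrays.toString(modifiedUtf8("\u0000")));
        // prints [-64, -128]
    }
}
```

The six-byte form is what the macOS system call rejects: it is not valid UTF-8, since lone surrogates are not legal code points in a UTF-8 stream.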