Re: possible problem with JNI GetStringUTFChars

Alan Snyder Fri, 25 Jan 2019 20:28:39 -0800

The reason to change is that returning UTF-8 is useful and returning “modified 
UTF-8” is apparently not (as no one has explained why it is useful).


Why not deprecate it?

It would be nice to get a warning.

  Alan


> On Jan 25, 2019, at 6:40 PM, David Holmes <david.hol...@oracle.com> wrote:
> 
> On 26/01/2019 3:29 am, Alan Snyder wrote:
>> My question was not about why it does what it does, but why it still does 
>> that. Is there a valid use of this primitive that depends upon it returning 
>> something other than true UTF-8?
> 
> It still does what it does because that was how it was specified 20+ years 
> ago and there's been no reason to change.
> 
>> It may not have been an issue to you, but it was to me when I discovered my 
>> program could not handle certain file names. I’ll bet I’m not the last 
>> person to assume that a primitive named GetStringUTFChars returns UTF.
> 
> It does return chars in a UTF (Unicode transformation format) - that format 
> is a modified UTF-8 format. It isn't named GetStringUTF8Chars.
> 
> The documentation is quite clear:
> 
> GetStringUTFChars
> 
> const char * GetStringUTFChars(JNIEnv *env, jstring string, jboolean *isCopy);
> 
> Returns a pointer to an array of bytes representing the string in modified 
> UTF-8 encoding.
> 
> ---
> 
> David
> -----
> 
>> I have fixed my code, so it's not an issue for me any more, but it seems like 
>> an unnecessary tarpit awaiting the unwary.
>> Just my 2c.
>>   Alan
>>> On Jan 24, 2019, at 10:04 PM, David Holmes <david.hol...@oracle.com> wrote:
>>> 
>>> On 25/01/2019 4:39 am, Alan Snyder wrote:
>>>> Thank you. That post does explain what is happening, but leaves open the 
>>>> question of whether GetStringUTFChars should be changed.
>>>> What is the value of the current implementation of GetStringUTFChars 
>>>> versus one that returns true UTF-8?
>>> 
>>> Well that's really a HotSpot question as it concerns JNI, but this is 
>>> ancient history. There's little point musing over the "why" of decisions 
>>> made back in the late 1990s. But I suspect the main reason is the 
>>> avoidance of embedded NUL characters.
>>> 
>>> The only bug report I can see on this (basically the same issue you are 
>>> reporting) was back in 2004:
>>> 
>>> https://bugs.openjdk.java.net/browse/JDK-5030776
>>> 
>>> so it simply has not been an issue. As per the SO article that Claes 
>>> referenced, anyone needing true UTF-8 has a couple of paths to achieve that.
>>> 
>>> Cheers,
>>> David
>>> -----
>>> 
>>> 
>>>>   Alan
>>>>> On Jan 24, 2019, at 10:32 AM, Claes Redestad <claes.redes...@oracle.com> 
>>>>> wrote:
>>>>> 
>>>>> Hi Alan,
>>>>> 
>>>>> GetStringUTFChars unfortunately doesn't give you true UTF-8, but a 
>>>>> modified UTF-8 sequence
>>>>> as used by the VM internally for historical reasons.
>>>>> 
>>>>> See answers to this related question on SO (which contains links to 
>>>>> official docs):
>>>>> https://stackoverflow.com/questions/32205446/getting-true-utf-8-characters-in-java-jni
>>>>> 
>>>>> HTH
>>>>> 
>>>>> /Claes
>>>>> 
>>>>> On 2019-01-24 19:23, Alan Snyder wrote:
>>>>>> I am having a problem with file names that contain emojis when passed to 
>>>>>> a macOS system call.
>>>>>> 
>>>>>> Things work when I convert the path to bytes in Java, but fail (file not 
>>>>>> found) when I convert the path to bytes in native code using 
>>>>>> GetStringUTFChars.
>>>>>> 
>>>>>> For example, where String.getBytes() returns
>>>>>> 
>>>>>> -16 -97 -115 -69
>>>>>> 
>>>>>> GetStringUTFChars returns:
>>>>>> 
>>>>>> -19 -96 -68 -19 -67 -69
>>>>>> 
>>>>>> I’m not a UTF expert, so can someone say whether I should file a bug 
>>>>>> report?
>>>>>> 
>>>>>> (Tested in JDK 9, 11, and a fairly recent 12)
>>>>>> 
>>>>> 
>>> 
>
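
For readers who want to reproduce the two byte sequences quoted at the bottom of the thread without writing native code, here is a minimal Java sketch. It assumes that DataOutputStream.writeUTF, which is documented to emit modified UTF-8 after a two-byte length prefix, is an acceptable stand-in for what GetStringUTFChars returns. The class and method names are made up for illustration, and the test string is U+1F37B, the code point that the four-byte sequence in the original report decodes to.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of the JDK or of the original thread.
class ModifiedUtf8Demo {

    // Standard UTF-8: a supplementary character becomes one 4-byte sequence.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8, obtained here through DataOutputStream.writeUTF
    // (an assumption: used as a proxy for the JNI encoding).
    static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        byte[] all = buf.toByteArray();
        // writeUTF prefixes the data with a 2-byte length; drop it.
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) throws IOException {
        String s = "\uD83C\uDF7B"; // U+1F37B written as a surrogate pair

        // prints [-16, -97, -115, -69]
        System.out.println(Arrays.toString(standardUtf8(s)));

        // prints [-19, -96, -68, -19, -67, -69]: each surrogate is encoded
        // separately as a 3-byte sequence
        System.out.println(Arrays.toString(modifiedUtf8(s)));

        // prints [-64, -128]: U+0000 is encoded as two bytes, so the result
        // never contains an embedded zero byte
        System.out.println(Arrays.toString(modifiedUtf8("\0")));
    }
}
```

The last line illustrates the embedded-NUL point raised in the thread. One of the paths to true UTF-8 is the one the original report already found to work: convert in Java with getBytes(StandardCharsets.UTF_8) and hand the resulting byte[] to native code.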
