On 26/01/2019 3:29 am, Alan Snyder wrote:
My question was not about why it does what it does, but why it still does that. 
Is there a valid use of this primitive that depends upon it returning something 
other than true UTF-8?

It still does what it does because that was how it was specified 20+ years ago and there's been no reason to change.

It may not have been an issue to you, but it was to me when I discovered my 
program could not handle certain file names. I’ll bet I’m not the last person 
to assume that a primitive named GetStringUTFChars returns UTF.

It does return chars in a UTF (Unicode transformation format) - that format is a modified UTF-8 format. It isn't named GetStringUTF8Chars.

The documentation is quite clear:

GetStringUTFChars

const char * GetStringUTFChars(JNIEnv *env, jstring string, jboolean *isCopy);

Returns a pointer to an array of bytes representing the string in modified UTF-8 encoding.

---

David
-----

I have fixed my code, so it's not an issue for me any more, but it seems like an 
unnecessary tarpit awaiting the unwary.

Just my 2c.

   Alan


On Jan 24, 2019, at 10:04 PM, David Holmes <david.hol...@oracle.com> wrote:

On 25/01/2019 4:39 am, Alan Snyder wrote:
Thank you. That post does explain what is happening, but leaves open the 
question of whether GetStringUTFChars should be changed.
What is the value of the current implementation of GetStringUTFChars versus one 
that returns true UTF-8?

Well, that's really a HotSpot question as it concerns JNI, but this is ancient history. 
There's little point musing over the "why" of decisions made back in the late 
1990s. But I suspect the main reason is the avoidance of embedded NUL characters.
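
To illustrate the embedded-NUL point, here is a minimal sketch. It assumes that DataOutputStream.writeUTF, which is documented to write modified UTF-8, is an acceptable stand-in for what the VM produces; the class and method names are made up for this example. In standard UTF-8, U+0000 becomes a single zero byte, which would end a C string early. In modified UTF-8 it becomes the two bytes C0 80.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the JDK or of the original thread.
public class EmbeddedNulDemo {

    // Encodes s as modified UTF-8 via DataOutputStream.writeUTF and drops
    // the two-byte length prefix that writeUTF puts in front of the data.
    static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) throws IOException {
        String s = "a\u0000b";
        // Standard UTF-8 contains a raw zero byte: [97, 0, 98]
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));
        // Modified UTF-8 has no zero byte: [97, -64, -128, 98]
        System.out.println(Arrays.toString(modifiedUtf8(s)));
    }
}
```

Note that writeUTF only accepts strings whose encoded form fits in 65535 bytes, so this helper is suitable for small experiments only.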

The only bug report I can see on this (basically the same issue you are 
reporting) was back in 2004:

https://bugs.openjdk.java.net/browse/JDK-5030776

so it simply has not been an issue. As per the SO article that Claes referenced, 
anyone needing true UTF-8 has a couple of paths to achieve that.
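
One such path, and the one reported in this thread as working, is to do the conversion on the Java side and hand the native code a byte array. A minimal sketch follows, assuming the native function expects a NUL-terminated C string; the class name, the helper name, and the commented-out native method are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of the JDK or of the original thread.
public class NativePathBytes {

    // Returns the standard UTF-8 bytes of path followed by a single zero
    // byte, so native code can treat the array contents as a C string.
    static byte[] toCString(String path) {
        byte[] utf8 = path.getBytes(StandardCharsets.UTF_8);
        // Arrays.copyOf pads the extra element with zero.
        return Arrays.copyOf(utf8, utf8.length + 1);
    }

    // A hypothetical native entry point would then take the bytes directly:
    // static native int openPath(byte[] nulTerminatedUtf8Path);

    public static void main(String[] args) {
        // U+1F37B written as a surrogate pair.
        System.out.println(Arrays.toString(toCString("\uD83C\uDF7B")));
    }
}
```

On the native side the array would be read with the JNI byte-array accessors instead of GetStringUTFChars. Be aware that String.getBytes replaces unpaired surrogates with the charset's replacement bytes, so a malformed string does not round-trip.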

Cheers,
David
-----


   Alan
On Jan 24, 2019, at 10:32 AM, Claes Redestad <claes.redes...@oracle.com> wrote:

Hi Alan,

GetStringUTFChars unfortunately doesn't give you true UTF-8, but a modified 
UTF-8 sequence as used by the VM internally for historical reasons.

See answers to this related question on SO (which contains links to official 
docs):
https://stackoverflow.com/questions/32205446/getting-true-utf-8-characters-in-java-jni

HTH

/Claes

On 2019-01-24 19:23, Alan Snyder wrote:
I am having a problem with file names that contain emojis when passed to a 
macOS system call.

Things work when I convert the path to bytes in Java, but fail (file not found) 
when I convert the path to bytes in native code using GetStringUTFChars.

For example, where String.getBytes() returns

-16 -97 -115 -69

GetStringUTFChars returns:

-19 -96 -68 -19 -67 -69

I’m not a UTF expert, so can someone say whether I should file a bug report?

(Tested in JDK 9, 11, and a fairly recent 12)
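
The two byte sequences above can be reproduced in plain Java. The sketch below is a hand-written encoder, not the VM's code: it assumes the only relevant differences from standard UTF-8 are that each UTF-16 code unit is encoded on its own (so a surrogate pair becomes two three-byte groups) and that U+0000 takes two bytes. The class and method names are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the JDK or of the original thread.
public class SurrogateEncodingDemo {

    // Encodes every UTF-16 code unit separately, the way modified UTF-8
    // does, instead of first combining surrogate pairs into a code point.
    static byte[] encodePerCodeUnit(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                out.write(c);                          // one byte
            } else if (c <= 0x07FF) {
                out.write(0xC0 | (c >> 6));            // two bytes, also used for U+0000
                out.write(0x80 | (c & 0x3F));
            } else {
                out.write(0xE0 | (c >> 12));           // three bytes, including surrogates
                out.write(0x80 | ((c >> 6) & 0x3F));
                out.write(0x80 | (c & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String emoji = "\uD83C\uDF7B"; // U+1F37B as a surrogate pair
        // Standard UTF-8: [-16, -97, -115, -69]
        System.out.println(Arrays.toString(emoji.getBytes(StandardCharsets.UTF_8)));
        // Per-code-unit encoding: [-19, -96, -68, -19, -67, -69]
        System.out.println(Arrays.toString(encodePerCodeUnit(emoji)));
    }
}
```

The four-byte form is what file system calls expecting UTF-8 will match; the six-byte form is two encoded surrogates, which is not valid standard UTF-8.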



