cnauroth commented on pull request #3391:
URL: https://github.com/apache/hadoop/pull/3391#issuecomment-918677709


   @majdyz , thanks for refining the test case.  Thanks also to @steveloughran 
and @ayushtkn for their testing.
   
   I can see the repro now.  Interestingly, even though it repros on 
branch-3.3, it does not repro on branch-3.2.  I suspect though that's not 
because branch-3.2 has correct Unicode handling, but rather that both the 
`mkdir` and `chmod` steps of this logic are doing the wrong thing, but they are 
at least in agreement on doing the same wrong thing.  It looks like the end 
result on branch-3.2 is a successful call, but the new directory doesn't have 
the correct name.
   
   Here is the reason it's happening.  The call performs the `mkdir` using Java 
classes and then performs the `chmod` through native code.  I traced through 
the Java and native layers.  I confirmed that the string passes correctly 
through all Java frames, and then inside the native layer it gets garbled.  
This is because the string data is fetched through the JNI `GetStringUTFChars` 
function, which is documented to return the JVM's internal "modified UTF-8" 
representation.  The `chmod` call would work fine with true UTF-8, but there 
are edge cases where the modified UTF-8 would differ from the standard and not 
work correctly.
   
   The JavaDocs of 
[`DataInputFormat`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/DataInput.html)
 discuss the differences between standard UTF-8 and modified UTF-8.  This is 
the key point relevant to this bug:
   
   > Only the 1-byte, 2-byte, and 3-byte formats are used.
   
   The character in your test case is an emoji that has a 4-byte standard UTF-8 
representation, but modified UTF-8 chooses to represent it as 2 surrogate pairs 
of 3 bytes each.  That encoding won't be compatible with the C APIs that we're 
calling.
   
   This problem is likely not unique to the `chmod` call.  We use 
`GetStringUTFChars` in multiple places.
   
   Interestingly, it's possible that this bug doesn't repro at all on Windows, 
where the JNI code path uses `GetStringChars` instead to fetch the UTF-16 
representation and convert to Windows wide-character string types.  I no longer 
have easy access to a Windows test environment to confirm though.
   
   Now what might we do about this?  Some ideas:
   1. Explore a possible migration from native `chmod` to the new Java NIO File 
APIs, e.g. 
[`Files#setPosixFilePermissions`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/nio/file/Files.html#setPosixFilePermissions(java.nio.file.Path,java.util.Set)).
  These native code paths were introduced before the JDK introduced APIs for 
permission changes.
   2. Convert the code to use `GetStringChars` to fetch the UTF-16 bytes and 
pass them back to the JDK encoding conversion methods.  That would give us a 
way to get to true UTF-8.  This introduces a performance risk from extra buffer 
copying and extra transitions over the Java-JNI boundary.  It's also probably a 
more confusing call flow.
   3. Use `GetStringChars` to get UTF-16, but do the conversion to UTF-8 
directly within the JNI layer using something like 
[`iconv`](https://linux.die.net/man/3/iconv).  That would avoid some of the 
weirdness and performance penalties of transitioning back over the Java 
boundary.
   4. Change the code to convert to a UTF-8 `byte[]` at the Java layer and pass 
that as-is to JNI.  (Stay away from `jstring` entirely.)  This would re-raise 
the debate of whether it's considered a backward-incompatible change if the 
interface between the Java layer and the JNI layer changes.
   
   It's probably natural for everyone to ask "isn't there some simpler way to 
get true UTF-8 out of a `jstring`?"  I'm pretty sure there is no convenient 
function for that, but I'd love if someone proved me wrong.  All of the 
solutions I can think of are fairly large in scope.  Then, there is testing 
effort too of course.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to