cnauroth commented on pull request #3391: URL: https://github.com/apache/hadoop/pull/3391#issuecomment-918677709
@majdyz , thanks for refining the test case. Thanks also to @steveloughran and @ayushtkn for their testing. I can see the repro now. Interestingly, even though it repros on branch-3.3, it does not repro on branch-3.2. I suspect though that's not because branch-3.2 has correct Unicode handling, but rather that both the `mkdir` and `chmod` steps of this logic are doing the wrong thing, but they are at least in agreement on doing the same wrong thing. It looks like the end result on branch-3.2 is a successful call, but the new directory doesn't have the correct name. Here is the reason it's happening. The call performs the `mkdir` using Java classes and then performs the `chmod` through native code. I traced through the Java and native layers. I confirmed that the string passes correctly through all Java frames, and then inside the native layer it gets garbled. This is because the string data is fetched through the JNI `GetStringUTFChars` function, which is documented to return the JVM's internal "modified UTF-8" representation. The `chmod` call would work fine with true UTF-8, but there are edge cases where the modified UTF-8 would differ from the standard and not work correctly. The JavaDocs of [`DataInputFormat`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/DataInput.html) discuss the differences between standard UTF-8 and modified UTF-8. This is the key point relevant to this bug: > Only the 1-byte, 2-byte, and 3-byte formats are used. The character in your test case is an emoji that has a 4-byte standard UTF-8 representation, but modified UTF-8 chooses to represent it as 2 surrogate pairs of 3 bytes each. That encoding won't be compatible with the C APIs that we're calling. This problem is likely not unique to the `chmod` call. We use `GetStringUTFChars` in multiple places. Interestingly, it's possible that this bug doesn't repro at all on Windows, where the JNI code path uses `GetStringChars` instead to fetch the UTF-16 representation and convert to Windows wide-character string types. I no longer have easy access to a Windows test environment to confirm though. Now what might we do about this? Some ideas: 1. Explore a possible migration from native `chmod` to the new Java NIO File APIs, e.g. [`Files#setPosixFilePermissions`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/nio/file/Files.html#setPosixFilePermissions(java.nio.file.Path,java.util.Set)). These native code paths were introduced before the JDK introduced APIs for permission changes. 2. Convert the code to use `GetStringChars` to fetch the UTF-16 bytes and pass them back to the JDK encoding conversion methods. That would give us a way to get to true UTF-8. This introduces a performance risk from extra buffer copying and extra transitions over the Java-JNI boundary. It's also probably a more confusing call flow. 3. Use `GetStringChars` to get UTF-16, but do the conversion to UTF-8 directly within the JNI layer using something like [`iconv`](https://linux.die.net/man/3/iconv). That would avoid some of the weirdness and performance penalties of transitioning back over the Java boundary. 4. Change the code to convert to a UTF-8 `byte[]` at the Java layer and pass that as-is to JNI. (Stay away from `jstring` entirely.) This would re-raise the debate of whether it's considered a backward-incompatible change if the interface between the Java layer and the JNI layer changes. It's probably natural for everyone to ask "isn't there some simpler way to get true UTF-8 out of a `jstring`?" I'm pretty sure there is no convenient function for that, but I'd love if someone proved me wrong. All of the solutions I can think of are fairly large in scope. Then, there is testing effort too of course. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
