[
https://issues.apache.org/jira/browse/HADOOP-17895?focusedWorklogId=650279&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-650279
]
ASF GitHub Bot logged work on HADOOP-17895:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 14/Sep/21 00:10
Start Date: 14/Sep/21 00:10
Worklog Time Spent: 10m
Work Description: cnauroth commented on pull request #3391:
URL: https://github.com/apache/hadoop/pull/3391#issuecomment-918677709
@majdyz , thanks for refining the test case. Thanks also to @steveloughran
and @ayushtkn for their testing.
I can see the repro now. Interestingly, even though it repros on
branch-3.3, it does not repro on branch-3.2. I suspect though that's not
because branch-3.2 has correct Unicode handling, but rather that both the
`mkdir` and `chmod` steps of this logic are doing the wrong thing, but they are
at least in agreement on doing the same wrong thing. It looks like the end
result on branch-3.2 is a successful call, but the new directory doesn't have
the correct name.
Here is the reason it's happening. The call performs the `mkdir` using Java
classes and then performs the `chmod` through native code. I traced through
the Java and native layers. I confirmed that the string passes correctly
through all Java frames, and then inside the native layer it gets garbled.
This is because the string data is fetched through the JNI `GetStringUTFChars`
function, which is documented to return the JVM's internal "modified UTF-8"
representation. The `chmod` call would work fine with true UTF-8, but there
are edge cases where the modified UTF-8 would differ from the standard and not
work correctly.
The JavaDocs of
[`DataInputFormat`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/DataInput.html)
discuss the differences between standard UTF-8 and modified UTF-8. This is
the key point relevant to this bug:
> Only the 1-byte, 2-byte, and 3-byte formats are used.
The character in your test case is an emoji that has a 4-byte standard UTF-8
representation, but modified UTF-8 chooses to represent it as 2 surrogate pairs
of 3 bytes each. That encoding won't be compatible with the C APIs that we're
calling.
This problem is likely not unique to the `chmod` call. We use
`GetStringUTFChars` in multiple places.
Interestingly, it's possible that this bug doesn't repro at all on Windows,
where the JNI code path uses `GetStringChars` instead to fetch the UTF-16
representation and convert to Windows wide-character string types. I no longer
have easy access to a Windows test environment to confirm though.
Now what might we do about this? Some ideas:
1. Explore a possible migration from native `chmod` to the new Java NIO File
APIs, e.g.
[`Files#setPosixFilePermissions`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/nio/file/Files.html#setPosixFilePermissions(java.nio.file.Path,java.util.Set)).
These native code paths were introduced before the JDK introduced APIs for
permission changes.
2. Convert the code to use `GetStringChars` to fetch the UTF-16 bytes and
pass them back to the JDK encoding conversion methods. That would give us a
way to get to true UTF-8. This introduces a performance risk from extra buffer
copying and extra transitions over the Java-JNI boundary. It's also probably a
more confusing call flow.
3. Use `GetStringChars` to get UTF-16, but do the conversion to UTF-8
directly within the JNI layer using something like
[`iconv`](https://linux.die.net/man/3/iconv). That would avoid some of the
weirdness and performance penalties of transitioning back over the Java
boundary.
4. Change the code to convert to a UTF-8 `byte[]` at the Java layer and pass
that as-is to JNI. (Stay away from `jstring` entirely.) This would re-raise
the debate of whether it's considered a backward-incompatible change if the
interface between the Java layer and the JNI layer changes.
It's probably natural for everyone to ask "isn't there some simpler way to
get true UTF-8 out of a `jstring`?" I'm pretty sure there is no convenient
function for that, but I'd love if someone proved me wrong. All of the
solutions I can think of are fairly large in scope. Then, there is testing
effort too of course.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 650279)
Time Spent: 2h (was: 1h 50m)
> RawLocalFileSystem mkdirs with unicode filename doesn't work in Hadoop 3
> ------------------------------------------------------------------------
>
> Key: HADOOP-17895
> URL: https://issues.apache.org/jira/browse/HADOOP-17895
> Project: Hadoop Common
> Issue Type: Bug
> Components: common, fs
> Affects Versions: 3.3.1
> Reporter: Zamil Majdy
> Priority: Major
> Labels: pull-request-available
> Time Spent: 2h
> Remaining Estimate: 0h
>
> *Bug description:*
> `fs.mkdirs` command for `RawLocalFileSystem` doesn't work in Hadoop 3 with
> NativeIO enabled.
> The failure was happening when doing the native `chmod` command to the file
> (the `mkdir` command itself is working).
> Stacktrace:
> {{ENOENT: No such file or directory ENOENT: No such file or directory at
> org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Native Method) at
> org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:382) at
> org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:974)
> at
> org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:660)
> at
> org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:700)
> at
> org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:672)}}
>
> *To reproduce:*
> * Add `fs.mkdirs` in RawLocalFileSystem with NativeIO enabled.
> * Sample: [https://github.com/apache/hadoop/pull/3391]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]