Christopher Tubbs created HADOOP-19815:
------------------------------------------

             Summary: Path normalizes away important trailing slash used for URI.resolve(other)
                 Key: HADOOP-19815
                 URL: https://issues.apache.org/jira/browse/HADOOP-19815
             Project: Hadoop Common
          Issue Type: Bug
          Components: common
    Affects Versions: 3.4.2
            Reporter: Christopher Tubbs


This issue appears to be a relatively long-standing bug with Hadoop's 
FileSystem and Path classes, but is nevertheless important.

The core of the issue is that {{URI.resolve(...)}} relies on a trailing slash 
to determine how to resolve path components, but the trailing slash is often 
stripped out in common code paths for FileSystem and Path. This causes problems 
when trying to resolve new URIs/Paths from existing ones. Constructing a Path 
from a URI, rather than from a String or another Path, does preserve the 
original URI, so resolution works correctly in that case. The overall result 
is highly inconsistent behavior that depends on how the Path was constructed 
and how the original URI was preserved internally.
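
To illustrate the {{URI.resolve(...)}} behavior described above in isolation, here is a minimal sketch that uses only {{java.net.URI}} and no Hadoop classes. The class name {{TrailingSlashDemo}} and its helper method are hypothetical, introduced only for this example.

```java
import java.net.URI;

// Hypothetical demo class; uses only java.net.URI, no Hadoop classes.
final class TrailingSlashDemo {

    // Resolves 'relative' against 'base' using plain java.net.URI semantics.
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        // No trailing slash: the last segment ("somewhere") is replaced.
        System.out.println(resolve("hdfs://localhost:8020/path/to/somewhere", "other"));
        // Trailing slash: the base is treated as a directory, so "other" is appended.
        System.out.println(resolve("hdfs://localhost:8020/path/to/somewhere/", "other"));
    }
}
```

The first call yields {{hdfs://localhost:8020/path/to/other}} and the second yields {{hdfs://localhost:8020/path/to/somewhere/other}}, which is why losing the trailing slash changes the outcome.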

However, even if one argues that the String constructor for Path is supposed to 
normalize, and the URI constructor is supposed to preserve, the problem also 
exists with many of the {{FileSystem}} methods, such as {{fs.getUri()}}, 
{{fs.getHomeDirectory()}}, {{fs.getWorkingDirectory()}}, etc. So, one 
must do convoluted string manipulation to resolve one Path from another.

For example:
{code:java}
new Path("hdfs://localhost:8020/path/to/somewhere").toUri().resolve("other");
// expected ==> URI(hdfs://localhost:8020/path/to/other)
// actual ==> URI(hdfs://localhost:8020/path/to/other)

new Path("hdfs://localhost:8020/path/to/somewhere/").toUri().resolve("other");
// expected ==> URI(hdfs://localhost:8020/path/to/somewhere/other)
// actual ==> URI(hdfs://localhost:8020/path/to/other)



new Path(new URI("hdfs://localhost:8020/path/to/somewhere")).toUri().resolve("other");
// expected ==> URI(hdfs://localhost:8020/path/to/other)
// actual ==> URI(hdfs://localhost:8020/path/to/other)

new Path(new URI("hdfs://localhost:8020/path/to/somewhere/")).toUri().resolve("other");
// expected ==> URI(hdfs://localhost:8020/path/to/somewhere/other)
// actual ==> URI(hdfs://localhost:8020/path/to/somewhere/other)



var fs = FileSystem.get(new Configuration());
fs.getUri();
// expected ==> URI(hdfs://localhost:8020/)
// actual ==> URI(hdfs://localhost:8020)
// (probably matters more for LocalFileSystem or viewfs, etc.)

fs.getWorkingDirectory().toUri();
fs.getHomeDirectory().toUri();
// expected ==> URI(hdfs://localhost:8020/user/me/)
// actual ==> URI(hdfs://localhost:8020/user/me)

// broken code
URI relativeURI = new URI("mytempdir");
fs.getWorkingDirectory().toUri().resolve(relativeURI);
// expected ==> hdfs://localhost:8020/user/me/mytempdir
// actual ==> hdfs://localhost:8020/user/mytempdir

// convoluted workaround (assuming a relative path in the suffix, without any
// other URI elements); reuses relativeURI from the broken example above
fs.getWorkingDirectory().suffix("/" + relativeURI.toString()).toUri();
// expected ==> hdfs://localhost:8020/user/me/mytempdir
// actual ==> hdfs://localhost:8020/user/me/mytempdir
{code}
Some of this is workable, so long as you're staying with Path, but the moment 
you try to work with URIs/URLs, things get convoluted quickly. You need 
{{toString()}} calls and concatenation with slash {{/}} characters, and you 
must handle edge cases where the other path isn't relative, or contains a 
different authority or scheme, etc. These are things {{URI.resolve()}} would 
already handle, so code gets unnecessarily complex just to work around these 
API problems.
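
As one possible alternative to string concatenation, the sketch below restores the trailing slash on the base URI and then delegates to {{URI.resolve()}}, so absolute paths and different schemes are handled by the JDK. It is not part of Hadoop: the class {{DirectoryResolve}} and its methods {{asDirectory}} and {{resolveAsDirectory}} are hypothetical names, and the sketch assumes the base URI is known to refer to a directory.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not a Hadoop API; uses only java.net.URI.
final class DirectoryResolve {

    // Returns 'base' with a trailing slash on its path. Assumes the caller
    // knows that 'base' refers to a directory.
    static URI asDirectory(URI base) {
        String path = base.getPath();
        if (path == null || path.endsWith("/")) {
            return base; // opaque URI, or already ends with a slash
        }
        try {
            return new URI(base.getScheme(), base.getAuthority(), path + "/",
                    base.getQuery(), base.getFragment());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Resolves 'other' against 'base' as if 'base' were a directory.
    static URI resolveAsDirectory(URI base, URI other) {
        return asDirectory(base).resolve(other);
    }

    public static void main(String[] args) {
        URI workingDir = URI.create("hdfs://localhost:8020/user/me");
        // Relative path: appended below the working directory.
        System.out.println(resolveAsDirectory(workingDir, URI.create("mytempdir")));
        // Absolute path: replaces the base path, keeps scheme and authority.
        System.out.println(resolveAsDirectory(workingDir, URI.create("/tmp/x")));
        // Different scheme: returned unchanged.
        System.out.println(resolveAsDirectory(workingDir, URI.create("file:///tmp/x")));
    }
}
```

With a Hadoop {{FileSystem}}, this could presumably be called as {{DirectoryResolve.resolveAsDirectory(fs.getWorkingDirectory().toUri(), relativeURI)}}, though callers would still need to know that the base is a directory, which is exactly the information the trailing slash would otherwise carry.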


