[ https://issues.apache.org/jira/browse/HADOOP-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597672#action_12597672 ]
Chris Douglas commented on HADOOP-3173:
---------------------------------------
I mostly agree with Hairong; this is easy to do programmatically, and while
there are a few alternatives (different escape character, URI encoding, new
"literal" FsShell commands, etc), most appear to make the general case worse to
accommodate a fairly esoteric use.
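For concreteness, "easy to do programmatically" means a caller that knows the path is literal can skip glob expansion and go through the FileSystem API directly; a minimal sketch, not a proposed change:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteLiteralStar {
  public static void main(String[] args) throws Exception {
    // FileSystem.delete takes its Path argument literally; only the shell and
    // globStatus expand glob metacharacters, so the directory actually named
    // "*" is removed without touching its siblings.
    FileSystem fs = FileSystem.get(new Configuration());
    fs.delete(new Path("/user/rajive/a/*"), true);  // recursive delete
  }
}
{code}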
On the other hand, there are only a few places (FsShell and FileInputFormat,
mainly) where we call globStatus, and in each case a String is converted to a
Path before being converted back into a String in globStatus. Without the
conversion, the pattern syntax can mandate that the path separator must be '/'
independent of the Path syntax. Unfortunately, actually effecting this change
is awkward, primarily because one must still create a Path of the glob string
to obtain the FileSystem to resolve it against. If the glob string creates a
Path to be resolved against a FileSystem other than the default, then the
scheme, authority, etc. must be excised from the original string to preserve
the escaping, etc., which will ultimately duplicate much of the URI parsing
that's already happening in Path. Particularly for FileInputFormat and its
users, pulling out all the Path dependencies (i.e. changing users of the
globbing API) is a huge job with a modest payback.
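To make that round trip concrete, the current flow looks roughly like this (illustrative shape only, not the exact FsShell code):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GlobRoundTrip {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String arg = "/user/rajive/a/*";          // what the user typed (a String)
    Path p = new Path(arg);                   // built mainly to find the FileSystem
    FileSystem fs = p.getFileSystem(conf);    // scheme/authority resolved via the Path
    FileStatus[] matches = fs.globStatus(p);  // globStatus recovers the pattern String
                                              // from p.toUri().getPath() before matching
  }
}
{code}
Any escaping or normalization decision has already been made by the time the pattern String comes back out of the Path.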
Since Path(String) already isolates this segment, we could introduce
Path::getRawPath that would preserve the path before Path::normalizePath and
URI::normalize. With this, globStatus would resolve Path::getRawPath instead of
p.toUri().getPath(). Unfortunately, this would mean that globStatus(p) might
return different results than globStatus(new Path(p.toString())), which means
FileInputFormat would still have this issue. Even if Path(Path, String) and
variants preserved a raw path, its semantics would be unclear. In Path(Path,
String), is the raw path only equal to the raw path from the second arg if it is
absolute? Is the raw path from the first arg preserved in some way? We could
just assert that the raw path is only different from p.toUri().getPath() if it
was created with Path(String), but this could be confusing when creating globs
from a base path (i.e. Path(Path, String) or possibly more confusing,
Path(String, Path)). The URI normalization also removes all the ".." and "."
entries in the Path, which the regexp would then have to handle (e.g.
"a/b/../c*" is resolved to "a/c*" now, but using the raw path, GlobFilter would
accept "a/b/dd/c" since '.' matches GlobFilter::PAT_ANY). That said,
FileInputFormats and all Strings that were once Paths wouldn't have to deal
with this, while utilities like FsShell could match "a/b/../c" as regexps,
which might not be a bad thing.
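To illustrate that last point with plain java.util.regex standing in for the glob-to-regexp translation (simplified; the real GlobFilter handles more than this):
{code}
import java.util.regex.Pattern;

public class DotDotGlobDemo {
  public static void main(String[] args) {
    // Normalized pattern used today: "a/b/../c*" has already become "a/c*".
    System.out.println(Pattern.compile("a/c.*").matcher("a/b/dd/c").matches());      // false
    // Raw pattern: '*' becomes ".*" and each literal '.' stays a regexp '.'
    // (GlobFilter's PAT_ANY), so ".." happily matches "dd".
    System.out.println(Pattern.compile("a/b/../c.*").matcher("a/b/dd/c").matches()); // true
  }
}
{code}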
If we want to fix this, I'd propose adding Path::getRawPath which would be used
in FileSystem::globStatus, but could only be different from
p.toUri().getPath() when the Path was created from a String. This covers cases
where one wants to create a Path regexp manually and use it as a glob (as in
FsShell), but should not change behavior elsewhere.
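As a sketch of what that could look like (hypothetical; none of this exists in org.apache.hadoop.fs.Path today, and the names are illustrative):
{code}
import java.net.URI;

// Hypothetical model of the proposal, deliberately separate from the real Path.
class RawAwarePath {
  private final URI uri;        // normalized, as Path builds it today
  private final String rawPath; // pre-normalization path; only set for Path(String)

  // Mirrors Path(String): remember the path before normalization runs.
  RawAwarePath(String pathString) {
    // The real Path would keep only the path segment, after the scheme and
    // authority are split off; URI.create here just stands in for Path's own
    // URI construction and normalization.
    this.rawPath = pathString;
    this.uri = URI.create(pathString).normalize();
  }

  // Mirrors Path(Path, String) and the other constructors: no raw path kept,
  // so getRawPath() degenerates to today's behavior.
  RawAwarePath(URI resolved) {
    this.rawPath = null;
    this.uri = resolved;
  }

  /** What globStatus would resolve instead of toUri().getPath(). */
  String getRawPath() {
    return rawPath != null ? rawPath : uri.getPath();
  }
}
{code}
With that, FsShell would hand globStatus exactly the pattern the user typed, and Paths built any other way would behave as they do now.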
Thoughts?
> inconsistent globbing support for dfs commands
> ----------------------------------------------
>
> Key: HADOOP-3173
> URL: https://issues.apache.org/jira/browse/HADOOP-3173
> Project: Hadoop Core
> Issue Type: Bug
> Components: dfs
> Environment: Hadoop 0.16.1
> Reporter: Rajiv Chittajallu
> Fix For: 0.18.0
>
>
> hadoop dfs -mkdir /user/*/bar creates a directory "/user/*/bar" and you can't
> delete /user/* as -rmr expands the glob
> $ hadoop dfs -mkdir /user/rajive/a/*/foo
> $ hadoop dfs -ls /user/rajive/a
> Found 4 items
> /user/rajive/a/* <dir> 2008-04-04 16:09 rwx------ rajive users
> /user/rajive/a/b <dir> 2008-04-04 16:08 rwx------ rajive users
> /user/rajive/a/c <dir> 2008-04-04 16:08 rwx------ rajive users
> /user/rajive/a/d <dir> 2008-04-04 16:08 rwx------ rajive users
> $ hadoop dfs -ls /user/rajive/a/*
> /user/rajive/a/*/foo <dir> 2008-04-04 16:09 rwx------ rajive users
> $ hadoop dfs -rmr /user/rajive/a/*
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/*
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/b
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/c
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/d
> I am not able to escape '*' from being expanded.
> $ hadoop dfs -rmr '/user/rajive/a/*'
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/*
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/b
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/c
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/d
> $ hadoop dfs -rmr '/user/rajive/a/\*'
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/*
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/b
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/c
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/d
> $ hadoop dfs -rmr /user/rajive/a/\*
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/*
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/b
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/c
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/d