[ https://issues.apache.org/jira/browse/HADOOP-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597672#action_12597672 ]

Chris Douglas commented on HADOOP-3173:
---------------------------------------

I mostly agree with Hairong; this is easy to do programmatically, and while 
there are a few alternatives (different escape character, URI encoding, new 
"literal" FsShell commands, etc), most appear to make the general case worse to 
accommodate a fairly esoteric use.
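
For concreteness, the programmatic route is just to hand the literal Path to 
FileSystem directly, since only FsShell and globStatus interpret '*'. A minimal 
sketch (the class name is mine, and I'm assuming the delete(Path, boolean) 
overload is available):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Deletes the literal directory "/user/rajive/a/*". FileSystem::delete
// takes the Path verbatim; only FsShell and globStatus treat '*' as a
// pattern character, so no glob expansion happens here.
public class DeleteLiteralStar {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    fs.delete(new Path("/user/rajive/a/*"), true); // recursive
  }
}
{code}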

On the other hand, there are only a few places (FsShell and FileInputFormat, 
mainly) where we call globStatus, and in each case a String is converted to a 
Path before being converted back into a String in globStatus. Without the 
conversion, the pattern syntax can mandate that the path separator must be '/' 
independent of the Path syntax. Unfortunately, actually effecting this change 
is awkward, primarily because one must still create a Path of the glob string 
to obtain the FileSystem to resolve it against. If the glob string creates a 
Path to be resolved against a FileSystem other than the default, then the 
scheme, authority, etc. must be excised from the original string to preserve 
the escaping, etc., which will ultimately duplicate much of the URI parsing 
that's already happening in Path. Particularly for FileInputFormat and its 
users, pulling out all the Path dependencies (i.e. changing users of the 
globbing API) is a huge job with a modest payback.
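
To make the round-trip explicit, this is roughly the shape of what happens 
today (a sketch, not the actual FsShell code):

{code}
import org.apache.hadoop.fs.Path;

// Sketch of the round-trip described above: the caller's String becomes a
// Path, and globStatus recovers a String from it via toUri().getPath().
// Any escaping convention in the user's original String has to survive
// Path::normalizePath and URI normalization to reach the glob matcher.
public class GlobRoundTrip {
  public static void main(String[] args) {
    String userPattern = "/user/rajive/a/\\*"; // what the user typed
    Path p = new Path(userPattern);            // what FsShell constructs
    String whatGlobSees = p.toUri().getPath(); // what globStatus matches on
    System.out.println(whatGlobSees);
  }
}
{code}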

Since Path(String) already isolates this segment, we could introduce 
Path::getRawPath that would preserve the path before Path::normalizePath and 
URI::normalize. With this, globStatus would resolve Path::getRawPath instead of 
p.toUri().getPath(). Unfortunately, this would mean that globStatus(p) might 
return different results from globStatus(new Path(p.toString())), which means 
FileInputFormat would still have this issue. Even if Path(Path, String) and
variants preserved a raw path, its semantics would be unclear. In Path(Path, 
String), is the raw path equal to the raw path from the second arg only if it 
is absolute? Is the raw path from the first arg preserved in some way? We could
just assert that the raw path is only different from p.toUri().getPath() if it 
was created with Path(String), but this could be confusing when creating globs 
from a base path (i.e. Path(Path, String) or possibly more confusing, 
Path(String, Path)). The URI normalization also removes all the ".." and "." 
entries in the Path, which the regexp would then have to handle (e.g. 
"a/b/../c*" is resolved to "a/c*" now, but using the raw path, GlobFilter would 
accept "a/b/dd/c" since '.' matches GlobFilter::PAT_ANY). That said, 
FileInputFormat and any String that was once a Path wouldn't have to deal 
with this, while utilities like FsShell could match "a/b/../c" as a regexp, 
which might not be a bad thing.
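
The normalization is easy to see in isolation (a sketch of the behavior 
described above):

{code}
import org.apache.hadoop.fs.Path;

// Path construction runs URI normalization, so the ".." is resolved
// before the glob code ever sees the string and GlobFilter matches
// against "a/c*". With a raw path the matcher would instead see
// "a/b/../c*", where each '.' is PAT_ANY and so matches "a/b/dd/c".
public class DotDotNormalization {
  public static void main(String[] args) {
    Path p = new Path("a/b/../c*");
    System.out.println(p.toUri().getPath()); // prints "a/c*"
  }
}
{code}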

If we want to fix this, I'd propose adding Path::getRawPath which would be used 
in FileSystem::globStatus, but could only be different from 
p.toUri().getPath() when the Path was created from a String. This covers cases
where one wants to create a Path regexp manually and use it as a glob (as in 
FsShell), but should not change behavior elsewhere.
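
Roughly the shape I have in mind, as a hypothetical sketch (the names and 
wiring are illustrative, not the actual org.apache.hadoop.fs.Path internals):

{code}
import java.net.URI;

// Hypothetical sketch of the proposal, not the real Path implementation.
public class RawPathSketch {
  private final URI uri;        // normalized, as Path stores today
  private final String rawPath; // non-null only for the String constructor

  public RawPathSketch(String pathString) {
    // The real Path(String) isolates scheme/authority first; a plain
    // path is assumed here for brevity.
    this.rawPath = pathString;                     // pre-normalization
    this.uri = URI.create(pathString).normalize(); // what Path keeps today
  }

  public RawPathSketch(URI uri) { // e.g. from Path(Path, String) resolution
    this.rawPath = null;          // only the String form keeps a raw path
    this.uri = uri.normalize();
  }

  /** The pre-normalization path if constructed from a String, otherwise
   *  exactly what toUri().getPath() would return. */
  public String getRawPath() {
    return rawPath != null ? rawPath : uri.getPath();
  }

  public static void main(String[] args) {
    RawPathSketch p = new RawPathSketch("a/b/../c*");
    System.out.println(p.getRawPath());  // "a/b/../c*"
    System.out.println(p.uri.getPath()); // "a/c*"
  }
}
{code}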

Thoughts?

> inconsistent globbing support for dfs commands
> ----------------------------------------------
>
>                 Key: HADOOP-3173
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3173
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>         Environment: Hadoop 0.16.1
>            Reporter: Rajiv Chittajallu
>             Fix For: 0.18.0
>
>
> hadoop dfs -mkdir /user/*/bar creates a directory "/user/*/bar" and you can't 
> delete /user/* as -rmr expands the glob
> $ hadoop dfs -mkdir /user/rajive/a/*/foo
> $ hadoop dfs -ls /user/rajive/a
> Found 4 items
> /user/rajive/a/*      <dir>           2008-04-04 16:09        rwx------       rajive  users
> /user/rajive/a/b      <dir>           2008-04-04 16:08        rwx------       rajive  users
> /user/rajive/a/c      <dir>           2008-04-04 16:08        rwx------       rajive  users
> /user/rajive/a/d      <dir>           2008-04-04 16:08        rwx------       rajive  users
> $ hadoop dfs -ls /user/rajive/a/*
> /user/rajive/a/*/foo  <dir>           2008-04-04 16:09        rwx------       rajive  users
> $ hadoop dfs -rmr /user/rajive/a/*
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/*
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/b
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/c
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/d
> I am not able to escape '*' to keep it from being expanded.
> $ hadoop dfs -rmr '/user/rajive/a/*'
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/*
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/b
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/c
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/d
> $ hadoop dfs -rmr  '/user/rajive/a/\*'
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/*
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/b
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/c
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/d
> $ hadoop dfs -rmr  /user/rajive/a/\* 
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/*
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/b
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/c
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/d

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.