I have a directory with per-week subdirectories.
Currently, I have to specify a separate -input for each subdirectory, each with a glob pattern (the data start in 2004 and are supposed to be added weekly, on an ongoing basis). It is possible to flatten the structure, but is there a big conceptual or implementation problem with general globbing?

On Dec 4, 2006, at 11:41 PM, eric baldeschwieler (JIRA) wrote:

[ http://issues.apache.org/jira/browse/HADOOP-619?page=comments#action_12455532 ]

eric baldeschwieler commented on HADOOP-619:
--------------------------------------------

Perhaps we should just limit each argument to either a glob or a single directory, and simply drop directories from glob matches? This seems fairly simple and not too restrictive in practice.

I agree that if a directory is used we should exclude files starting with "_". This will allow us to put metadata in output directories. I think we should also simply exclude subdirectories in input directories. Again, I doubt this will prove restrictive in practice.

It seems to me we should error out if any glob matches no files or a listed input directory is not present. Perhaps we could provide another switch for an optional input, in case users actually want a job to run even if an input spec doesn't match any input.
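
The rules in this comment can be sketched roughly as follows. This is only an illustration, not Hadoop code: it uses the local filesystem as a stand-in for DFS, and the helper name `resolve_input` and its `optional` flag are hypothetical.

```python
import glob
import os


def resolve_input(spec, optional=False):
    """Expand one input argument into a sorted list of files.

    Hypothetical helper; the local filesystem stands in for DFS.
    """
    if os.path.isdir(spec):
        # Directory argument: take its files, skipping "_"-prefixed
        # metadata files and all subdirectories.
        return sorted(
            os.path.join(spec, name)
            for name in os.listdir(spec)
            if not name.startswith("_")
            and os.path.isfile(os.path.join(spec, name))
        )
    # Anything else is treated as a glob; directories are dropped
    # from the matches. A missing directory also ends up here and
    # matches nothing.
    files = sorted(p for p in glob.glob(spec) if os.path.isfile(p))
    if not files and not optional:
        raise FileNotFoundError(f"input spec matched no input: {spec}")
    return files
```

Under these assumptions, a per-week layout could be covered by one argument such as `data/*/part-*` rather than one -input per subdirectory.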

Unify Map-Reduce and Streaming to take the same globbed input specification
---------------------------------------------------------------------------

                Key: HADOOP-619
                URL: http://issues.apache.org/jira/browse/HADOOP-619
            Project: Hadoop
         Issue Type: Improvement
         Components: mapred
           Reporter: eric baldeschwieler
        Assigned To: Sanjay Dahiya

Right now streaming input is specified very differently from other map-reduce input. It would be good if these two apps could take much more similar input specs. In particular, -input in streaming expects a file or glob pattern, while MR takes a directory. It would be cool if both could take a glob pattern of files and if both took a directory by default (with some pattern excluded to allow logs, metadata and other framework output to be safely stored). We want to be sure that MR input is backward compatible over this change. I propose that either a single file or a single directory should be accepted as an input. Globs should only match directories if the pattern is '/'-terminated, to avoid massive inputs specified by mistake.
Thoughts?
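
One way to read the '/'-terminated rule in the description, as a minimal sketch only: the function name `expand_glob` is hypothetical, and Python's `glob` on the local filesystem stands in for DFS globbing.

```python
import glob
import os


def expand_glob(pattern):
    """Match directories only when the pattern ends in '/'.

    Hypothetical helper illustrating the proposed rule.
    """
    matches = glob.glob(pattern)
    if pattern.endswith("/"):
        # '/'-terminated pattern: the user explicitly asked for directories.
        return sorted(m for m in matches if os.path.isdir(m))
    # Otherwise directories are ignored, so a stray "*" cannot pull in
    # whole directory trees by mistake.
    return sorted(m for m in matches if os.path.isfile(m))
```

With this reading, `data/*/` would select the per-week directories, while `data/*` would select only plain files directly under `data`.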



