I have a directory with per-week subdirectories.
Currently, I have to specify a separate -input for each subdirectory, with
a glob pattern (the data start in 2004 and are supposed to be added
weekly, ongoing).
It is possible to flatten the structure, but is there a big conceptual
or implementation problem with general globbing?
On Dec 4, 2006, at 11:41 PM, eric baldeschwieler (JIRA) wrote:
[ http://issues.apache.org/jira/browse/HADOOP-619?page=comments#action_12455532 ]
eric baldeschwieler commented on HADOOP-619:
--------------------------------------------
Perhaps we should just limit each argument to either a glob or a single
directory, and simply drop directories from glob matches? This seems
fairly simple and not too restrictive in practice.
I agree that if a directory is used we should exclude files starting
with "_". This will allow us to put metadata in output directories.
I think we should also simply exclude subdirectories in input
directories. Again, I doubt this will prove restrictive in practice.
It seems to me we should error out if any glob matches no files or a
listed input directory is not present. Perhaps we could provide
another switch for an optional input in case users actually want a job
to run even if an input spec doesn't match any input.
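
A minimal sketch of the rules proposed in this comment, written in Python
for brevity rather than against Hadoop's actual API. The function name
resolve_input and its optional flag are hypothetical, and local
filesystem calls stand in for DFS operations:

```python
import glob
import os


def resolve_input(spec, optional=False):
    """Expand one input argument into a sorted list of plain files.

    Hypothetical helper illustrating the proposed rules; not Hadoop code.
    """
    if os.path.isdir(spec):
        # Directory argument: take its plain files, skipping "_"-prefixed
        # names (reserved for metadata) and any subdirectories.
        files = [
            os.path.join(spec, name)
            for name in os.listdir(spec)
            if not name.startswith("_")
            and os.path.isfile(os.path.join(spec, name))
        ]
        return sorted(files)

    # Anything else is treated as a glob; directories it matches are dropped.
    files = [p for p in glob.glob(spec) if os.path.isfile(p)]
    if not files and not optional:
        # Covers both a glob with no matches and a missing directory.
        raise FileNotFoundError("input spec matched no files: " + spec)
    return sorted(files)
```

With optional=True the same call returns an empty list instead of
raising, which corresponds to the extra switch suggested above.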
Unify Map-Reduce and Streaming to take the same globbed input specification
---------------------------------------------------------------------------
Key: HADOOP-619
URL: http://issues.apache.org/jira/browse/HADOOP-619
Project: Hadoop
Issue Type: Improvement
Components: mapred
Reporter: eric baldeschwieler
Assigned To: Sanjay Dahiya
Right now streaming input is specified very differently from other
map-reduce input. It would be good if these two apps could take much
more similar input specs.
In particular, -input in streaming expects a file or glob pattern
while MR takes a directory. It would be cool if both could take a
glob pattern of files and if both took a directory by default (with
some pattern excluded to allow logs, metadata and other framework
output to be safely stored).
We want to be sure that MR input is backward compatible over this
change. I propose that a single file or a single directory should be
accepted as an input. Globs should only match directories if the
pattern is '/' terminated, to avoid massive inputs specified by
mistake.
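
The '/'-termination rule could be sketched as follows, again in Python
against the local filesystem rather than Hadoop's API. The name
expand_glob is hypothetical, and the sketch assumes that a
'/'-terminated pattern matches only directories, which the proposal
does not state explicitly:

```python
import glob
import os


def expand_glob(pattern):
    """Return sorted matches: directories only if the pattern ends in '/'.

    Hypothetical helper illustrating the proposal; not Hadoop code.
    """
    want_dirs = pattern.endswith("/")
    matches = glob.glob(pattern.rstrip("/"))
    if want_dirs:
        # Assumption: a trailing '/' selects directories and nothing else.
        return sorted(p for p in matches if os.path.isdir(p))
    # Without the trailing '/', directories never match, so a broad
    # pattern cannot pull in whole subtrees by accident.
    return sorted(p for p in matches if os.path.isfile(p))
```

Under this reading, a pattern such as data/* picks up only the files
directly under data, while data/*/ has to be written deliberately to
select the per-week subdirectories themselves.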
Thoughts?