Is there a reason to split "-input" and "-input-filter"?
If so, please make sure that there is a way to have
(a) a different "-input-filter" for different "-input",
(b) multiple "-input-filter" for the same "-input" (special case of (a)), and
(c) wild cards for specifying directories in "-input".
On Dec 4, 2006, at 2:17 PM, Owen O'Malley (JIRA) wrote:
[ http://issues.apache.org/jira/browse/HADOOP-619?page=comments#action_12455431 ]
Owen O'Malley commented on HADOOP-619:
--------------------------------------
I think that streaming should stop using the filename globbing for
--input and instead list a set of input directories.
I think that the InputFormatBase should support a regex filter on
filenames that should default to something like "[^_].*" so that any
file that starts with an "_" is not treated as input. This will allow
us to put things like _LOGS as a filename to store the log files for a
job.
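To illustrate the default filter proposed above, here is a minimal sketch in Python (not Hadoop code). The names filter_inputs and DEFAULT_INPUT_FILTER are hypothetical, and it assumes the regex must match the whole filename:

```python
import re

# Default pattern quoted in the comment above: any name that does not
# start with "_". The constant name is hypothetical.
DEFAULT_INPUT_FILTER = r"[^_].*"


def filter_inputs(filenames, pattern=DEFAULT_INPUT_FILTER):
    """Return the filenames whose entire name matches the regex.

    Assumption: the filter applies to the bare filename (no directory
    part) and must match all of it, not just a prefix.
    """
    regex = re.compile(pattern)
    return [name for name in filenames if regex.fullmatch(name)]


if __name__ == "__main__":
    listing = ["part-00000", "_LOGS", "data_1.txt", "_tmp"]
    print(filter_inputs(listing))
```

Under that assumption, _LOGS and _tmp are skipped, while data_1.txt is kept because only a leading "_" excludes a file.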
Streaming should then support the regex filename filters using
"--input-filter" with a regex that filters the filenames.
Unify Map-Reduce and Streaming to take the same globbed input specification
---------------------------------------------------------------------------
Key: HADOOP-619
URL: http://issues.apache.org/jira/browse/HADOOP-619
Project: Hadoop
Issue Type: Improvement
Components: mapred
Reporter: eric baldeschwieler
Assigned To: Sanjay Dahiya
Right now streaming input is specified very differently from other
map-reduce input. It would be good if these two apps could take much
more similar input specs.
In particular -input in streaming expects a file or glob pattern
while MR takes a directory. It would be cool if both could take a
glob pattern of files and if both took a directory by default (with
some pattern excluded to allow logs, metadata and other framework
output to be safely stored).
We want to be sure that MR input is backward compatible over this
change. I propose that either a single file or a single directory
should be accepted as an input. Globs should only match directories
if the pattern is '/' terminated, to avoid massive inputs specified
by mistake.
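One possible reading of the '/' rule, sketched in Python rather than Hadoop code. The function select_inputs and the dict-based directory listing are hypothetical, and the sketch assumes that a '/'-terminated glob matches directories only, while any other glob matches files only:

```python
from fnmatch import fnmatchcase

GLOB_CHARS = set("*?[")


def select_inputs(pattern, entries):
    """Pick inputs from one directory listing.

    entries maps each name in the directory to True if it is a
    directory, False if it is a file (a stand-in for a real listing).
    """
    if not GLOB_CHARS & set(pattern):
        # No wildcard: a single file or a single directory is accepted.
        name = pattern.rstrip("/")
        return [name] if name in entries else []
    # Wildcard: match directories only when the pattern ends with '/'.
    want_dirs = pattern.endswith("/")
    core = pattern[:-1] if want_dirs else pattern
    return sorted(
        name
        for name, is_dir in entries.items()
        if is_dir == want_dirs and fnmatchcase(name, core)
    )


if __name__ == "__main__":
    listing = {"part-0": False, "part-1": False, "parts": True, "_LOGS": True}
    print(select_inputs("part*", listing))
    print(select_inputs("part*/", listing))
```

With this listing, "part*" selects only the two files, and the directory "parts" is pulled in only by the explicit "part*/".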
Thoughts?
--
This message is automatically generated by JIRA.