+1 on Runping's comment.
It is very important to report which input files were matched to the
user (as output of the Hadoop Streaming command).
Specifically:
-- each wildcard parameter that does not match any files should be
explicitly reported, but should not cause a failure
-- an exact (non-wildcard) parameter that does not match any file or
directory has to be reported and CAUSE a failure
-- for every directory, the number of files in the directory should be
reported
-- if, after successful wildcard matching, the input has 0 files, the
command should fail.
Failure implies:
-- exiting with a non-0 return status
-- producing an error message that explains the failure. No unhandled
exceptions.
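
To make the rules above concrete, here is a rough sketch in Python. It is
only an illustration under stated assumptions: it checks the local
filesystem rather than DFS, it treats *, ? and [ as the wildcard
characters, and the names check_inputs and main are hypothetical, not part
of Hadoop Streaming.

```python
import glob
import os
import sys

WILDCARD_CHARS = set("*?[")  # assumption: shell-style glob syntax


def is_wildcard(spec):
    """True if the input spec contains a glob metacharacter."""
    return any(ch in WILDCARD_CHARS for ch in spec)


def count_files(path):
    """Count regular files directly inside a directory (non-recursive)."""
    return sum(1 for name in os.listdir(path)
               if os.path.isfile(os.path.join(path, name)))


def check_inputs(specs):
    """Return (report_lines, errors, total_files) for the given input specs."""
    report, errors, total = [], [], 0
    for spec in specs:
        if is_wildcard(spec):
            matches = sorted(glob.glob(spec))
            if not matches:
                # Unmatched wildcard: report it, but do not fail on it alone.
                report.append(f"WARNING: pattern '{spec}' matched no files")
                continue
        elif os.path.exists(spec):
            matches = [spec]
        else:
            # Unmatched exact path: report it and mark the run as failed.
            errors.append(f"ERROR: input '{spec}' does not exist")
            continue
        for path in matches:
            if os.path.isdir(path):
                n = count_files(path)
                report.append(f"directory '{path}': {n} file(s)")
                total += n
            else:
                report.append(f"file '{path}'")
                total += 1
    if not errors and total == 0:
        # Nothing left to process after matching: this is a failure too.
        errors.append("ERROR: input specification matched 0 files")
    return report, errors, total


def main(specs):
    """Print the report and return a process exit status (0 = success)."""
    report, errors, _ = check_inputs(specs)
    for line in report:
        print(line)
    for line in errors:
        print(line, file=sys.stderr)
    return 1 if errors else 0
```

A caller would end with sys.exit(main(sys.argv[1:])), so that every failure
surfaces as a message plus a non-0 status rather than as a stack trace.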
-- ab
On Dec 1, 2006, at 8:52 AM, Runping Qi (JIRA) wrote:
[ http://issues.apache.org/jira/browse/HADOOP-619?page=comments#action_12454951 ]
Runping Qi commented on HADOOP-619:
-----------------------------------
We should check the existence of the path(s), but leave the pattern
matching to InputFormatBase class.
We have three cases to consider: all paths exist, none exist, and some
do, some do not.
Cases 1 and 2 are easy. For case 3, I think it is reasonable that the job
proceeds, but records that in the job
status.
When a job starts, we should probably record all the input directories and
the number of files under each that match the pattern. If no file matches the
pattern, the job should stop. Otherwise, the job generates splits from
the matched files and proceeds.
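
The three cases described above could be sketched roughly as follows in
Python. This is an illustration only: classify_inputs and plan_job are
hypothetical names, the exists callback stands in for whatever filesystem
check the framework would use, and the "status notes" list is an assumed
stand-in for the job status.

```python
import os


def classify_inputs(paths, exists=os.path.exists):
    """Split paths into present/missing and name the resulting case."""
    present = [p for p in paths if exists(p)]
    missing = [p for p in paths if not exists(p)]
    if not present:
        case = "none_exist"
    elif not missing:
        case = "all_exist"
    else:
        case = "some_missing"
    return case, present, missing


def plan_job(paths, exists=os.path.exists):
    """Return (proceed, usable_paths, status_notes).

    The job proceeds unless no path exists; missing paths are recorded
    as notes instead of aborting the job.
    """
    case, present, missing = classify_inputs(paths, exists)
    proceed = case != "none_exist"
    status_notes = [f"input path not found: {p}" for p in missing]
    return proceed, present, status_notes
```

Pattern matching is deliberately left out here, in line with leaving it to
the InputFormatBase class.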
Unify Map-Reduce and Streaming to take the same globbed input specification
---------------------------------------------------------------------------
Key: HADOOP-619
URL: http://issues.apache.org/jira/browse/HADOOP-619
Project: Hadoop
Issue Type: Improvement
Components: mapred
Reporter: eric baldeschwieler
Assigned To: Sanjay Dahiya
Right now streaming input is specified very differently from other
map-reduce input. It would be good if these two apps could take much
more similar input specs.
In particular, -input in streaming expects a file or glob pattern,
while MR takes a directory. It would be cool if both could take a
glob pattern of files and if both took a directory by default (with
some pattern excluded to allow logs, metadata and other framework
output to be safely stored).
We want to be sure that MR input is backward compatible over this
change. I propose that a single file should be accepted as an input
or a single directory. Globs should only match directories if the
pattern is '/' terminated, to avoid massive inputs specified by
mistake.
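
One possible reading of the '/'-termination rule, sketched in Python
against the local filesystem. The function name expand_input_glob is
hypothetical, and the sketch covers glob patterns only; a single exact
file or directory would be accepted as-is under the proposal.

```python
import glob
import os


def expand_input_glob(pattern):
    """Expand a glob so that it matches directories only when it ends in '/'.

    'logs/part*'  -> matching regular files only
    'logs/part*/' -> matching directories only
    """
    want_dirs = pattern.endswith("/")
    candidates = glob.glob(pattern.rstrip("/"))
    keep = os.path.isdir if want_dirs else os.path.isfile
    return sorted(p for p in candidates if keep(p))
```

With this rule, a pattern such as data/* cannot silently pull in whole
subdirectories; the user has to write data/*/ to ask for them.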
Thoughts?