InputFormatBase is used by some of the other input formats such as
SequenceFileInputFormat so changing it there will affect those other
classes as well. I don't know if that is what you want or not. I would
probably extend TextInputFormat (assuming the files are text logs
such as Apache logs and not XML files), override
areValidInputDirectories to check for files in the directories, and
override getSplits to return splits for only the files that you want
to process.
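
As a rough illustration of the selection step only, here is a minimal
sketch. It assumes plain java.nio.file in place of Hadoop's FileSystem
and InputFormat classes, and the names SelectedFileLister, isValidInput
and selectFiles are hypothetical, not part of Hadoop. The validity check
still accepts a directory, and the selection keeps only the named files
inside it, which is roughly what an overridden getSplits would have to
do before building splits.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the filtering logic; not a Hadoop class.
final class SelectedFileLister {

    // Mirrors the directory-only check: the input must be a directory.
    static boolean isValidInput(Path parentDir) {
        return Files.isDirectory(parentDir);
    }

    // Returns the regular files in parentDir whose names are in wantedNames.
    static List<Path> selectFiles(Path parentDir, Set<String> wantedNames)
            throws IOException {
        List<Path> selected = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(parentDir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                if (Files.isRegularFile(entry) && wantedNames.contains(name)) {
                    selected.add(entry);
                }
            }
        }
        // Directory listing order is unspecified, so sort for stable output.
        Collections.sort(selected);
        return selected;
    }
}
```

In a real subclass the set of wanted names would presumably come from
the job configuration, and each selected file would then be turned into
one or more splits.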
Dennis
Vetle Roeim wrote:
On Mon, 16 Oct 2006 16:24:15 +0200, Dennis Kubes
<[EMAIL PROTECTED]> wrote:
You could write your own InputFormat implementation that would check
files instead of directories (perhaps passing in the parent directory
of the files).
Oh, so this restriction is in the InputFormat? I see in
InputFormatBase.getSplits that the code just goes through input
directories and gets all the files there. Would it be ok to just
modify the code to handle files as well?
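
To make the question concrete, the change might look roughly like the
sketch below. It is only an assumption about the shape of the logic: it
uses java.nio.file rather than Hadoop's FileSystem API, and the class
InputPathExpander and its expand method are hypothetical names, not the
actual InputFormatBase code. A directory contributes all of its files,
as now, and a plain file contributes just itself.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; not the real InputFormatBase implementation.
final class InputPathExpander {

    // Expands each input path into the files that should be processed.
    static List<Path> expand(List<Path> inputs) throws IOException {
        List<Path> files = new ArrayList<>();
        for (Path input : inputs) {
            if (Files.isDirectory(input)) {
                // Directory: take every regular file directly inside it.
                List<Path> children = new ArrayList<>();
                try (DirectoryStream<Path> entries = Files.newDirectoryStream(input)) {
                    for (Path entry : entries) {
                        if (Files.isRegularFile(entry)) {
                            children.add(entry);
                        }
                    }
                }
                Collections.sort(children); // listing order is unspecified
                files.addAll(children);
            } else if (Files.isRegularFile(input)) {
                // Individual file: accept it as-is.
                files.add(input);
            } else {
                throw new NoSuchFileException(input.toString());
            }
        }
        return files;
    }
}
```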
The use case for this is if you have a directory containing multiple
files, but only want to operate on a few of those. In my case I have
log files from several servers, and while jobs are usually run on all
log files, this time I only want to run jobs on a subset.
We just did something similar to this for reading index files as an
InputFormat.
Dennis
Vetle Roeim wrote:
It seems that input to jobs is restricted to directories, and it is
impossible to add individual files -- JobConf calls
InputFormatBase.areValidInputDirectories, which checks that each
input path is a directory.
Why is this required? Is it possible to change it or work around it
(without copying the files into a separate directory)?
Thanks,
--Vetle Roeim
Opera Software ASA <URL: http://www.opera.com/ >
--Vetle Roeim
Team Manager, Information Systems
Opera Software ASA <URL: http://www.opera.com/ >