InputFormatBase is used by some of the other input formats, such as SequenceFileInputFormat, so changing it there will affect those classes as well. I don't know whether that is what you want. I would probably extend TextInputFormat (assuming the files are text logs such as Apache logs, not XML files), override areValidInputDirectories to check for files in the directories, and override getSplits to return splits for only the files that you want to process.
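
A rough sketch of the two pieces of logic described above, written in plain Java (java.nio.file) rather than against the Hadoop API, so none of it should be read as the actual TextInputFormat code. The class SelectedFileInputs and the methods isValidInput and selectFiles are hypothetical names chosen for illustration; in a real subclass the same checks would presumably live in the overridden areValidInputDirectories and getSplits.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Hadoop: it only illustrates the
// "validate, then keep a subset of files" idea on the local file system.
class SelectedFileInputs {

    // Stand-in for the validity check: the input is accepted only if it is
    // a directory that contains every file name that was asked for.
    static boolean isValidInput(Path dir, Set<String> wanted) {
        if (!Files.isDirectory(dir)) {
            return false;
        }
        for (String name : wanted) {
            if (!Files.isRegularFile(dir.resolve(name))) {
                return false;
            }
        }
        return true;
    }

    // Stand-in for split generation: list the directory, but keep only the
    // regular files whose names are in the wanted set.
    static List<Path> selectFiles(Path dir, Set<String> wanted) throws IOException {
        List<Path> selected = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                if (Files.isRegularFile(entry) && wanted.contains(name)) {
                    selected.add(entry);
                }
            }
        }
        // Directory listing order is unspecified, so sort for stable output.
        Collections.sort(selected);
        return selected;
    }
}
```

With a directory holding logs from several servers, calling selectFiles with the names of the logs of interest would return only those paths, and the remaining files in the directory would be ignored.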

Dennis

Vetle Roeim wrote:
On Mon, 16 Oct 2006 16:24:15 +0200, Dennis Kubes <[EMAIL PROTECTED]> wrote:

You could write your own InputFormat implementation that would check files instead of directories (perhaps passing in the parent directory of the files).

Oh, so this restriction is in the InputFormat? I see in InputFormatBase.getSplits that the code just goes through input directories and gets all the files there. Would it be ok to just modify the code to handle files as well?

The use case for this is if you have a directory containing multiple files, but only want to operate on a few of those. In my case I have log files from several servers, and while jobs are usually run on all log files, this time I only want to run jobs on a subset.


We just did something similar to this for reading index files as an InputFormat.

Dennis

Vetle Roeim wrote:
It seems that input to jobs is restricted to directories, and it is impossible to add individual files -- JobConf calls InputFormatBase.areValidInputDirectories, which checks that each input path is a directory.

Why is this required? Is it possible to change it or work around it (without copying the files into a separate directory)?


Thanks,
--Vetle Roeim
Opera Software ASA <URL: http://www.opera.com/ >



--Vetle Roeim
Team Manager, Information Systems
Opera Software ASA <URL: http://www.opera.com/ >
