Enis, I was trying to understand how MultiFileInputFormat works but I could not.
My use case is:

* several small (a few megs) SequenceFiles as input files.

I need to make sure I don't end up with a Map task per input file. Ideally I
would like to get sets of input files of size X (the size of all the files in
the set) as one split.

Ideas are welcome.

A

On 10/15/07, Enis Soztutar <[EMAIL PROTECTED]> wrote:
>
> I'm not really sure if it helps but there is a MultiFileSplit and
> MultiFileInputFormat which is optimized for cases where numFiles >
> numMapTasks. Let me know if you have any further questions.
>
> Alejandro Abdelnur wrote:
> > The input for a M/R job consists of multiple files that are less than a
> > block size and the number of maps is the number of files.
> >
> > I would like to be able to control the number of maps in a way that I have
> > one map task for multiple files (for example, gluing together files up to a
> > block size).
> >
> > I don't want to use a M/R job to do that as it is expensive (extra IO ops:
> > read/write-read/write)
> >
> > I don't want to have a COPY program as this is still expensive (extra IO
> > ops: read/write)
> >
> > I know files are not that big, but this is the common case in my system and
> > this would mean increasing the number of IO significantly.
> >
> > I'd rather would want to have a custom InputSplit that takes multiple files
> > up to a given size, then I don't have any extra IO ops.
> >
> > Looking at the InputSplit the interfaces do not seem prepared to be able do
> > such thing (consolidating multiple files into a single split).
> >
> > Am I missing something on the APIs? Or another suggestion on how to achieve
> > the desired behavior?
> >
> > Thxs.
> >
> > Alejandro
> >
> >
>
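
To make the grouping I describe at the top of this message concrete (sets of input files whose total size is at most X, one set per split), here is a minimal sketch in plain Java. It is only an illustration: SplitPacker, pack and maxBytes are hypothetical names, the file sizes are made up, and it does not touch any Hadoop API. I assume something like this would sit inside the split computation of a custom input format, but how it would hook into MultiFileInputFormat is not shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: groups file paths so that each
// group's total size stays at or below maxBytes where possible.
final class SplitPacker {

    // Greedy, order-preserving packing: files are taken in the map's
    // iteration order and a new group starts when the limit would be exceeded.
    static List<List<String>> pack(Map<String, Long> fileSizes, long maxBytes) {
        List<List<String>> groups = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;

        for (Map.Entry<String, Long> e : fileSizes.entrySet()) {
            long size = e.getValue();
            // Close the current group if adding this file would exceed the limit.
            if (!current.isEmpty() && currentBytes + size > maxBytes) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            // A single file larger than maxBytes still gets a group of its own.
            current.add(e.getKey());
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Made-up file names and sizes (in bytes) for illustration only.
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("part-00000", 20L);
        sizes.put("part-00001", 30L);
        sizes.put("part-00002", 50L);
        sizes.put("part-00003", 10L);
        System.out.println(pack(sizes, 64L));
    }
}
```

The packing is greedy and keeps the input order, so it does not try to minimise the number of groups; sorting by size first would pack tighter at the cost of reordering the files.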
