On Thu, May 16, 2013 at 9:42 AM, Ángel Martínez González <[email protected]
> wrote:

> On 15/05/2013 20:15, Pat Ferrel wrote:
>
>  Scaling some jobs requires splitting files to get multiple mappers and in
>> other cases files must be combined into one to even run the job. Scaling is
>> one huge reason to use Mahout, so it seems like it should be easier and
>> input should universally be a dir of parts *or* single file.
>>
>
I think that is generally already true, if I interpret that statement
correctly. For the distributed jobs I know of, it doesn't matter much
whether there are one or more files. Bigger files are generally preferred by
Hadoop MR, which is a Hadoop-specific thing, but having a single file is by
no means a requirement for the MR-based stuff. Please let us know of
specific situations where you think this is not the case; it would help to
direct the focus.

The size of the files does matter for some blocking algorithms that need to
split the matrix into vertical blocks of some minimum size. E.g. SSVD needs
to be able to form vertical blocks at least k rows high, where k (the
decomposition rank) is typically no bigger than 100. In practice this means
that files falling under this restriction would be less than 1M in size,
which in the context of big data (say, 10G would typically be the starting
input size for the distributed SSVD application to be remotely interesting)
sounds like quite an OK restriction to have. Again, I would be interested to
know where you think this specifically might have been a BIG DATA issue
(i.e. where it is not about a fringe case of a toy input :)
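
To put a rough number on that restriction, here is a back-of-envelope
sketch. It assumes dense rows stored as 8-byte doubles and ignores any
SequenceFile or key overhead; the function names and the 1,000-column
example are purely illustrative and are not taken from Mahout.

```python
def min_block_bytes(k, n_cols, bytes_per_value=8):
    """Approximate bytes needed to hold k dense rows of n_cols values each."""
    return k * n_cols * bytes_per_value


def can_form_block(split_bytes, k, n_cols, bytes_per_value=8):
    """True if a split of split_bytes is large enough to hold one k-row block."""
    return split_bytes >= min_block_bytes(k, n_cols, bytes_per_value)


# Assumed example: rank k = 100 and a dense matrix with 1,000 columns.
k = 100
n_cols = 1000
print(min_block_bytes(k, n_cols))                 # 800000 bytes, under 1M
print(can_form_block(10 * 1024**3, k, n_cols))    # a 10G input easily qualifies
```

Under these assumptions only files smaller than roughly 800 KB would be
unable to form a block, which is why this mostly shows up with toy inputs.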

There are workaround techniques to fight that condition automatically (e.g.
the Pig loader has an option to merge multiple tiny files into a single
split on the fly), but overall I don't see evidence that this is much of an
issue in the ML world, so I generally don't view adding such an input format
option as a necessity. I haven't ever needed this, but you are welcome to
work on it and submit a patch :)
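
For anyone who wants to pick this up, the general idea of merging tiny
files into larger splits can be sketched as a greedy grouping. This is only
an illustration of the technique, not how Pig or Hadoop actually implement
it; the function name, the (name, size) input shape, and the target size
are all assumptions.

```python
def combine_small_files(files, target_split_bytes):
    """Greedily group consecutive (name, size) pairs into splits.

    A split is closed as soon as its total size reaches target_split_bytes.
    Any undersized leftover is folded into the last split so that no split
    falls below the target (unless the whole input is smaller than it).
    """
    splits = []
    current, current_size = [], 0
    for name, size in files:
        current.append(name)
        current_size += size
        if current_size >= target_split_bytes:
            splits.append(current)
            current, current_size = [], 0
    if current:
        if splits:
            splits[-1].extend(current)   # fold the tail into the last split
        else:
            splits.append(current)       # input smaller than one target split
    return splits


# Hypothetical part files with sizes in bytes.
parts = [("part-0", 600), ("part-1", 600), ("part-2", 600), ("part-3", 600)]
print(combine_small_files(parts, 1000))
```

With the sizes above and a target of 1000 bytes, the four parts end up in
two splits of two files each.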

thanks :)


>  +1 To this. It should be easier to switch between sequential and
> distributed mode of jobs.
>
