On Thu, May 16, 2013 at 9:42 AM, Ángel Martínez González <[email protected]> wrote:
> On 15/05/2013 20:15, Pat Ferrel wrote:
>> Scaling some jobs requires splitting files to get multiple mappers and in
>> other cases files must be combined into one to even run the job. Scaling is
>> one huge reason to use Mahout, so it seems like it should be easier and
>> input should universally be a dir of parts *or* single file.

I think that is generally already true, if I interpret that statement
correctly. For the distributed jobs that I know of, it doesn't matter much
whether there is one file or more. Bigger files are generally preferred by
Hadoop MR, which is a Hadoop-specific thing, but by no means is a single
file a requirement for the MR-based stuff (a minimal sketch of pointing a
job at a single file or a dir of parts is at the bottom of this mail).
Please let us know of specific situations where you think this is not the
case; it would help to direct the focus.

Size of the files does matter for some blocking algorithms that require
splitting the matrix into vertical blocks of some minimum size. E.g. SSVD
requires being able to form vertical blocks at least k rows high, where k
(the decomposition rank) is typically no bigger than 100. In practice this
means the files that fall under this restriction would be less than 1M in
size (rough arithmetic at the bottom), which in the context of big data
(say, 10G would typically be the starting input size for the distributed
SSVD application to be remotely interesting) sounds quite OK to have.
Again, I would be interested to know where you think this specifically
might have been a BIG DATA issue (i.e. where it is not about a fringe case
of a toy input :)

There are workaround techniques to fight that condition automatically
(e.g. the Pig loader has an option to merge multiple tiny files into a
single split on the fly; a manual merge sketch is also at the bottom), but
overall I don't see evidence that this is much of an issue in the ML world,
so I generally don't view adding such an input format option as a
necessity. I haven't ever needed this, but you are welcome to work on it
and submit a patch :) thanks :)

> +1 To this. It should be easier to switch between sequential and
> distributed mode of jobs.
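For reference, here is roughly what I mean by a dir of parts and a single
file being interchangeable on the MR side. This is an untested sketch; the
class name and paths are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DirOrFileInput {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "dir-or-file-demo");
    // A directory of part files and a single file are added the same way;
    // FileInputFormat enumerates the parts of a directory by itself.
    FileInputFormat.addInputPath(job, new Path(args[0])); // dir *or* file
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // ... set mapper/reducer/etc. as usual, then job.waitForCompletion(true)
  }
}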
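To make the "less than 1M" figure concrete, a back-of-envelope where the
row width is my own made-up assumption (not anything SSVD prescribes):

public class BlockSizeEnvelope {
  public static void main(String[] args) {
    int k = 100;             // decomposition rank, typically no bigger than 100
    int rowBytes = 1000 * 8; // hypothetical dense row: 1000 dims x 8 bytes
    // A file must hold at least k rows to form one vertical block, so under
    // these assumptions anything smaller than ~800,000 bytes is suspect.
    long minFileBytes = (long) k * rowBytes;
    System.out.println("minimum useful file size ~ " + minFileBytes + " bytes");
  }
}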
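And the manual workaround, if someone really hits the condition: just copy
the tiny sequence files into a single one before running the job. Untested
sketch; it assumes Text keys and Mahout VectorWritable values, so swap in
whatever key/value classes the input actually uses:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

public class MergeTinyParts {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path in = new Path(args[0]);   // dir of tiny part files
    Path out = new Path(args[1]);  // single merged sequence file
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, VectorWritable.class);
    Text key = new Text();
    VectorWritable value = new VectorWritable();
    for (FileStatus stat : fs.listStatus(in)) {
      String name = stat.getPath().getName();
      // skip subdirs and bookkeeping files like _SUCCESS / _logs
      if (stat.isDir() || name.startsWith("_") || name.startsWith(".")) {
        continue;
      }
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, stat.getPath(), conf);
      while (reader.next(key, value)) {
        writer.append(key, value); // copy every record into the single file
      }
      reader.close();
    }
    writer.close();
  }
}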
