Glad you asked, because I have been asking the same question myself while building a Text->Vector converter, where I need to iterate over the same data repeatedly, converting records to vectors using one chunk of the dictionary at a time. If I had the option of running multiple passes, it would have taken just a single MapReduce job. Instead, I have to do one pass over the data for every chunk of the dictionary that fits in memory. True, I could run n sequential jobs using an HDFS client on different servers, but the network data transfer wasn't worth it.
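To make the pattern concrete, here is a minimal sketch (in plain Python, not Hadoop code) of what I mean by one pass per dictionary chunk. All the function names and the toy data are illustrative, not from the actual converter:

```python
# Sketch: converting text records to sparse term-index vectors when
# only one dictionary chunk fits in memory at a time. Each chunk
# forces a full scan of the data; with multi-pass mappers this would
# collapse into a single MapReduce job.

def pass_over_data(records, chunk):
    """One full scan of the records, collecting term indices for
    terms covered by the current in-memory dictionary chunk."""
    partial = {}
    for rec_id, text in records:
        indices = [chunk[t] for t in text.split() if t in chunk]
        if indices:
            partial.setdefault(rec_id, []).extend(indices)
    return partial

def vectorize(records, dictionary, chunk_size):
    """Runs ceil(len(dictionary) / chunk_size) sequential passes,
    merging the partial results from each pass."""
    items = sorted(dictionary.items())
    vectors = {}
    for start in range(0, len(items), chunk_size):
        chunk = dict(items[start:start + chunk_size])
        for rec_id, idxs in pass_over_data(records, chunk).items():
            vectors.setdefault(rec_id, []).extend(idxs)
    return {r: sorted(v) for r, v in vectors.items()}

records = [(0, "cat dog"), (1, "dog fish")]
dictionary = {"cat": 0, "dog": 1, "fish": 2}
print(vectorize(records, dictionary, chunk_size=2))
# -> {0: [0, 1], 1: [1, 2]}
```

With a dictionary too large for memory, the number of scans of the data grows linearly with the number of chunks, which is exactly the cost that multiple mapper passes would avoid.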
Robin

On Fri, Jan 29, 2010 at 12:30 AM, Markus Weimer <[email protected]> wrote:

> Hi,
>
> I have a question about hadoop, which most likely someone in mahout
> must have solved before:
>
> Many online ML algorithms require multiple passes over data for best
> performance. When putting these algorithms on hadoop, one would want
> to run the code close to the data (same machine/rack). Mappers offer
> this data-local execution but do not offer means to run multiple times
> over the data. Of course, one could run the code outside of the hadoop
> mapreduce framework as a HDFS client, but that does not offer the
> data-locality advantage, in addition to not being scheduled through
> the hadoop schedulers.
>
> How is this solved in mahout?
>
> Thanks for any pointer,
>
> Markus

--
------ Robin Anil
Blog: http://techdigger.wordpress.com
-------
Mahout in Action - Mammoth Scale machine learning
Read Chapter 1 - Its Frrreeee
http://www.manning.com/owen
Try out Swipeball for iPhone
http://itunes.com/apps/swipeball
