Hi Markus,

  Are you saying you want something more sophisticated than just setting
your number of reducers equal to zero and then repeatedly running your
Map (minus Reduce) job on Hadoop?  The Mappers will go where the
data is, as you say, but if your mapper output then needs to be collected
in some place and aggregated, you'll need the shuffle + Reduce step.
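
Roughly what I have in mind, as a sketch (the TrainingMapper, the paths,
and the pass count below are all placeholders I made up, not anything out
of Mahout): the driver just loops, launching one map-only job per pass.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeMapOnly {

  // Hypothetical mapper: folds the records of its (data-local) split into
  // the model and emits whatever per-split output you want to keep.
  public static class TrainingMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws java.io.IOException, InterruptedException {
      // ... update your model from `value` here ...
      ctx.write(NullWritable.get(), value);
    }
  }

  public static void main(String[] args) throws Exception {
    int numPasses = 5;  // however many passes your algorithm wants
    for (int pass = 0; pass < numPasses; pass++) {
      Job job = new Job(new Configuration(), "training-pass-" + pass);
      job.setJarByClass(IterativeMapOnly.class);
      job.setMapperClass(TrainingMapper.class);
      job.setNumReduceTasks(0);            // map-only: no shuffle, no reduce
      job.setOutputKeyClass(NullWritable.class);
      job.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job, new Path("/data/training"));
      FileOutputFormat.setOutputPath(job, new Path("/tmp/pass-" + pass));
      if (!job.waitForCompletion(true)) {
        throw new RuntimeException("pass " + pass + " failed");
      }
    }
  }
}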

If your mapper output is very small (you're reading in the data set
multiple times, but training up a very small model), then you might not
need the reducers at all, and can transmit the model via side-channel
methods to all the nodes for the next pass.
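
For instance, the DistributedCache can do the broadcasting for you.  Again
just a sketch under assumptions I'm making up (the model path, file format,
and class names are placeholders): the driver registers last pass's model
file, Hadoop copies it to every task node, and each mapper loads it in
setup() before reading its split.

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class ModelBroadcast {

  // Mapper that picks up the node-local copy of the model before
  // touching its split.
  public static class ModelAwareMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void setup(Context ctx) throws IOException {
      Path[] cached =
          DistributedCache.getLocalCacheFiles(ctx.getConfiguration());
      // cached[0] is the local copy of last pass's model; deserialize it
      // here with whatever format you wrote it in.
    }
  }

  // Driver side: register the (small) model from the previous pass so
  // Hadoop ships it to every node before this pass's mappers start.
  public static Job configurePass(int pass) throws Exception {
    Configuration conf = new Configuration();
    DistributedCache.addCacheFile(
        new URI("/models/pass-" + (pass - 1) + "/model.bin"), conf);
    Job job = new Job(conf, "training-pass-" + pass);
    job.setMapperClass(ModelAwareMapper.class);
    job.setNumReduceTasks(0);
    // input/output paths, key/value classes etc. as in the sketch above
    return job;
  }
}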

Is this the kind of thing you're talking about?

  -jake

On Thu, Jan 28, 2010 at 11:00 AM, Markus Weimer
<[email protected]> wrote:

> Hi,
>
> I have a question about hadoop, which most likely someone in mahout
> must have solved before:
>
> Many online ML algorithms require multiple passes over data for best
> performance. When putting these algorithms on hadoop, one would want
> to run the code close to the data (same machine/rack). Mappers offer
> this data-local execution but do not offer means to run multiple times
> over the data. Of course, one could run the code outside of the hadoop
> mapreduce framework as a HDFS client, but that does not offer the
> data-locality advantage, in addition to not being scheduled through
> the hadoop schedulers.
>
> How is this solved in mahout?
>
> Thanks for any pointer,
>
> Markus
>