Dmitriy,

Are you sure that you can't defeat this sorting?

On Mon, Dec 5, 2011 at 2:05 PM, Dmitriy Lyubimov <[email protected]> wrote:

> What I usually do for small products is accumulate them in the map
> without combiners and then send them to the reducers to combine there.
> Using combiners creates a local sort for something you don't really
> want unless you can't keep it in memory.
>
> You also have the choice of forcing a single reducer to ensure one
> product. Or (what I do) build a loader that just goes over all the
> vectors and reduces them on the fly whenever needed.
>
> MR is fundamentally often a poor fit when it comes to linear algebra
> operations. This is a major deficiency as far as I see it, because it
> forces you to sort output that already has an inner structure one can
> capitalize on.
>
> Simple example: suppose you want to deinterlace a TV signal. You have
> two fields, one for the odd lines and another for the even lines.
> Naturally, what you do is create a simple merging algorithm that reads
> lines from each input alternately and outputs one single deinterlaced
> frame. O(n).
>
> Now, the MapReduce way to deal with it is to shuffle and sort all the
> lines by their line number and then output the total order: O(n*log n).
>
> So dealing with large blocks is like deinterlacing: you sort, but you
> absolutely don't have to, because you already have structure in the
> output (even if the blocks don't fit in memory).
>
>
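The deinterlacing merge described in the quoted message is a single linear pass over the two fields; a minimal standalone sketch (an illustration, not Mahout code):

```python
def deinterlace(odd_lines, even_lines):
    """Merge two fields (odd-numbered and even-numbered scan lines,
    each already in order) into one frame in a single O(n) pass.
    No sort is needed, because the inputs already carry the order."""
    frame = []
    for odd, even in zip(odd_lines, even_lines):
        frame.append(odd)
        frame.append(even)
    return frame
```

A shuffle-and-sort by line number would reach the same frame at O(n*log n); the merge exploits the pre-existing order instead.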
> On Mon, Dec 5, 2011 at 1:02 PM, Raphael Cendrillon
> <[email protected]> wrote:
> > Thanks for clarifying. I think we're all on the same page on this,
> although using different terms. I'll package up the job I currently have
> for this and submit a patch.
> >
> > By the way, currently I have the rows being added at the combiner, and
> then the results of the combiners added in a single reducer. Do you think
> this is sufficient, or should multiple reducers be used (per column) to
> further spread the load?
> >
> > On Dec 5, 2011, at 11:38 AM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> >> OK, column-wise mean (the mean taken over all rows).
> >>
> >> On Mon, Dec 5, 2011 at 11:00 AM, Ted Dunning <[email protected]>
> wrote:
> >>> Row-wise mean usually means that a mean of each row is computed.
> >>>
> >>> I think that most PCA users would want column-wise means for
> subtraction.
> >>>
> >>> On Mon, Dec 5, 2011 at 10:58 AM, Dmitriy Lyubimov <[email protected]>
> wrote:
> >>>
> >>>> We probably need a row-wise mean computation job anyway as a
> >>>> separate MR step. Wanna take a stab?
> >>>> On Dec 5, 2011 10:34 AM, "Raphael Cendrillon" <
> [email protected]>
> >>>> wrote:
> >>>>
> >>>>> Given that this request seems to come up frequently, would it be
> >>>>> worth putting this approach under mahout-examples? Initially it
> >>>>> could use the brute-force approach together with SSVD, and be
> >>>>> updated later once support is ready for mean subtraction within
> >>>>> SSVD.
> >>>>>
> >>>>> I could put something together if there's interest.
> >>>>>
> >>>>> On Mon, Dec 5, 2011 at 9:40 AM, Dmitriy Lyubimov <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> I am working on the additions to the SSVD algorithms, and the
> >>>>>> mods to the current solver will probably emerge in a matter of a
> >>>>>> month, my schedule permitting.
> >>>>>>
> >>>>>> However, a brute-force approach is already possible. If your
> >>>>>> input is of moderate size, or if it is already dense, you could
> >>>>>> compute the mean and subtract it yourself very easily, then shove
> >>>>>> the result into the SSVD solver while requesting it to produce
> >>>>>> either U or V, depending on whether you subtract the column-wise
> >>>>>> or the row-wise mean.
> >>>>>>
> >>>>>> The only problem with the brute-force approach is that it would
> >>>>>> densify the originally sparse input. Depending on your problem
> >>>>>> and the number of machine nodes you can spare, this may or may
> >>>>>> not be a problem.
> >>>>>> On Dec 4, 2011 7:59 PM, "magicalo" <[email protected]> wrote:
> >>>>>>
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> Is there an expected release date for the PCA algorithm as part
> >>>>>>> of Mahout?
> >>>>>>> Tx!
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
>
