Dmitriy, Are you sure that you can't defeat this sorting?
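
If all you need out of that sort is one column-wise mean, the usual way to defeat it is exactly the in-mapper combining you describe: keep a partial column-sum vector in memory inside each map task and emit it once from cleanup(), so only one tiny record per map task ever reaches the shuffle. Here is a minimal sketch of that pattern, assuming the input is (IntWritable, VectorWritable) rows as in a DistributedRowMatrix; the ColumnMeanJob/MeanMapper/MeanReducer names and the pca.num.rows config key are made up for illustration, not existing Mahout code.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.mahout.math.function.Functions;

public final class ColumnMeanJob {

  public static class MeanMapper
      extends Mapper<IntWritable, VectorWritable, NullWritable, VectorWritable> {

    private Vector partialSum;  // in-memory running column sums for this map task

    @Override
    protected void map(IntWritable rowIndex, VectorWritable row, Context ctx) {
      Vector v = row.get();
      if (partialSum == null) {
        partialSum = new DenseVector(v.size());
      }
      partialSum.assign(v, Functions.PLUS);  // accumulate; nothing gets sorted here
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
      if (partialSum != null) {
        // exactly one record per map task reaches the shuffle
        ctx.write(NullWritable.get(), new VectorWritable(partialSum));
      }
    }
  }

  public static class MeanReducer
      extends Reducer<NullWritable, VectorWritable, NullWritable, VectorWritable> {

    @Override
    protected void reduce(NullWritable key, Iterable<VectorWritable> partials,
        Context ctx) throws IOException, InterruptedException {
      Vector sum = null;
      for (VectorWritable vw : partials) {
        if (sum == null) {
          sum = new DenseVector(vw.get().size());
        }
        sum.assign(vw.get(), Functions.PLUS);
      }
      if (sum != null) {
        // the driver is assumed to have put the total row count into the config
        long numRows = ctx.getConfiguration().getLong("pca.num.rows", 1L);
        ctx.write(NullWritable.get(), new VectorWritable(sum.divide(numRows)));
      }
    }
  }
}

With one vector per map task going over the wire, the sort that the combiner route pays for is essentially free, and a single reducer is plenty.
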
On Mon, Dec 5, 2011 at 2:05 PM, Dmitriy Lyubimov <[email protected]> wrote:
> What I usually do for small products is to accumulate them in the map
> without combiners and then send them to the reducers to combine there.
> Using combiners creates a local sort for something you don't really
> want to do unless you can't keep it in memory.
>
> You also have a choice to force one reducer to ensure one product. Or
> (what I do) build a loader that just goes over all the vectors and
> reduces them on the fly whenever needed.
>
> MR is often just not so great when it comes to linear algebra
> operations. This is a major deficiency as far as I see it, because it
> forces a sort of output that already has inner structure one can
> capitalize on.
>
> Simple example: suppose you want to deinterlace a TV signal. You have
> two fields, one for the odd lines and another for the even lines. So
> naturally, what you do is create a simple merging algorithm that reads
> lines from each of the inputs alternately and outputs one single
> deinterlaced frame. O(n).
>
> Now, the MapReduce way to deal with it is to shuffle and sort all the
> lines by their line number and then output a total order: O(n log n).
>
> Dealing with large blocks is like deinterlacing: you sort, but you
> absolutely don't have to, because you already have structure in the
> output (even if the blocks don't fit in memory).
>
>
> On Mon, Dec 5, 2011 at 1:02 PM, Raphael Cendrillon
> <[email protected]> wrote:
> > Thanks for clarifying. I think we're all on the same page on this,
> > although using different terms. I'll package up the job I currently
> > have for this and submit a patch.
> >
> > By the way, currently I have the rows being added at the combiner,
> > and then the results of the combiners added in a single reducer. Do
> > you think this is sufficient, or should multiple reducers be used
> > (per column) to further spread the load?
> >
> > On Dec 5, 2011, at 11:38 AM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> >> OK, column-wise mean (the mean of all rows).
> >>
> >> On Mon, Dec 5, 2011 at 11:00 AM, Ted Dunning <[email protected]> wrote:
> >>> Row-wise mean usually means that a mean of each row is computed.
> >>>
> >>> I think that most PCA users would want column-wise means for
> >>> subtraction.
> >>>
> >>> On Mon, Dec 5, 2011 at 10:58 AM, Dmitriy Lyubimov <[email protected]> wrote:
> >>>
> >>>> We probably need a row-wise mean computation job anyway as a
> >>>> separate MR step. Wanna take a stab?
> >>>> On Dec 5, 2011 10:34 AM, "Raphael Cendrillon" <[email protected]> wrote:
> >>>>
> >>>>> Given that this request seems to come up frequently, would it be
> >>>>> worth putting this approach under mahout-examples? Initially it
> >>>>> could use the brute-force approach together with SSVD, and be
> >>>>> updated later once support for mean subtraction within SSVD is
> >>>>> ready.
> >>>>>
> >>>>> I could put something together if there's interest.
> >>>>>
> >>>>> On Mon, Dec 5, 2011 at 9:40 AM, Dmitriy Lyubimov <[email protected]> wrote:
> >>>>>
> >>>>>> I am working on the additions to the SSVD algorithms, and the
> >>>>>> mods to the current solver will probably emerge in a matter of a
> >>>>>> month, my schedule permitting.
> >>>>>>
> >>>>>> However, a brute-force approach is already possible. If your
> >>>>>> input is of moderate size, or if it is already dense, you can
> >>>>>> compute the mean and subtract it yourself very easily, then shove
> >>>>>> the result into the SSVD solver while requesting it to produce
> >>>>>> either U or V, depending on whether you subtract the column-wise
> >>>>>> or the row-wise mean.
> >>>>>>
> >>>>>> The only problem with the brute-force approach is that it
> >>>>>> densifies originally sparse input. Depending on your problem and
> >>>>>> the number of machine nodes you can spare, that may or may not be
> >>>>>> a problem.
> >>>>>> On Dec 4, 2011 7:59 PM, "magicalo" <[email protected]> wrote:
> >>>>>>
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> Is there an expected release date for the PCA algorithm as part
> >>>>>>> of Mahout?
> >>>>>>> Tx!
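
For what it's worth, the deinterlacing example quoted above really is just a two-way merge. A toy sketch of the O(n) version, in plain Java with nothing Hadoop- or Mahout-specific assumed (the class and method names are made up):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public final class Deinterlace {

  /** Interleaves the odd-line and even-line fields into one frame in a single pass. */
  public static <T> List<T> merge(Iterator<T> oddLines, Iterator<T> evenLines) {
    List<T> frame = new ArrayList<T>();
    while (oddLines.hasNext() || evenLines.hasNext()) {
      if (oddLines.hasNext()) {
        frame.add(oddLines.next());   // lines 1, 3, 5, ...
      }
      if (evenLines.hasNext()) {
        frame.add(evenLines.next());  // lines 2, 4, 6, ...
      }
    }
    return frame;                     // O(n): no shuffle, no sort
  }
}

The MapReduce version of the same thing would tag every line with its line number, shuffle, and sort into total order: the same frame at O(n log n) plus the I/O of a full shuffle, which is the structure-already-there point made above.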

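
For the brute-force route in the quoted thread (compute the mean, subtract it, then hand the matrix to the SSVD solver), a minimal in-memory sketch of the centering step follows. The class name is made up, and this is only reasonable for input that is moderate in size or already dense, since subtracting a dense mean densifies sparse rows:

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.function.Functions;

public final class BruteForceMeanCentering {

  private BruteForceMeanCentering() {}

  /** Column-wise mean: the average of all rows of a. */
  public static Vector columnMean(Matrix a) {
    Vector sum = new DenseVector(a.numCols());
    for (int i = 0; i < a.numRows(); i++) {
      sum.assign(a.viewRow(i), Functions.PLUS);
    }
    return sum.divide(a.numRows());
  }

  /** Subtracts the column-wise mean from every row in place: A := A - 1 mu^T. */
  public static void centerRows(Matrix a) {
    Vector mu = columnMean(a);
    for (int i = 0; i < a.numRows(); i++) {
      a.viewRow(i).assign(mu, Functions.MINUS);  // row_i := row_i - mu
    }
  }
}

The centered rows can then be written out as VectorWritable rows and fed to the SSVD solver, requesting U or V depending on which mean was subtracted, exactly as described in the quoted message.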