Sometimes, when I care. In SSVD I just work with outer products of (k+p)x(k+p) size and accumulate them in the mappers directly, reduce them to a manageable number (<= 1000), and finalize in the front end in a single thread.
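A minimal sketch of that map-side accumulation, assuming a map-only job over a distributed row matrix; the class name, the KP constant, and the IntWritable/MatrixWritable key-value layout are illustrative assumptions, not the actual SSVD solver code:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.mahout.math.DenseMatrix;
    import org.apache.mahout.math.Matrix;
    import org.apache.mahout.math.MatrixWritable;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    // Accumulates (k+p)x(k+p) outer products in memory and emits one partial
    // sum per mapper, so only a handful of small matrices leave the map side
    // instead of one outer product per input row.
    public class OuterProductAccumulatorMapper
        extends Mapper<IntWritable, VectorWritable, IntWritable, MatrixWritable> {

      private static final int KP = 110;   // k + p: small by construction
      private Matrix acc;

      @Override
      protected void setup(Context ctx) {
        acc = new DenseMatrix(KP, KP);
      }

      @Override
      protected void map(IntWritable rowIndex, VectorWritable value, Context ctx) {
        Vector y = value.get();            // dense row of length k + p
        // fold the outer product y y' into the in-memory accumulator
        for (int i = 0; i < KP; i++) {
          double yi = y.getQuick(i);
          if (yi == 0.0) {
            continue;
          }
          for (int j = 0; j < KP; j++) {
            acc.setQuick(i, j, acc.getQuick(i, j) + yi * y.getQuick(j));
          }
        }
      }

      @Override
      protected void cleanup(Context ctx) throws IOException, InterruptedException {
        // with a map-only job each mapper leaves one small matrix behind, and
        // the front end sums the <= 1000 partials in a single thread
        ctx.write(new IntWritable(0), new MatrixWritable(acc));
      }
    }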
With power iterations I also use vertical blocks of significant size as the values when I compute AB'. Vertical blocking is easy because I know the product rows are short. That alone allowed me to cut the multiplication step there by an order of magnitude compared to when I was using per-row summation of outer products (I guess that's the way DRM does multiplication, but it can't assume vertical blocking). Anyway, the key is to store fewer values for shuffle-and-sort, using bigger blocks if needed, instead of doing it on a per-vector basis.

-d
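The vertical blocking can be sketched in the same style; this is illustrative rather than the actual ABt computation in the SSVD code, and BLOCK_HEIGHT, the block-index key, and the assumption that the short product rows have already been formed are all made up for the example:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.mahout.math.DenseMatrix;
    import org.apache.mahout.math.MatrixWritable;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    // Product rows of A*B' are only k+p wide, so many of them are stacked into
    // one tall block and written as a single key/value pair, instead of one
    // pair per product row.
    public class VerticalBlockMapper
        extends Mapper<IntWritable, VectorWritable, IntWritable, MatrixWritable> {

      private static final int KP = 110;              // width of a product row (k + p)
      private static final int BLOCK_HEIGHT = 30000;  // product rows per emitted block
      private DenseMatrix block;
      private int filled;
      private int blockNo;

      @Override
      protected void setup(Context ctx) {
        block = new DenseMatrix(BLOCK_HEIGHT, KP);
      }

      @Override
      protected void map(IntWritable rowIndex, VectorWritable value, Context ctx)
          throws IOException, InterruptedException {
        Vector productRow = value.get();   // assume the short product row is already computed
        for (int j = 0; j < KP; j++) {
          block.setQuick(filled, j, productRow.getQuick(j));
        }
        if (++filled == BLOCK_HEIGHT) {
          // one tall block goes through shuffle-and-sort under its block index
          ctx.write(new IntWritable(blockNo++), new MatrixWritable(block));
          block = new DenseMatrix(BLOCK_HEIGHT, KP);
          filled = 0;
        }
      }

      @Override
      protected void cleanup(Context ctx) throws IOException, InterruptedException {
        if (filled > 0) {
          // last, possibly partial block; a real job would trim the unused rows
          ctx.write(new IntWritable(blockNo), new MatrixWritable(block));
        }
      }
    }

Downstream consumers can then concatenate the blocks in key order, so shuffle-and-sort handles far fewer, larger values instead of one short row at a time.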
On Mon, Dec 5, 2011 at 2:29 PM, Ted Dunning <[email protected]> wrote:
> Dmitriy,
>
> Are you sure that you can't defeat this sorting?
>
> On Mon, Dec 5, 2011 at 2:05 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> What I usually do for small products is accumulate them in the map
>> without combiners and then send them to the reducers to combine there.
>> Using combiners creates a local sort for something you don't really want
>> to do unless you can't keep it in memory.
>>
>> You also have the choice of forcing one reducer to ensure a single
>> product, or (what I do) building a loader that just walks all the vectors
>> and reduces them on the fly whenever it is needed.
>>
>> MR is too often fundamentally not so great when it comes to linear
>> algebra operations. This is a major deficiency as far as I see it,
>> because it forces you to sort output that already has an inner structure
>> you can capitalize on.
>>
>> Simple example: suppose you want to deinterlace a TV signal. You have
>> two fields, one for the odd lines and another for the even lines.
>> Naturally, what you do is write a simple merging algorithm that reads
>> lines from each of the inputs alternately and outputs one single
>> deinterlaced frame: O(n).
>>
>> The MapReduce way to deal with it is to shuffle and sort all lines by
>> their line number and then output the total order: O(n log n).
>>
>> Dealing with large blocks is like deinterlacing: you sort, but you
>> absolutely don't have to, because you already have structure in the
>> output (even if the blocks don't fit in memory).
>>
>>
>> On Mon, Dec 5, 2011 at 1:02 PM, Raphael Cendrillon
>> <[email protected]> wrote:
>>> Thanks for clarifying. I think we're all on the same page on this,
>>> although using different terms. I'll package up the job I currently
>>> have for this and submit a patch.
>>>
>>> By the way, currently I have the rows being added at the combiner, and
>>> then the results of the combiners added in a single reducer. Do you
>>> think this is sufficient, or should multiple reducers be used (per
>>> column) to further spread the load?
>>>
>>> On Dec 5, 2011, at 11:38 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>
>>>> OK, column-wise mean (the mean of all rows).
>>>>
>>>> On Mon, Dec 5, 2011 at 11:00 AM, Ted Dunning <[email protected]> wrote:
>>>>> Row-wise mean usually means that a mean of each row is computed.
>>>>>
>>>>> I think that most PCA users would want column-wise means for
>>>>> subtraction.
>>>>>
>>>>> On Mon, Dec 5, 2011 at 10:58 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>
>>>>>> We probably need a row-wise mean computation job anyway as a separate
>>>>>> MR step. Wanna take a stab?
>>>>>> On Dec 5, 2011 10:34 AM, "Raphael Cendrillon" <[email protected]> wrote:
>>>>>>
>>>>>>> Given that this request seems to come up frequently, would it be
>>>>>>> worth putting this approach under mahout-examples? Initially it
>>>>>>> could use the brute-force approach together with SSVD, and be
>>>>>>> updated later once support is ready for mean subtraction within SSVD.
>>>>>>>
>>>>>>> I could put something together if there's interest.
>>>>>>>
>>>>>>> On Mon, Dec 5, 2011 at 9:40 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>
>>>>>>>> I am working on the additions to the SSVD algorithms, and the mods
>>>>>>>> to the current solver will probably emerge in a matter of a month,
>>>>>>>> my schedule permitting.
>>>>>>>>
>>>>>>>> However, a brute-force approach is already possible. If your input
>>>>>>>> is of moderate size, or if it is already dense, you could compute
>>>>>>>> the mean and subtract it yourself very easily, then shove the result
>>>>>>>> into the SSVD solver while requesting it to produce either U or V,
>>>>>>>> depending on whether you subtract the column-wise or row-wise mean.
>>>>>>>>
>>>>>>>> The only problem with the brute-force approach is that it densifies
>>>>>>>> originally sparse input. Depending on your problem and the number of
>>>>>>>> machine nodes you can spare, that may or may not be a problem.
>>>>>>>> On Dec 4, 2011 7:59 PM, "magicalo" <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> Is there an expected release date for the PCA algorithm as part of
>>>>>>>>> Mahout?
>>>>>>>>> Tx!
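For the column-wise mean job discussed in the quoted thread (rows summed map-side, partials added in a single reducer), a rough sketch could look as follows; the class names, the NullWritable key, and the "mean.num.rows" configuration entry are assumptions for illustration, not the patch being described:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;
    import org.apache.mahout.math.function.Functions;

    // Column-wise mean in two stages: each mapper emits one partial column
    // sum, and a single reducer adds the partials and divides by the row
    // count passed through the job Configuration.
    public class ColumnMeanJob {

      public static class SumMapper
          extends Mapper<IntWritable, VectorWritable, NullWritable, VectorWritable> {

        private Vector partial;

        @Override
        protected void map(IntWritable row, VectorWritable value, Context ctx) {
          Vector v = value.get();
          if (partial == null) {
            partial = new DenseVector(v.size());
          }
          partial.assign(v, Functions.PLUS);   // running column sum, in memory
        }

        @Override
        protected void cleanup(Context ctx) throws IOException, InterruptedException {
          if (partial != null) {
            // one partial sum per mapper instead of one record per row
            ctx.write(NullWritable.get(), new VectorWritable(partial));
          }
        }
      }

      public static class MeanReducer
          extends Reducer<NullWritable, VectorWritable, NullWritable, VectorWritable> {

        @Override
        protected void reduce(NullWritable key, Iterable<VectorWritable> partials, Context ctx)
            throws IOException, InterruptedException {
          Vector sum = null;
          for (VectorWritable w : partials) {
            if (sum == null) {
              sum = new DenseVector(w.get().size());
            }
            sum.assign(w.get(), Functions.PLUS);
          }
          long numRows = ctx.getConfiguration().getLong("mean.num.rows", 1L);
          if (sum != null) {
            ctx.write(NullWritable.get(), new VectorWritable(sum.divide(numRows)));
          }
        }
      }
    }

With at most one partial sum per mapper, a single reducer is unlikely to be a bottleneck; per-column reducers would only start to matter if a single column-sum vector could not be held in memory.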

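The deinterlacing example quoted above comes down to a plain two-way merge; a toy sketch (the generic line type and iterator inputs are assumptions) of the O(n) pass that needs no shuffle-and-sort over line numbers:

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    // Interleaves the odd and even fields into one frame in a single linear
    // pass, exploiting the structure the two inputs already have.
    public final class Deinterlace {

      public static <T> List<T> deinterlace(Iterator<T> oddLines, Iterator<T> evenLines) {
        List<T> frame = new ArrayList<>();
        while (oddLines.hasNext() || evenLines.hasNext()) {
          if (oddLines.hasNext()) {
            frame.add(oddLines.next());   // lines 1, 3, 5, ...
          }
          if (evenLines.hasNext()) {
            frame.add(evenLines.next());  // lines 2, 4, 6, ...
          }
        }
        return frame;
      }
    }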