Sometimes, when I care. In SSVD I just work with outer products of (k+p)x(k+p) size and accumulate them in the mappers directly, reduce them to a manageable number (<= 1000), and finalize in the front end in a single thread.
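A minimal sketch of that map-side accumulation, assuming a map-only job over a distributed row matrix; the class name, the KP constant, and the IntWritable/MatrixWritable key-value layout are illustrative assumptions, not the actual SSVD solver code:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.mahout.math.DenseMatrix;
    import org.apache.mahout.math.Matrix;
    import org.apache.mahout.math.MatrixWritable;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    // Accumulates (k+p)x(k+p) outer products in memory and emits one partial
    // sum per mapper, so only a handful of small matrices leave the map side
    // instead of one outer product per input row.
    public class OuterProductAccumulatorMapper
        extends Mapper<IntWritable, VectorWritable, IntWritable, MatrixWritable> {

      private static final int KP = 110;   // k + p: small by construction
      private Matrix acc;

      @Override
      protected void setup(Context ctx) {
        acc = new DenseMatrix(KP, KP);
      }

      @Override
      protected void map(IntWritable rowIndex, VectorWritable value, Context ctx) {
        Vector y = value.get();            // dense row of length k + p
        // fold the outer product y y' into the in-memory accumulator
        for (int i = 0; i < KP; i++) {
          double yi = y.getQuick(i);
          if (yi == 0.0) {
            continue;
          }
          for (int j = 0; j < KP; j++) {
            acc.setQuick(i, j, acc.getQuick(i, j) + yi * y.getQuick(j));
          }
        }
      }

      @Override
      protected void cleanup(Context ctx) throws IOException, InterruptedException {
        // with a map-only job each mapper leaves one small matrix behind, and
        // the front end sums the <= 1000 partials in a single thread
        ctx.write(new IntWritable(0), new MatrixWritable(acc));
      }
    }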
With power iterations I also use vertical blocks of significant size as the values when I compute AB'. Vertical blocking is easy because I know the product rows are short. That alone allowed me to cut the multiplication step there by an order of magnitude compared to when I was using per-row summation of outer products (I guess that's the way DRM does multiplication, but it can't assume vertical blocking). Anyway, the key is to store fewer values for shuffle-and-sort, using bigger blocks if needed, instead of doing it on a per-vector basis.

-d
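The vertical blocking can be sketched in the same style; this is illustrative rather than the actual ABt computation in the SSVD code, and BLOCK_HEIGHT, the block-index key, and the assumption that the short product rows have already been formed are all made up for the example:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.mahout.math.DenseMatrix;
    import org.apache.mahout.math.MatrixWritable;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    // Product rows of A*B' are only k+p wide, so many of them are stacked into
    // one tall block and written as a single key/value pair, instead of one
    // pair per product row.
    public class VerticalBlockMapper
        extends Mapper<IntWritable, VectorWritable, IntWritable, MatrixWritable> {

      private static final int KP = 110;              // width of a product row (k + p)
      private static final int BLOCK_HEIGHT = 30000;  // product rows per emitted block
      private DenseMatrix block;
      private int filled;
      private int blockNo;

      @Override
      protected void setup(Context ctx) {
        block = new DenseMatrix(BLOCK_HEIGHT, KP);
      }

      @Override
      protected void map(IntWritable rowIndex, VectorWritable value, Context ctx)
          throws IOException, InterruptedException {
        Vector productRow = value.get();   // assume the short product row is already computed
        for (int j = 0; j < KP; j++) {
          block.setQuick(filled, j, productRow.getQuick(j));
        }
        if (++filled == BLOCK_HEIGHT) {
          // one tall block goes through shuffle-and-sort under its block index
          ctx.write(new IntWritable(blockNo++), new MatrixWritable(block));
          block = new DenseMatrix(BLOCK_HEIGHT, KP);
          filled = 0;
        }
      }

      @Override
      protected void cleanup(Context ctx) throws IOException, InterruptedException {
        if (filled > 0) {
          // last, possibly partial block; a real job would trim the unused rows
          ctx.write(new IntWritable(blockNo), new MatrixWritable(block));
        }
      }
    }

Downstream consumers can then concatenate the blocks in key order, so shuffle-and-sort handles far fewer, larger values instead of one short row at a time.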
On Mon, Dec 5, 2011 at 2:29 PM, Ted Dunning <[email protected]> wrote:
> Dmitriy,
>
> Are you sure that you can't defeat this sorting?
>
> On Mon, Dec 5, 2011 at 2:05 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> What I usually do for small products is accumulate them in the map
>> without combiners and then send them to the reducers to combine there.
>> Using combiners creates a local sort for something you don't really want
>> to do unless you can't keep it in memory.
>>
>> You also have the choice of forcing one reducer to ensure a single
>> product, or (what I do) building a loader that just walks all the vectors
>> and reduces them on the fly whenever it is needed.
>>
>> MR is too often fundamentally not so great when it comes to linear
>> algebra operations. This is a major deficiency as far as I see it,
>> because it forces you to sort output that already has an inner structure
>> you can capitalize on.
>>
>> Simple example: suppose you want to deinterlace a TV signal. You have
>> two fields, one for the odd lines and another for the even lines.
>> Naturally, what you do is write a simple merging algorithm that reads
>> lines from each of the inputs alternately and outputs one single
>> deinterlaced frame: O(n).
>>
>> The MapReduce way to deal with it is to shuffle and sort all lines by
>> their line number and then output the total order: O(n log n).
>>
>> Dealing with large blocks is like deinterlacing: you sort, but you
>> absolutely don't have to, because you already have structure in the
>> output (even if the blocks don't fit in memory).
>>
>>
>> On Mon, Dec 5, 2011 at 1:02 PM, Raphael Cendrillon
>> <[email protected]> wrote:
>>> Thanks for clarifying. I think we're all on the same page on this,
>>> although using different terms. I'll package up the job I currently
>>> have for this and submit a patch.
>>>
>>> By the way, currently I have the rows being added at the combiner, and
>>> then the results of the combiners added in a single reducer. Do you
>>> think this is sufficient, or should multiple reducers be used (per
>>> column) to further spread the load?
>>>
>>> On Dec 5, 2011, at 11:38 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>
>>>> OK, column-wise mean (the mean of all rows).
>>>>
>>>> On Mon, Dec 5, 2011 at 11:00 AM, Ted Dunning <[email protected]> wrote:
>>>>> Row-wise mean usually means that a mean of each row is computed.
>>>>>
>>>>> I think that most PCA users would want column-wise means for
>>>>> subtraction.
>>>>>
>>>>> On Mon, Dec 5, 2011 at 10:58 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>
>>>>>> We probably need a row-wise mean computation job anyway as a separate
>>>>>> MR step. Wanna take a stab?
>>>>>> On Dec 5, 2011 10:34 AM, "Raphael Cendrillon" <[email protected]> wrote:
>>>>>>
>>>>>>> Given that this request seems to come up frequently, would it be
>>>>>>> worth putting this approach under mahout-examples? Initially it
>>>>>>> could use the brute-force approach together with SSVD, and be
>>>>>>> updated later once support is ready for mean subtraction within SSVD.
>>>>>>>
>>>>>>> I could put something together if there's interest.
>>>>>>>
>>>>>>> On Mon, Dec 5, 2011 at 9:40 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>
>>>>>>>> I am working on the additions to the SSVD algorithms, and the mods
>>>>>>>> to the current solver will probably emerge in a matter of a month,
>>>>>>>> my schedule permitting.
>>>>>>>>
>>>>>>>> However, a brute-force approach is already possible. If your input
>>>>>>>> is of moderate size, or if it is already dense, you could compute
>>>>>>>> the mean and subtract it yourself very easily, then shove the result
>>>>>>>> into the SSVD solver while requesting it to produce either U or V,
>>>>>>>> depending on whether you subtract the column-wise or row-wise mean.
>>>>>>>>
>>>>>>>> The only problem with the brute-force approach is that it densifies
>>>>>>>> originally sparse input. Depending on your problem and the number of
>>>>>>>> machine nodes you can spare, that may or may not be a problem.
>>>>>>>> On Dec 4, 2011 7:59 PM, "magicalo" <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> Is there an expected release date for the PCA algorithm as part of
>>>>>>>>> Mahout?
>>>>>>>>> Tx!
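For the column-wise mean job discussed in the quoted thread (rows summed map-side, partials added in a single reducer), a rough sketch could look as follows; the class names, the NullWritable key, and the "mean.num.rows" configuration entry are assumptions for illustration, not the patch being described:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;
    import org.apache.mahout.math.function.Functions;

    // Column-wise mean in two stages: each mapper emits one partial column
    // sum, and a single reducer adds the partials and divides by the row
    // count passed through the job Configuration.
    public class ColumnMeanJob {

      public static class SumMapper
          extends Mapper<IntWritable, VectorWritable, NullWritable, VectorWritable> {

        private Vector partial;

        @Override
        protected void map(IntWritable row, VectorWritable value, Context ctx) {
          Vector v = value.get();
          if (partial == null) {
            partial = new DenseVector(v.size());
          }
          partial.assign(v, Functions.PLUS);   // running column sum, in memory
        }

        @Override
        protected void cleanup(Context ctx) throws IOException, InterruptedException {
          if (partial != null) {
            // one partial sum per mapper instead of one record per row
            ctx.write(NullWritable.get(), new VectorWritable(partial));
          }
        }
      }

      public static class MeanReducer
          extends Reducer<NullWritable, VectorWritable, NullWritable, VectorWritable> {

        @Override
        protected void reduce(NullWritable key, Iterable<VectorWritable> partials, Context ctx)
            throws IOException, InterruptedException {
          Vector sum = null;
          for (VectorWritable w : partials) {
            if (sum == null) {
              sum = new DenseVector(w.get().size());
            }
            sum.assign(w.get(), Functions.PLUS);
          }
          long numRows = ctx.getConfiguration().getLong("mean.num.rows", 1L);
          if (sum != null) {
            ctx.write(NullWritable.get(), new VectorWritable(sum.divide(numRows)));
          }
        }
      }
    }

With at most one partial sum per mapper, a single reducer is unlikely to be a bottleneck; per-column reducers would only start to matter if a single column-sum vector could not be held in memory.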

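The deinterlacing example quoted above comes down to a plain two-way merge; a toy sketch (the generic line type and iterator inputs are assumptions) of the O(n) pass that needs no shuffle-and-sort over line numbers:

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    // Interleaves the odd and even fields into one frame in a single linear
    // pass, exploiting the structure the two inputs already have.
    public final class Deinterlace {

      public static <T> List<T> deinterlace(Iterator<T> oddLines, Iterator<T> evenLines) {
        List<T> frame = new ArrayList<>();
        while (oddLines.hasNext() || evenLines.hasNext()) {
          if (oddLines.hasNext()) {
            frame.add(oddLines.next());   // lines 1, 3, 5, ...
          }
          if (evenLines.hasNext()) {
            frame.add(evenLines.next());  // lines 2, 4, 6, ...
          }
        }
        return frame;
      }
    }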