[ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158057#comment-13158057 ]

Dmitriy Lyubimov commented on MAHOUT-817:
-----------------------------------------

The way I understood the original idea from Ted: since we are performing a 
projection into B, the center of the original data would also project onto the 
center of the projected data (in this case, the data items are column vectors). 
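
To illustrate that with a toy numpy sketch (random stand-ins for A and Q, not 
the actual Mahout code): the mean commutes with any linear map, so the column 
mean of B = Q'A is exactly Q' applied to the column mean of A.

{code}
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 20))                   # columns are the data items
Q = np.linalg.qr(rng.standard_normal((1000, 10)))[0]  # some orthonormal basis

B = Q.T @ A                                           # projection into B

# center of the original columns projects onto the center of the projected columns
print(np.allclose(Q.T @ A.mean(axis=1), B.mean(axis=1)))   # True
{code}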

If row vectors are implied as the PCA items, that means subtraction of the row 
mean. I am not 100% sure how this works, but it seems that this case can be 
solved by finding the row mean of Y, call it M_y, and proceeding with Y - M_y 
instead of Y. However, I am not at all sure how it plays out, especially with 
power iterations. It would seem to me that a random projection of centered vs. 
non-centered data may not be the same in the context of this method; I don't 
immediately see that it is. 
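
For the single projection, at least, the algebra does work out: row-centering 
commutes with right-multiplication by Omega, and the row mean of Y is exactly 
the projected row mean of A, so Y - M_y equals the Y of the centered data. A 
toy numpy check of just that identity (illustrative only; it says nothing 
about power iterations or the quality of Q):

{code}
import numpy as np

rng = np.random.default_rng(1)
m, n, p = 500, 40, 15
A = rng.standard_normal((m, n)) + 3.0      # rows are the PCA items, shifted mean
Omega = rng.standard_normal((n, p))        # random projection

mu = A.mean(axis=0)                        # mean row of A
Y = A @ Omega
M_y = Y.mean(axis=0)                       # mean row of Y, equals mu @ Omega

# centering before projecting == subtracting the projected mean afterwards
print(np.allclose((A - mu) @ Omega, Y - M_y))   # True (single projection only)
{code}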

Even subtraction of the mean in B may affect accuracy, because the random 
projection captured the action of the original data, not necessarily that of 
the centered data. Once the data is centered, the optimal subspace capturing 
the variances might be quite different from the subspace produced in Q for the 
original data. That's why I say maybe the brute-force approach is the right 
one; at least I can easily convince myself that it is what PCA defines.
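
A concrete toy version of that concern (illustrative numpy, random stand-ins 
rather than anything in Mahout): give the data a large constant offset in one 
coordinate and put the real variance in another; a rank-1 range finder on the 
uncentered data spends its whole basis on the mean direction and captures 
almost none of the centered variance.

{code}
import numpy as np

rng = np.random.default_rng(2)
m = 2000
X = np.zeros((3, m))                          # columns are items
X[0] = 100.0                                  # large constant offset: the mean direction
X[1] = rng.standard_normal(m)                 # where the variance actually lives

Y = X @ rng.standard_normal((m, 1))           # rank-1 range finder on UNcentered data
Q = np.linalg.qr(Y)[0]

Xc = X - X.mean(axis=1, keepdims=True)        # what PCA actually decomposes
print(np.linalg.norm(Q.T @ Xc) / np.linalg.norm(Xc))   # tiny: Q misses the variance
{code}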

In addition, the main difficulty is that to know the mean of A we need one 
separate pass over A (at least for a row mean), and the whole idea is that we 
can probably compute it on the fly somewhere else, on already-projected data. 

bq. One question: is it necessary to do mean-subtraction of A before computing 
the QR decomposition, or will the columns of Q still
form a good basis even without mean-subtraction?

That's exactly my concern. I think this breaks the fundamental premise of the 
method (unless it somehow magically turns out to be just as good, but it would 
seem to me it is not; at least I can construct a visual counterexample in my 
head).

So assume we need to do the subtraction before attempting to find a good basis 
for the projection. Then for the case of a column-wise mean it is easy: we can 
do it on the fly, and we need just one pass over the data while doing the Y 
and Q work. If we want a row-wise mean, the brute-force way requires one more 
pass to acquire the mean.
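
A sketch of the on-the-fly part for the column-wise mean (a hypothetical numpy 
helper of mine, not the actual MR job): the mean accumulation rides along in 
the same streaming pass that computes Y.

{code}
import numpy as np

def y_pass_with_mean(row_stream, Omega):
    """Single streaming pass over A's rows: emit Y = A @ Omega and
    accumulate the column-wise mean of A at the same time.
    (Hypothetical helper, just to show the one-pass idea.)"""
    col_sum = np.zeros(Omega.shape[0])
    y_rows, count = [], 0
    for a_row in row_stream:               # one pass over A
        y_rows.append(a_row @ Omega)       # the usual Y work
        col_sum += a_row                   # mean accumulation rides along for free
        count += 1
    return np.vstack(y_rows), col_sum / count

# toy usage
rng = np.random.default_rng(3)
A = rng.standard_normal((100, 8))
Omega = rng.standard_normal((8, 4))
Y, mean = y_pass_with_mean(iter(A), Omega)
print(np.allclose(Y, A @ Omega), np.allclose(mean, A.mean(axis=0)))   # True True
{code}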

bq. It seems there are two jobs that need to be modified: BBT-job and V-job. 
Since they both work column wise it should
be straightforward to pass in the vector qs and the scalar a_mean( i ).

The BBt job is now obsolete: BBt is now produced in the reducers of the Bt job 
as a bonus and finalized in the front end.


                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>
> It seems that a simple solution should exist to integrate PCA mean 
> subtraction into the SSVD algorithm without making it a prerequisite step 
> and also without densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data 
> is already collapsed to smaller matrices
> 3) --?
> It needs some math done first. I'll take a stab at 1 and 2, but thoughts and 
> math are welcome.


        
