[ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158057#comment-13158057 ]
Dmitriy Lyubimov commented on MAHOUT-817:
-----------------------------------------
The way I understood the original idea from Ted: since we are performing a
projection into B, the center of the original data would also project onto the
center of the projected data (in this case, the data are column vectors).
If row vectors are implied as the PCA items, that means subtracting the row
mean. I am not 100% sure how this works, but it seems this case can be solved
by finding the row mean of Y and proceeding with Y - M_y instead of Y. However,
I am not sure at all how it plays out, especially with power iterations. It
would seem to me that the random projection of centered vs. non-centered data
may not be the same in the context of this method; I don't immediately see that
it is.
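For the first, un-iterated projection the Y - M_y idea actually checks out
algebraically. A quick derivation (my notation, not from any patch): let \mu be
the mean row of A and \Omega the random projection matrix, with items as rows:

{noformat}
A_c = A - \mathbf{1}\mu^T, \qquad Y = A\Omega
A_c\Omega = A\Omega - \mathbf{1}(\mu^T\Omega) = Y - \mathbf{1} m_y^T,
\qquad m_y^T = \mu^T\Omega = \tfrac{1}{m}\mathbf{1}^T Y
{noformat}

So centering Y by its mean row reproduces the sketch of the centered A exactly,
with no extra pass over A. The identity does break once power iterations enter:
(A_c A_c^T)^q A_c \Omega expands with cross terms that (A A^T)^q A \Omega does
not contain, which matches the doubt above.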
Even subtracting the mean in B may affect accuracy, because the random
projection captured the action of the original data, but not necessarily of the
centered data. Once the data is centered, the optimal subspace capturing the
variances might be quite different from the original subspace produced in Q.
That's why I say the brute-force approach may be the right one; at least I can
easily convince myself that it is what PCA defines.
In addition, the main difficulty is that to know the mean of A we need one
separate pass over A (at least with a row mean), and the whole idea is that we
can probably do it on the fly somewhere else, with already projected data.
bq. One question: is it necessary to do mean-subtraction of A before computing the QR decomposition, or will the columns of Q still form a good basis even without mean-subtraction?
That's exactly my concern. I think this breaks the fundamental premise of the
method (unless it somehow magically turns out to be just as good, but it would
seem to me it is not; at least I can construct a visual counterexample in my
head).
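One way to bound how far off the uncentered basis can be (my own observation,
not from the thread): each column of the centered matrix differs from the
corresponding column of A only by a multiple of the all-ones vector, so the
exact column spaces differ by at most one direction:

{noformat}
\operatorname{col}(A - \mathbf{1}\mu^T) \;\subseteq\; \operatorname{col}(A) + \operatorname{span}(\mathbf{1})
{noformat}

Augmenting Q with \mathbf{1} (orthogonalized against Q) would therefore recover
the missing direction exactly. It does not, however, guarantee that the
dominant k-dimensional subspace of the centered data is the one captured, so
the counterexample intuition still stands.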
So assume we need to do the subtraction before attempting to find a good basis
for the projection. Then, for the case of a column-wise mean, it is easy: we
can do it on the fly, and we need just one pass over the data while doing the Y
and Q stuff (see the sketch below). If we want a row-wise mean, the brute-force
approach requires one more pass to acquire the mean.
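A minimal sketch of that single-pass, column-wise case in plain Java with dense
arrays (illustrative only; the class name OnTheFlyMeanSketch is made up, and
this is not the map-reduce job code): accumulate the column sums of A while
forming Y = A*Omega, then apply the rank-one correction derived above.

{code:java}
import java.util.Random;

public class OnTheFlyMeanSketch {

  public static void main(String[] args) {
    int m = 1000, n = 50, k = 10;
    Random rnd = new Random(42);

    // Toy input with deliberately nonzero column means.
    double[][] a = new double[m][n];
    for (int i = 0; i < m; i++)
      for (int j = 0; j < n; j++)
        a[i][j] = rnd.nextGaussian() + j;

    // Dense Gaussian projection matrix Omega (n x k).
    double[][] omega = new double[n][k];
    for (int i = 0; i < n; i++)
      for (int j = 0; j < k; j++)
        omega[i][j] = rnd.nextGaussian();

    double[][] y = new double[m][k];
    double[] colSum = new double[n];

    // One pass over the rows of A: form Y = A * Omega and the column sums.
    for (int i = 0; i < m; i++) {
      for (int j = 0; j < n; j++) {
        colSum[j] += a[i][j];
        for (int p = 0; p < k; p++)
          y[i][p] += a[i][j] * omega[j][p];
      }
    }

    // muOmega = mu^T * Omega, where mu is the vector of column means of A.
    double[] muOmega = new double[k];
    for (int j = 0; j < n; j++) {
      double mu = colSum[j] / m;
      for (int p = 0; p < k; p++)
        muOmega[p] += mu * omega[j][p];
    }

    // Y_c = Y - 1 * muOmega^T: the sketch of the centered matrix,
    // obtained without a second pass over A.
    for (int i = 0; i < m; i++)
      for (int p = 0; p < k; p++)
        y[i][p] -= muOmega[p];

    System.out.println("corrected Y[0][0] = " + y[0][0]);
  }
}
{code}

In the map-reduce setting the column-sum accumulation would simply ride along
with the Y/Q computation, which is what keeps the whole thing at one pass.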
bq. It seems there are two jobs that need to be modified: BBT-job and V-job. Since they both work column-wise it should be straightforward to pass in the vector qs and the scalar a_mean( i ).
The BBt job is now obsolete. BBt is now produced in the reducers of the Bt job
as a bonus and finalized in the front end.
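For what it's worth, approach 2 from the description quoted below (propagating
the mean into the already-collapsed matrices) also reduces to small rank-one
updates. A hedged derivation, writing s_q for Q^T\mathbf{1} (my notation, not
necessarily what a patch would use):

{noformat}
\hat{B} = Q^T(A - \mathbf{1}\mu^T) = B - (Q^T\mathbf{1})\mu^T = B - s_q\mu^T
\hat{B}\hat{B}^T = BB^T - (B\mu)s_q^T - s_q(B\mu)^T + (\mu^T\mu)\, s_q s_q^T
{noformat}

Both corrections involve only k-sized vectors and the mean vector, so the big
input is never densified.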
> Add PCA options to SSVD code
> ----------------------------
>
> Key: MAHOUT-817
> URL: https://issues.apache.org/jira/browse/MAHOUT-817
> Project: Mahout
> Issue Type: New Feature
> Affects Versions: 0.6
> Reporter: Dmitriy Lyubimov
> Assignee: Dmitriy Lyubimov
> Fix For: Backlog
>
>
> It seems that a simple solution should exist to integrate PCA mean
> subtraction into the SSVD algorithm without making it a prerequisite step and
> also without densifying the big input.
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate the mean vector deeper into the algorithm algebraically, where
> the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first. I'll take a stab at 1 and 2, but thoughts and
> math are welcome.