[
https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091927#comment-13091927
]
Dmitriy Lyubimov commented on MAHOUT-792:
-----------------------------------------
More thoughts:
1 -- it looks like this method compresses a lot of information into the
(k+p) x (k+p) upper-triangular R, so I'd guess there is more potential for
rounding error there. (I am not buying too much into rounding-error
arguments, though.)
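For reference, in the usual QR-flavored SSVD notation (an assumption on my
part; dimensions chosen to match "Q is n x (k+p)" in point 2):

    Y = A\Omega \in \mathbb{R}^{n \times (k+p)}, \qquad
    Y = QR, \quad Q \in \mathbb{R}^{n \times (k+p)}, \quad
    R \in \mathbb{R}^{(k+p) \times (k+p)} \ \text{upper-triangular}

so all of the column-scale information of Y funnels through the small R
factor, which is where the conditioning concern would come from.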
2 -- the argument that the Q matrix is large is not very convincing to me. Q
is n x (k+p); assuming A is much wider than k+p (in terms of average non-zero
elements per row), which it should be for the projection method to make sense
at all, Q is only a fraction of A. Besides, Q is written in blocks by mappers
and read back by mappers, so each mapper sees only one block of, say, 30,000 x
(k+p) at a time; with k+p = 500 that amounts to about 120 MB per mapper. So
there is not much actual memory pressure; that is what distributed computation
is for. The same goes for dense operations: we just distribute them.
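A quick check of that figure, assuming dense 8-byte doubles:

    30{,}000 \times 500 \times 8~\text{bytes}
    = 1.2 \times 10^{8}~\text{bytes} \approx 120~\text{MB}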
3 -- it looks like you are writing B as the output of the second MR, and B is
(k+p) x n. Going back to the 'big Q' argument, we can't assume B would be any
smaller. In fact, some time ago I came to the conclusion that the projection
method would be much more efficient when the input is wide rather than tall,
since the projection compression factor is much better for what seems to be a
fairly inexpensive operation (it costs nothing to redistribute Omega, which is
backed only by a single long used as a random-generator seed, as sketched
below, as opposed to actual tall or wide bands such as B or Q). So we can't
exclude very wide inputs.
Overall it looks like a great improvement. I am not convinced it would cut
processing time that much: it seems to have the same number of proposed MR
steps, but all of them would require a shuffle-and-sort and a reduce phase,
whereas the QR approach is a reduce-less process. But it definitely reduces
the complexity of the MR implementation, and I would be very eager to try it
out. Certainly everything I said is first impression and intuition only, and
in my experience intuition turns out to be wrong surprisingly often where
benchmark guesstimates are concerned.
> Add new stochastic decomposition code
> -------------------------------------
>
> Key: MAHOUT-792
> URL: https://issues.apache.org/jira/browse/MAHOUT-792
> Project: Mahout
> Issue Type: New Feature
> Reporter: Ted Dunning
> Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out a simplification for our SSVD algorithms. It eliminates
> the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
> - a CholeskyDecomposition implementation that optionally does pivoting (and
> is thus rank-revealing when it does). This should actually be useful for
> solving large out-of-core least-squares problems.
> - an in-memory SSVD implementation that should work for matrices up to
> about 1/3 of available memory.
> - an out-of-core SSVD threaded implementation that should work for very
> large matrices. It should take time about equal to the cost of reading the
> input matrix 4 times and will require working disk roughly equal to the size
> of the input.
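For reference: the QR elimination described above is presumably the standard
Cholesky-QR identity (my reading of the approach, not a summary of the
attached sd-2.pdf). Form the small Gram matrix of Y = A*Omega and
Cholesky-factor it:

    Y^{T}Y = LL^{T} \ (\text{Cholesky}), \qquad
    Q = YL^{-T}, \qquad R = L^{T}

Then Q^{T}Q = L^{-1}(Y^{T}Y)L^{-T} = I, so Y = QR is recovered without an
explicit distributed QR pass; only the tiny (k+p) x (k+p) Gram matrix ever
needs decomposing.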