[
https://issues.apache.org/jira/browse/MAHOUT-376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965608#action_12965608
]
Dmitriy Lyubimov commented on MAHOUT-376:
-----------------------------------------
Yes, it is 100% streaming in terms of A and Y rows. The assumption is that we
are OK with loading one A row into memory at a time, and we optimize for tall
matrices (such as a billion by a million). Even if it is dense, one such row
vector would take 8MB of memory at a time. But sparse sequential vectors should
be OK too (it will probably require a little tweak in the Y computation to scan
the vector one time sequentially instead of k+p times, as I think is done now
on the assumption that it can be accessed randomly).
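To illustrate the single-scan tweak mentioned above, here is a minimal sketch in Python, not the patch's actual code. It assumes a sparse row is a list of (column index, value) pairs and that row j of the random projection matrix can be regenerated on demand from a seed; omega_row, y_row_single_pass and y_row_multi_pass are hypothetical names used only for this example.

```python
import random


def omega_row(j, kp, seed=0):
    """Hypothetical helper: regenerate row j of the random matrix Omega on demand."""
    rng = random.Random(seed * 1_000_003 + j)
    return [rng.gauss(0.0, 1.0) for _ in range(kp)]


def y_row_multi_pass(sparse_row, kp, seed=0):
    """One Y entry at a time: the sparse row is scanned kp times."""
    return [
        sum(a * omega_row(j, kp, seed)[c] for j, a in sparse_row)
        for c in range(kp)
    ]


def y_row_single_pass(sparse_row, kp, seed=0):
    """Scan the sparse row once, updating all kp accumulators per nonzero."""
    y = [0.0] * kp
    for j, a in sparse_row:  # sequential access only
        w = omega_row(j, kp, seed)
        for c in range(kp):
            y[c] += a * w[c]
    return y


if __name__ == "__main__":
    row = [(3, 1.5), (17, -2.0), (250_000, 0.25)]  # (column, value) pairs
    print(y_row_single_pass(row, kp=5))
```

Both functions compute the same row of Y = A * Omega; the single-pass form only needs a sequential iterator over the nonzeros, which is what a sequential-access sparse vector provides cheaply.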
For memory, the concern is the random-access Q blocks, which can be no smaller
than k+p by k+p (that is, for the case of k+p=500, a block gets to be 2MB). But
that is all as far as memory is concerned. (Well, actually 2 times that, plus
there's a Y lookahead buffer to make sure we can safely form the next block,
plus there's a packed R. So for k+p=500 it looks like the minimum memory
requirement is roughly in the area of 7-8MB, which is well below anything.)
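The figures quoted above can be checked with a few lines of arithmetic. This is a rough sketch assuming 8-byte doubles and assuming "packed R" means the upper triangle of a (k+p) by (k+p) matrix stored without the zeros; the size of the Y lookahead buffer is not stated in the comment, so it is left out rather than guessed.

```python
BYTES_PER_DOUBLE = 8


def dense_row_bytes(n_cols):
    """Memory for one dense row of A held as doubles."""
    return n_cols * BYTES_PER_DOUBLE


def q_block_bytes(kp):
    """Memory for one (k+p) x (k+p) block of doubles."""
    return kp * kp * BYTES_PER_DOUBLE


def packed_r_bytes(kp):
    """Assumption: R is upper triangular and stored packed."""
    return kp * (kp + 1) // 2 * BYTES_PER_DOUBLE


if __name__ == "__main__":
    kp = 500
    print("dense row, 1M columns:", dense_row_bytes(1_000_000), "bytes")
    print("one block:", q_block_bytes(kp), "bytes")
    print("two blocks:", 2 * q_block_bytes(kp), "bytes")
    print("packed R:", packed_r_bytes(kp), "bytes")
```

Under these assumptions two blocks plus the packed R come to about 5MB, so the remainder of the quoted 7-8MB would be the Y lookahead buffer.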
CPU may be more of a problem, but I am actually not sure whether a Givens
series would produce more crunching than e.g. Householder's. Givens is
certainly as numerically stable as Householder's and better than Gram-Schmidt.
In my tests for a 100k-tall matrix the orthonormality residuals seem to hold at
about no less than 10e-13, and surprisingly I did not notice any degradation at
all compared to smaller sizes. (Actually, I happened to read about LAPACK
methods that prefer Givens for the possibility of re-ordering and thus easier
parallelization.)
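For readers who want to reproduce this kind of orthonormality check on a small scale, here is a minimal dense Givens QR sketch in plain Python. It is not the blocked streaming scheme from the patch: it builds the full m by m rotation product, so it only suits small matrices, and the residual used here (largest absolute entry of Q'Q - I) is an assumed definition, since the comment does not say how the residual was measured.

```python
import math
import random


def givens(a, b):
    """Return (c, s) so that [[c, s], [-s, c]] maps (a, b) to (r, 0)."""
    if b == 0.0:
        return 1.0, 0.0
    r = math.hypot(a, b)
    return a / r, b / r


def givens_qr(A):
    """Thin QR of an m x n matrix (m >= n) via a series of Givens rotations."""
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    Qt = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for j in range(n):
        # Zero column j from the bottom up by rotating adjacent row pairs.
        for i in range(m - 1, j, -1):
            c, s = givens(R[i - 1][j], R[i][j])
            for M in (R, Qt):
                upper, lower = M[i - 1], M[i]
                M[i - 1] = [c * u + s * l for u, l in zip(upper, lower)]
                M[i] = [-s * u + c * l for u, l in zip(upper, lower)]
    # Qt holds Q transposed; the thin Q is its first n rows, transposed.
    Q = [[Qt[r][i] for r in range(n)] for i in range(m)]
    return Q, [row[:] for row in R[:n]]


def orthonormality_residual(Q):
    """Largest absolute entry of Q'Q - I."""
    m, n = len(Q), len(Q[0])
    worst = 0.0
    for a in range(n):
        for b in range(n):
            dot = sum(Q[i][a] * Q[i][b] for i in range(m))
            worst = max(worst, abs(dot - (1.0 if a == b else 0.0)))
    return worst


if __name__ == "__main__":
    rng = random.Random(1)
    A = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(60)]
    Q, R = givens_qr(A)
    print("orthonormality residual:", orthonormality_residual(Q))
```

Each rotation touches only two rows, which is why rotations on disjoint row pairs can be reordered or run independently.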
Anyway, speaking of numerical stability, whatever degradation occurs, I think
it would be dwarfed by the stochastic inaccuracy, which grows quite
significantly in my low-rank tests. Perhaps for k+p=500 it should degrade much
less than for 20-30.
> Implement Map-reduce version of stochastic SVD
> ----------------------------------------------
>
> Key: MAHOUT-376
> URL: https://issues.apache.org/jira/browse/MAHOUT-376
> Project: Mahout
> Issue Type: Improvement
> Components: Math
> Reporter: Ted Dunning
> Assignee: Ted Dunning
> Fix For: 0.5
>
> Attachments: MAHOUT-376.patch, Modified stochastic svd algorithm for
> mapreduce.pdf, QR decomposition for Map.pdf, QR decomposition for Map.pdf, QR
> decomposition for Map.pdf, sd-bib.bib, sd.pdf, sd.pdf, sd.pdf, sd.pdf,
> sd.tex, sd.tex, sd.tex, sd.tex, SSVD working notes.pdf, SSVD working
> notes.pdf, SSVD working notes.pdf, ssvd-CDH3-or-0.21.patch.gz,
> ssvd-m1.patch.gz, ssvd-m2.patch.gz, ssvd-m3.patch.gz, Stochastic SVD using
> eigensolver trick.pdf
>
>
> See attached pdf for outline of proposed method.
> All comments are welcome.