[ 
https://issues.apache.org/jira/browse/MAHOUT-376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965597#action_12965597
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-376 at 12/1/10 2:48 AM:
------------------------------------------------------------------

Final trunk patch for the CDH3 or 0.21 API. 
This includes code cleanup, javadoc updates, and a Mahout CLI class (not yet 
tested). 

All existing tests and the test added by this patch pass. I tested a 100Kx100 
matrix in local mode only; the S values agree to within 1e-10 or better.
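
For anyone who wants to repeat that kind of check, here is a minimal sketch of a 
tolerance comparison between two sets of singular values. It is not code from 
the patch: the class and method names are hypothetical, and it assumes an 
absolute tolerance and that some reference decomposition of the same matrix is 
available to compare against.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of the patch: compares two sets of singular values. */
final class SingularValueCheck {

  /** Largest absolute difference between corresponding singular values. */
  static double maxAbsDiff(double[] expected, double[] actual) {
    if (expected.length != actual.length) {
      throw new IllegalArgumentException(
          "length mismatch: " + expected.length + " vs " + actual.length);
    }
    // Sort copies so the comparison does not depend on the order in which
    // each solver happens to emit its values; the inputs are left untouched.
    double[] e = expected.clone();
    double[] a = actual.clone();
    Arrays.sort(e);
    Arrays.sort(a);
    double max = 0.0;
    for (int i = 0; i < e.length; i++) {
      max = Math.max(max, Math.abs(e[i] - a[i]));
    }
    return max;
  }

  /** True when every pair of values differs by at most tol (absolute tolerance, an assumption). */
  static boolean agree(double[] expected, double[] actual, double tol) {
    return maxAbsDiff(expected, actual) <= tol;
  }

  public static void main(String[] args) {
    double[] reference = {5.0, 3.0, 1.0};              // made-up illustrative values
    double[] computed = {5.0 + 1e-12, 3.0 - 2e-12, 1.0};
    System.out.println(agree(reference, computed, 1e-10));
  }
}
```

An absolute tolerance is the simplest choice here; a relative one may be more 
appropriate when the singular values span many orders of magnitude.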

Dependency changes I had to make: 
* Hadoop 0.21 or CDH3, to support multiple outputs. 
* Local MR mode depends on commons-httpclient, so I included it in test scope 
only so that the test works. 
* Changed the commons-math dependency from 1.2 (?) to 2.1. The Mahout math 
module seems to depend on 2.1 too; it's not clear why it was not transitive for 
this one. 
* commons-math 1.2 seems to have depended on commons-cli, and 2.1 no longer 
brings it in transitively, but one of the classes in core requires it, so I 
added commons-cli to fix the build.

*Ted*, sorry I kind of polluted your issue here. Thank you for your 
encouragement and help. I probably should have opened another issue once it was 
clear this had diverged far enough, instead of continuing to put stuff here. 

This should be compatible with DistributedRowMatrix. I have not run a real 
distributed test yet, as I don't have a suitable data set, but perhaps 
somebody in the user community with an interest in the method could do it 
sooner than I get to it. I will run tests at moderate scale at some point, but 
I don't want to do it on my company's cluster yet and I don't exactly own a 
good one myself.

I did make rather mixed use of Mahout vector math and plain dense arrays, 
partly because I did not have enough time to study all the capabilities of the 
math module, and partly because I wanted explicit access to memory so I could 
control its reuse more efficiently across many iterations. This may or may not 
need to be rectified over time, but it seems to work pretty well as is.
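
To illustrate the kind of memory reuse meant here, below is a minimal sketch of 
an iteration loop that works on two preallocated dense arrays and allocates 
nothing inside the loop. It is only an illustration, not code from the patch: 
the class, the method names and the repeated matrix-vector product are all 
hypothetical stand-ins for whatever the actual iterations compute.

```java
/** Hypothetical illustration of buffer reuse with plain dense arrays; not from the patch. */
final class DenseBufferReuse {

  /** Computes out = a * x into a caller-supplied buffer; allocates nothing. */
  static void multiply(double[][] a, double[] x, double[] out) {
    for (int i = 0; i < a.length; i++) {
      double[] row = a[i];
      double sum = 0.0;
      for (int j = 0; j < row.length; j++) {
        sum += row[j] * x[j];
      }
      out[i] = sum;
    }
  }

  /**
   * Applies a square matrix to a vector the given number of times.
   * Only two buffers are allocated up front; each pass swaps their roles
   * instead of creating a new result array.
   */
  static double[] iterate(double[][] a, double[] start, int iterations) {
    double[] current = start.clone();           // keep the caller's array intact
    double[] next = new double[start.length];   // scratch buffer, reused every pass
    for (int k = 0; k < iterations; k++) {
      multiply(a, current, next);
      double[] tmp = current;                   // swap buffers rather than reallocate
      current = next;
      next = tmp;
    }
    return current;
  }

  public static void main(String[] args) {
    double[][] a = {{2.0, 0.0}, {0.0, 3.0}};
    double[] result = iterate(a, new double[]{1.0, 1.0}, 3);
    System.out.println(result[0] + " " + result[1]);
  }
}
```

The trade-off is the usual one: the caller has to manage buffer sizes and 
aliasing by hand, which a vector abstraction would otherwise hide.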

The patch is a git patch (so one needs to use patch -p1 instead of -p0). I know 
the standard is to use svn patches, but I had already used git to pull the 
trunk (as it happens, I prefer git in general too, since I can keep my own 
commit tree and branches for this work). 

If there's enough interest from the project in this contribution, I will 
support it and, if requested, I can port it to 0.20 if that's the target 
platform for 0.5, as well as make other Mahout-specific architectural tweaks. 
Please kindly let me know. 


Thank you.

  
> Implement Map-reduce version of stochastic SVD
> ----------------------------------------------
>
>                 Key: MAHOUT-376
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-376
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>             Fix For: 0.5
>
>         Attachments: MAHOUT-376.patch, Modified stochastic svd algorithm for 
> mapreduce.pdf, QR decomposition for Map.pdf, QR decomposition for Map.pdf, QR 
> decomposition for Map.pdf, sd-bib.bib, sd.pdf, sd.pdf, sd.pdf, sd.pdf, 
> sd.tex, sd.tex, sd.tex, sd.tex, SSVD working notes.pdf, SSVD working 
> notes.pdf, SSVD working notes.pdf, ssvd-CDH3-or-0.21.patch.gz, 
> ssvd-m1.patch.gz, ssvd-m2.patch.gz, ssvd-m3.patch.gz, Stochastic SVD using 
> eigensolver trick.pdf
>
>
> See attached pdf for outline of proposed method.
> All comments are welcome.
