If I had to guess, the mapper's reported time should be under 1 minute regardless of the input size on any __non-vm__ machine (unless it is an IBM XT :) even with -Xmx200m, which is the Hadoop default.
The reducer depends on the input size, but unless you manage to generate 1000 mappers, I don't think it will jump out of 1 min either.

Thanks.
-Dmitriy

On Sun, Dec 18, 2011 at 2:04 PM, Raphael Cendrillon <cendrillon1...@gmail.com> wrote:
> Thanks Dmitry. I tend to agree. Let's pull out the generic and just set it
> dense.
>
> Let me try out some larger data sets and see how it runs. Do you have any
> suggestions / expectations on performance that I should aim for? E.g. Given x
> nodes and a y by y matrix the job should take around z minutes?
>
> As a follow up, would it be worth starting work on the 'brute force' job for
> subtracting the average from each of the rows?
>
> On Dec 18, 2011, at 1:56 PM, "Dmitriy Lyubimov (Commented) (JIRA)"
> <j...@apache.org> wrote:
>
>>
>> [
>> https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171946#comment-13171946
>> ]
>>
>> Dmitriy Lyubimov commented on MAHOUT-923:
>> -----------------------------------------
>>
>> Raphael, thank you for seeing this thru.
>>
>> Q:
>> 1) -- why do you need vector class for the accumulator now? mean is kind of
>> expected to be dense in the end, if not in the mappers then at least in the
>> reducer for sure. And secondly, if you want to do this, why don't your api
>> would accept a class instance, not a "short" name? that would be consistent
>> with the Hadoop Job and file format apis which kind of take classes, not
>> strings.
>>
>> 2) -- I know you have a unit test, but did you test it on a simulated
>> input, like say 2G big? if not, i will have to test it before you proceed.
>>
>> As a next step, i guess i need to try it out to see if it works on various
>> kind of inputs.
>>
>>> Row mean job for PCA
>>> --------------------
>>>
>>>                 Key: MAHOUT-923
>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>>>             Project: Mahout
>>>          Issue Type: Improvement
>>>          Components: Math
>>>    Affects Versions: 0.6
>>>            Reporter: Raphael Cendrillon
>>>            Assignee: Raphael Cendrillon
>>>             Fix For: Backlog
>>>
>>>         Attachments: MAHOUT-923.patch, MAHOUT-923.patch, MAHOUT-923.patch
>>>
>>>
>>> Add map reduce job for calculating mean row (column-wise mean) of a
>>> Distributed Row Matrix for use in PCA.
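
For readers following the thread, here is a rough single-process sketch of the computation being discussed: each mapper keeps a dense per-column sum plus a row count for its split, and the reducer adds the partials and divides by the total row count. It is not the code from the MAHOUT-923 patch. The function names and the dict-based sparse row format are assumptions made up for illustration, and plain Python stands in for the Hadoop job. The last function shows the "brute force" follow-up of subtracting the mean from a row.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical row format: column index -> value (absent columns are zero).
SparseRow = Dict[int, float]


def map_partial_sum(rows: Iterable[SparseRow], num_cols: int) -> Tuple[List[float], int]:
    """Mapper side: dense column sums and row count for one input split."""
    sums = [0.0] * num_cols  # dense accumulator, as suggested in the thread
    count = 0
    for row in rows:
        for col, value in row.items():
            sums[col] += value
        count += 1
    return sums, count


def reduce_mean(partials: Iterable[Tuple[List[float], int]], num_cols: int) -> List[float]:
    """Reducer side: add the per-split partials, then divide by the total row count."""
    total = [0.0] * num_cols
    total_rows = 0
    for sums, count in partials:
        for col in range(num_cols):
            total[col] += sums[col]
        total_rows += count
    if total_rows == 0:
        raise ValueError("cannot take the mean of an empty matrix")
    return [s / total_rows for s in total]


def center_row(row: SparseRow, mean: List[float]) -> List[float]:
    """Brute-force centering: subtract the mean row; the result is dense."""
    return [row.get(col, 0.0) - mean[col] for col in range(len(mean))]


if __name__ == "__main__":
    split_a = [{0: 1.0, 2: 3.0}, {1: 2.0}]
    split_b = [{0: 5.0, 1: 4.0, 2: 3.0}]
    partials = [map_partial_sum(s, 3) for s in (split_a, split_b)]
    mean = reduce_mean(partials, 3)
    print(mean)                          # [2.0, 2.0, 2.0]
    print(center_row(split_a[0], mean))  # [-1.0, -2.0, 1.0]
```

Carrying the row count alongside the sums means the reducer never needs to know how the input was split, and it also shows why centering is the expensive part: a sparse row becomes dense once the mean is subtracted.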