If I had to guess, the time reported for the mapper should be under 1 minute
regardless of the input size on any __non-VM__ machine (unless it is an
IBM XT :), even with -Xmx200m, which is the Hadoop default.

The reducer time depends on the input size, but unless you manage to
generate 1000 mappers, I don't think it will go over 1 minute either.
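
To make the dense-accumulator point concrete, here is a minimal sketch in
plain Java (no Hadoop or Mahout classes) of the logic I would expect: each
mapper sums its rows column-wise into a dense array and counts them, and the
reducer merges the partial sums and divides by the total row count. The names
(RowMeanSketch, Partial, map, reduce, subtractMean) are made up for
illustration and are not what the MAHOUT-923 patch uses; subtractMean is only
a guess at what the 'brute force' centering step would do.

```java
import java.util.Arrays;
import java.util.List;

public class RowMeanSketch {

    /** Hypothetical partial result of one mapper: dense column sums plus rows seen. */
    public static final class Partial {
        public final double[] sums;
        public long rows;

        public Partial(int cols) {
            this.sums = new double[cols];
        }
    }

    /** Mapper side: accumulate the column sums of one input split into a dense array. */
    public static Partial map(double[][] split, int cols) {
        Partial p = new Partial(cols);
        for (double[] row : split) {
            for (int j = 0; j < cols; j++) {
                p.sums[j] += row[j];
            }
            p.rows++;
        }
        return p;
    }

    /** Reducer side: merge all partial sums, then divide by the total row count. */
    public static double[] reduce(List<Partial> partials, int cols) {
        double[] mean = new double[cols];
        long total = 0;
        for (Partial p : partials) {
            for (int j = 0; j < cols; j++) {
                mean[j] += p.sums[j];
            }
            total += p.rows;
        }
        if (total == 0) {
            return mean; // no rows at all: return zeros rather than NaN
        }
        for (int j = 0; j < cols; j++) {
            mean[j] /= total;
        }
        return mean;
    }

    /** Assumed 'brute force' centering: subtract the mean row from every row. */
    public static double[][] subtractMean(double[][] rows, double[] mean) {
        double[][] centered = new double[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            centered[i] = new double[mean.length];
            for (int j = 0; j < mean.length; j++) {
                centered[i][j] = rows[i][j] - mean[j];
            }
        }
        return centered;
    }

    public static void main(String[] args) {
        double[][] split1 = {{1, 2}, {3, 4}}; // rows seen by the first mapper
        double[][] split2 = {{5, 6}};         // rows seen by the second mapper
        double[] mean = reduce(Arrays.asList(map(split1, 2), map(split2, 2)), 2);
        System.out.println(Arrays.toString(mean)); // prints [3.0, 4.0]
    }
}
```

Note that each mapper emits a single dense vector no matter how many rows it
reads, so the reducer work grows with the number of mappers, not with the
number of rows.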

Thanks.
-Dmitriy

On Sun, Dec 18, 2011 at 2:04 PM, Raphael Cendrillon
<cendrillon1...@gmail.com> wrote:
> Thanks Dmitriy. I tend to agree. Let's pull out the generic and just set it 
> dense.
>
> Let me try out some larger data sets and see how it runs. Do you have any 
> suggestions / expectations on performance that I should aim for? E.g., given x 
> nodes and a y by y matrix, the job should take around z minutes?
>
> As a follow up, would it be worth starting work on the 'brute force' job for 
> subtracting the average from each of the rows?
>
> On Dec 18, 2011, at 1:56 PM, "Dmitriy Lyubimov (Commented) (JIRA)" 
> <j...@apache.org> wrote:
>
>>
>>    [ 
>> https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171946#comment-13171946
>>  ]
>>
>> Dmitriy Lyubimov commented on MAHOUT-923:
>> -----------------------------------------
>>
>> Raphael, thank you for seeing this through.
>>
>> Q:
>> 1) -- Why do you need a vector class for the accumulator now? The mean is kind of 
>> expected to be dense in the end, if not in the mappers then at least in the 
>> reducer for sure. And secondly, if you want to do this, why wouldn't your API 
>> accept a class instance rather than a "short" name? That would be consistent 
>> with the Hadoop Job and file format APIs, which take classes, not 
>> strings.
>>
>> 2) -- I know you have a unit test, but did you test it on a simulated 
>> input, say 2G big? If not, I will have to test it before you proceed.
>>
>> As a next step, I guess I need to try it out to see if it works on various 
>> kinds of input.
>>
>>> Row mean job for PCA
>>> --------------------
>>>
>>>                Key: MAHOUT-923
>>>                URL: https://issues.apache.org/jira/browse/MAHOUT-923
>>>            Project: Mahout
>>>         Issue Type: Improvement
>>>         Components: Math
>>>   Affects Versions: 0.6
>>>           Reporter: Raphael Cendrillon
>>>           Assignee: Raphael Cendrillon
>>>            Fix For: Backlog
>>>
>>>        Attachments: MAHOUT-923.patch, MAHOUT-923.patch, MAHOUT-923.patch
>>>
>>>
>>> Add map reduce job for calculating mean row (column-wise mean) of a 
>>> Distributed Row Matrix for use in PCA.
>>
