Re: [jira] [Commented] (MAHOUT-923) Row mean job for PCA

Raphael Cendrillon Sun, 18 Dec 2011 14:42:06 -0800

Sure. GitHub is actually much easier for me. Generating patches while working 
on multiple JIRAs gets messy :)


On Dec 18, 2011, at 2:25 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> PS: if it is not terribly difficult, could you post your patch on
> GitHub (with the complete Mahout history based on
> git.apache.org/mahout)? That would be awesome.
> 
> Then we can merge it more easily in case it gets out of sync with the
> trunk HEAD.
> 
> Thank you for doing this.
> 
> 
> On Sun, Dec 18, 2011 at 2:24 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>> If I had to guess, the mapper's reported time should be under 1 minute
>> regardless of the input size on any __non-vm__ machine (unless it is an
>> IBM XT :) even with -Xmx200m, which is the Hadoop default.
>> 
>> The reducer depends on the input size, but unless you manage to
>> generate 1000 mappers, I don't think it will exceed 1 minute either.
>> 
>> Thanks.
>> -Dmitriy
>> 
>> On Sun, Dec 18, 2011 at 2:04 PM, Raphael Cendrillon
>> <cendrillon1...@gmail.com> wrote:
>>> Thanks Dmitriy. I tend to agree. Let's pull out the generic vector class 
>>> and just make the accumulator dense.
>>> 
>>> Let me try out some larger data sets and see how it runs. Do you have any 
>>> suggestions / expectations on performance that I should aim for? E.g., given 
>>> x nodes and a y-by-y matrix, the job should take around z minutes?
>>> 
>>> As a follow-up, would it be worth starting work on the 'brute force' job 
>>> for subtracting the average from each of the rows?
>>> 
>>> On Dec 18, 2011, at 1:56 PM, "Dmitriy Lyubimov (Commented) (JIRA)" 
>>> <j...@apache.org> wrote:
>>> 
>>>> 
>>>>    [ 
>>>> https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171946#comment-13171946
>>>>  ]
>>>> 
>>>> Dmitriy Lyubimov commented on MAHOUT-923:
>>>> -----------------------------------------
>>>> 
>>>> Raphael, thank you for seeing this through.
>>>> 
>>>> Q:
>>>> 1) -- Why do you need a vector class for the accumulator now? The mean is 
>>>> expected to be dense in the end, if not in the mappers then at least in 
>>>> the reducer for sure. And secondly, if you want to do this, why wouldn't 
>>>> your API accept a class instance rather than a "short" name? That would be 
>>>> consistent with the Hadoop Job and file format APIs, which take classes, 
>>>> not strings.
>>>> 
>>>> 2) -- I know you have a unit test, but did you test it on a simulated 
>>>> input, say, 2 GB in size? If not, I will have to test it before you proceed.
>>>> 
>>>> As a next step, I guess I need to try it out to see if it works on various 
>>>> kinds of inputs.
>>>> 
>>>>> Row mean job for PCA
>>>>> --------------------
>>>>> 
>>>>>                Key: MAHOUT-923
>>>>>                URL: https://issues.apache.org/jira/browse/MAHOUT-923
>>>>>            Project: Mahout
>>>>>         Issue Type: Improvement
>>>>>         Components: Math
>>>>>   Affects Versions: 0.6
>>>>>           Reporter: Raphael Cendrillon
>>>>>           Assignee: Raphael Cendrillon
>>>>>            Fix For: Backlog
>>>>> 
>>>>>        Attachments: MAHOUT-923.patch, MAHOUT-923.patch, MAHOUT-923.patch
>>>>> 
>>>>> 
>>>>> Add map reduce job for calculating mean row (column-wise mean) of a 
>>>>> Distributed Row Matrix for use in PCA.
>>>> 
>>>>
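
For reference, the computation discussed in this thread (a column-wise mean of a row matrix, followed by the 'brute force' subtraction of that mean from every row) can be sketched in a few lines. This is only a minimal single-process sketch in Python and is not the code from the MAHOUT-923 patch. The function names map_partial, reduce_mean and subtract_mean are hypothetical, rows are assumed to be plain dense lists of floats of equal length, and each "split" stands in for the input of one mapper. The accumulator is kept dense, as suggested above.

```python
from typing import Iterable, List, Tuple

Row = List[float]
Partial = Tuple[List[float], int]  # (dense column-wise sum, number of rows seen)


def map_partial(rows: Iterable[Row]) -> Partial:
    """Mapper-side step (hypothetical name): sum the rows of one split."""
    sums: List[float] = []
    count = 0
    for row in rows:
        if not sums:
            # Size the dense accumulator from the first row.
            sums = [0.0] * len(row)
        for j, value in enumerate(row):
            sums[j] += value
        count += 1
    return sums, count


def reduce_mean(partials: Iterable[Partial]) -> List[float]:
    """Reducer-side step (hypothetical name): combine partial sums into the mean row."""
    total: List[float] = []
    n = 0
    for sums, count in partials:
        if count == 0:
            continue  # an empty split contributes nothing
        if not total:
            total = [0.0] * len(sums)
        for j, s in enumerate(sums):
            total[j] += s
        n += count
    if n == 0:
        raise ValueError("cannot compute the mean of zero rows")
    return [s / n for s in total]


def subtract_mean(rows: Iterable[Row], mean: Row) -> List[Row]:
    """'Brute force' centering: subtract the mean row from every row."""
    return [[v - m for v, m in zip(row, mean)] for row in rows]


if __name__ == "__main__":
    # Two splits, standing in for the inputs of two mappers.
    splits = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
    mean = reduce_mean(map_partial(split) for split in splits)
    print(mean)
    print(subtract_mean([row for split in splits for row in split], mean))
```

Each mapper emits only one dense vector plus a count, so the reducer's work grows with the number of mappers rather than with the number of input rows, which matches the expectation above that reducer time depends mainly on how many mappers are generated.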
