Thanks Dmitriy. I think I understand more clearly now. Are you saying I should 
make a map-only job and then use some post-processing to manually combine 
the map outputs?

How many rows should I process per map job?
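
For illustration, here is a minimal sketch of the idea from the quoted reply below: each map task emits one record holding the row count n and the column-wise sum of the rows it saw, and the front end merges those records and divides by the total n. The class name PartialRowSum and its methods are hypothetical. To stay self-contained it uses a plain double[] instead of a Mahout Vector and only mirrors the write/readFields shape of Hadoop's Writable interface on java.io.DataInput/DataOutput rather than implementing it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

/** Hypothetical record: number of rows seen plus the column-wise sum of those rows. */
class PartialRowSum {
    int n;
    double[] sum;

    PartialRowSum() { this(0, new double[0]); }

    PartialRowSum(int n, double[] sum) { this.n = n; this.sum = sum; }

    // Same shape as Writable.write: only values are written, no class names.
    void write(DataOutput out) throws IOException {
        out.writeInt(n);
        out.writeInt(sum.length);
        for (double v : sum) out.writeDouble(v);
    }

    // Same shape as Writable.readFields: the reader already knows the types.
    void readFields(DataInput in) throws IOException {
        n = in.readInt();
        sum = new double[in.readInt()];
        for (int i = 0; i < sum.length; i++) sum[i] = in.readDouble();
    }

    // Post-processing step: fold another map task's partial result into this one.
    void add(PartialRowSum other) {
        if (other.n == 0) return;
        if (sum.length == 0) sum = new double[other.sum.length];
        if (sum.length != other.sum.length) {
            throw new IllegalArgumentException("column count mismatch");
        }
        n += other.n;
        for (int i = 0; i < sum.length; i++) sum[i] += other.sum[i];
    }

    // Mean row (column-wise mean) over all rows merged so far.
    double[] mean() {
        double[] m = new double[sum.length];
        for (int i = 0; i < sum.length; i++) m[i] = sum[i] / n;
        return m;
    }

    // Serialize and deserialize in memory, as a stand-in for writing map output.
    static PartialRowSum roundTrip(PartialRowSum p) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            p.write(new DataOutputStream(bytes));
            PartialRowSum copy = new PartialRowSum();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Two map tasks: one saw rows (1,1) and (1,3), the other saw row (4,2).
        PartialRowSum total = new PartialRowSum();
        total.add(roundTrip(new PartialRowSum(2, new double[] {2.0, 4.0})));
        total.add(roundTrip(new PartialRowSum(1, new double[] {4.0, 2.0})));
        System.out.println(Arrays.toString(total.mean())); // prints [2.0, 2.0]
    }
}
```

Because only one small record leaves each map task, the merge is cheap enough to run in the front end, which is why no sort or reduce phase is needed in this sketch.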

On Dec 12, 2011, at 12:13 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

>> A combiner is definitely the next step.
> 
> It is definitely not. Why do you need to sort???
> 
>> One question, is there already a writable for tuples of e.g. int and Vector, 
>> or should I just write one from scratch?
> 
> From scratch.
> 
> Or you can save n as the first element of the vector, why not. Your front
> end code would know how to re-shuffle that.
> But if not that, then a custom writable. TupleWritable saves the class
> with the value. That's exactly why they invented writables instead of
> using Java serialization: you must not save the type with each value.
> 
> -d
> 
> 
> On Sun, Dec 11, 2011 at 8:14 PM, Raphael Cendrillon (Commented) (JIRA)
> <j...@apache.org> wrote:
>> 
>>    [ 
>> https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167341#comment-13167341
>>  ]
>> 
>> Raphael Cendrillon commented on MAHOUT-923:
>> -------------------------------------------
>> 
>> Thanks Lance. A combiner is definitely the next step. One question, is there 
>> already a writable for tuples of e.g. int and Vector, or should I just write 
>> one from scratch? I know there is TupleWritable, but from what I've read 
>> online it's better to avoid that unless you're doing a multiple input join.
>> 
>> Regarding the class for the output vector, are you saying that instead of 
>> inheriting the class from the rows of the DistributedRowMatrix you'd rather 
>> be able to specify this manually?
>> 
>> 
>> 
>>> Row mean job for PCA
>>> --------------------
>>> 
>>>                 Key: MAHOUT-923
>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>>>             Project: Mahout
>>>          Issue Type: Improvement
>>>          Components: Math
>>>    Affects Versions: 0.6
>>>            Reporter: Raphael Cendrillon
>>>            Assignee: Raphael Cendrillon
>>>             Fix For: Backlog
>>> 
>>>         Attachments: MAHOUT-923.patch
>>> 
>>> 
>>> Add map reduce job for calculating mean row (column-wise mean) of a 
>>> Distributed Row Matrix for use in PCA.
>> 
>> --
>> This message is automatically generated by JIRA.
>> If you think it was sent incorrectly, please contact your JIRA 
>> administrators: 
>> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>> 
>> 
