Thanks Lance. That makes a lot of sense.

You're right regarding the need for combiners. What's the best way to create an 
Int + Vector writable pair? Should I just define one from scratch, or is there 
an existing class in Mahout that I should reuse?
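
For concreteness, here's a minimal sketch of what such a pair writable could 
look like. IntVectorTupleWritable is just the working name from the review 
discussion, not an existing Mahout class, so treat this as an illustration 
rather than a finished patch:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

/** Hypothetical pair writable: a row count plus the running column-wise sum. */
public class IntVectorTupleWritable implements Writable {

  private int count;
  private final VectorWritable vector = new VectorWritable();

  public IntVectorTupleWritable() {
    // no-arg constructor required by Hadoop serialization
  }

  public IntVectorTupleWritable(int count, Vector sum) {
    this.count = count;
    this.vector.set(sum);
  }

  public int getCount() {
    return count;
  }

  public Vector getVector() {
    return vector.get();
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(count);
    vector.write(out);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    count = in.readInt();
    vector.readFields(in);
  }
}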

Thanks again! 

On Dec 11, 2011, at 11:59 PM, Lance Norskog <goks...@gmail.com> wrote:

> NullWritable can be used both as the key between mapper and reducer, and as
> the first element (the key) of the pairs saved in a SequenceFile. As the
> mapper->reducer key, it works.
> 
> In Mahout, SequenceFile vectors and matrices are stored as
> <IntWritable,VectorWritable> pairs. Even though this job is in the
> middle of another job, it should follow the convention.
> 
> You do need to use only one reducer, and so combiners may be worthwhile.
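> 
> For what it's worth, here is a sketch of the driver wiring under that
> assumption. The class names (MeanMapper, MeanCombiner, MeanReducer,
> IntVectorTupleWritable) are only working names from this thread, not
> existing Mahout classes:
> 
> // imports assumed: org.apache.hadoop.conf.Configuration, org.apache.hadoop.io.*,
> // org.apache.hadoop.mapreduce.Job, org.apache.mahout.math.VectorWritable
> Job job = new Job(new Configuration(), "MatrixRowMeanJob");
> job.setMapperClass(MeanMapper.class);
> job.setCombinerClass(MeanCombiner.class);
> job.setReducerClass(MeanReducer.class);
> job.setNumReduceTasks(1);                          // all partial sums meet in a single reducer
> job.setMapOutputKeyClass(NullWritable.class);      // one key, so everything lands together
> job.setMapOutputValueClass(IntVectorTupleWritable.class);
> job.setOutputKeyClass(IntWritable.class);          // final <IntWritable, VectorWritable> pairs per the convention
> job.setOutputValueClass(VectorWritable.class);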
> 
> The person using this job knows the right vector to use. It may be
> that it gets a lot of sparse vectors but will become a dense vector.
> Or a vector that writes to a database. Or something else. In fact, I
> may just want to turn a vector from Dense to Sparse, and I could
> achieve that with this job.
> 
> On Sun, Dec 11, 2011 at 7:58 PM, Raphael Cendrillon
> <cendrillon1...@gmail.com> wrote:
>> 
>> 
>>> On 2011-12-12 02:10:01, Dmitriy Lyubimov wrote:
>>>> Hm. I hope I did not misread the code or miss something.
>>>> 
>>>> 1 -- I am not sure this will actually work as intended unless the # of 
>>>> reducers is coerced to 1, of which I see no mention in the code.
>>>> 2 -- Mappers do nothing, passing all the row pressure on to the sort, which 
>>>> is absolutely not necessary even if you use combiners. This is going to be 
>>>> especially the case if you coerce 1 reducer and no combiners. IMO the mean 
>>>> computation should be pushed up to the mappers to avoid the sort pressure 
>>>> of map-reduce. Then the reduction becomes largely symbolic (but you do need 
>>>> to pass the # of rows each mapper has seen on to the reducer, in order for 
>>>> that operation to apply correctly).
>>>> 3 -- I am not sure -- is NullWritable legit as a key? In my experience the 
>>>> sequence file reader cannot instantiate it, because NullWritable is a 
>>>> singleton and its creation is prohibited by making the constructor private.
>> 
>> Thanks Dmitriy.
>> 
>> Regarding 1, if I understand correctly the number of reducers depends on the 
>> number of unique keys. Since all keys are set to the same value (null), then 
>> all of the mapper outputs should arrive at the same reducer. This seems to 
>> work in the unit test, but I may be missing something?
>> 
>> Regarding 2, that makes a lot of sense. I'm wondering how many rows should be 
>> processed per mapper?  I guess there is a trade-off between scalability 
>> (processing more rows within a single map task means that each row must have 
>> fewer columns) and speed?  Is there someplace in the SSVD code where the 
>> matrix is split into slices of rows that I could use as a reference?
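>> 
>> For reference, here is a rough sketch of the mapper-side accumulation Dmitriy 
>> describes, assuming the IntVectorTupleWritable (row count plus partial sum) 
>> mentioned in the summary; all names are illustrative only:
>> 
>> // imports assumed: java.io.IOException, org.apache.hadoop.io.*,
>> // org.apache.hadoop.mapreduce.Mapper, org.apache.mahout.math.Vector,
>> // org.apache.mahout.math.VectorWritable
>> public static class MeanMapper
>>     extends Mapper<IntWritable, VectorWritable, NullWritable, IntVectorTupleWritable> {
>> 
>>   private Vector partialSum;
>>   private int rowCount;
>> 
>>   @Override
>>   protected void map(IntWritable rowIndex, VectorWritable row, Context ctx) {
>>     // accumulate the column-wise sum locally instead of shuffling every row
>>     partialSum = partialSum == null ? row.get().clone() : partialSum.plus(row.get());
>>     rowCount++;
>>   }
>> 
>>   @Override
>>   protected void cleanup(Context ctx) throws IOException, InterruptedException {
>>     if (rowCount > 0) {
>>       // one record per mapper: the row count plus the partial sum, so the
>>       // reducer can weight each partial correctly
>>       ctx.write(NullWritable.get(), new IntVectorTupleWritable(rowCount, partialSum));
>>     }
>>   }
>> }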
>> 
>> Regarding 3, I believe NullWritable is OK. It's used pretty extensively in 
>> TimesSquaredJob in DistributedRowMatrix. However, if you feel there is some 
>> disadvantage to this I could replace "NullWritable.get()" with "new 
>> IntWritable(1)" (that is, set all of the keys to 1). Would that be more 
>> suitable?
>> 
>> 
>> - Raphael
>> 
>> 
>> -----------------------------------------------------------
>> This is an automatically generated e-mail. To reply, visit:
>> https://reviews.apache.org/r/3147/#review3838
>> -----------------------------------------------------------
>> 
>> 
>> On 2011-12-12 00:30:24, Raphael Cendrillon wrote:
>>> 
>>> -----------------------------------------------------------
>>> This is an automatically generated e-mail. To reply, visit:
>>> https://reviews.apache.org/r/3147/
>>> -----------------------------------------------------------
>>> 
>>> (Updated 2011-12-12 00:30:24)
>>> 
>>> 
>>> Review request for mahout.
>>> 
>>> 
>>> Summary
>>> -------
>>> 
>>> Here's a patch with a simple job to calculate the row mean (column-wise 
>>> mean). One outstanding issue is the combiner; this requires a writable 
>>> class, IntVectorTupleWritable, where the Int stores the number of rows and 
>>> the Vector stores the column-wise sum.
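>>> 
>>> A rough sketch of what that combiner could look like (assuming the 
>>> IntVectorTupleWritable described above exposes the count and the vector; 
>>> names are illustrative only):
>>> 
>>> // imports assumed: java.io.IOException, org.apache.hadoop.io.NullWritable,
>>> // org.apache.hadoop.mapreduce.Reducer, org.apache.mahout.math.Vector
>>> public static class MeanCombiner
>>>     extends Reducer<NullWritable, IntVectorTupleWritable, NullWritable, IntVectorTupleWritable> {
>>> 
>>>   @Override
>>>   protected void reduce(NullWritable key, Iterable<IntVectorTupleWritable> partials, Context ctx)
>>>       throws IOException, InterruptedException {
>>>     int rows = 0;
>>>     Vector sum = null;
>>>     for (IntVectorTupleWritable p : partials) {
>>>       rows += p.getCount();
>>>       sum = sum == null ? p.getVector().clone() : sum.plus(p.getVector());
>>>     }
>>>     if (sum != null) {
>>>       // merge the per-mapper partial sums; the single reducer divides the
>>>       // final sum by the total row count
>>>       ctx.write(key, new IntVectorTupleWritable(rows, sum));
>>>     }
>>>   }
>>> }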
>>> 
>>> 
>>> This addresses bug MAHOUT-923.
>>>    https://issues.apache.org/jira/browse/MAHOUT-923
>>> 
>>> 
>>> Diffs
>>> -----
>>> 
>>>   /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213095
>>>   /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION
>>>   /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213095
>>> 
>>> Diff: https://reviews.apache.org/r/3147/diff
>>> 
>>> 
>>> Testing
>>> -------
>>> 
>>> JUnit test
>>> 
>>> 
>>> Thanks,
>>> 
>>> Raphael
>>> 
>>> 
>> 
> 
> 
> 
> -- 
> Lance Norskog
> goks...@gmail.com
