Thanks Lance. That makes a lot of sense. You're right regarding the need for combiners. What's the best way to create an Int + Vector writable pair? Should I just define one from scratch or is there some framework already in Mahout I should reuse?
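In case it helps make the question concrete, here's roughly the kind of thing I had in mind (the name IntVectorTupleWritable and its layout are just a sketch of the class mentioned in the review request, not something that exists in Mahout yet):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

/**
 * Sketch only: pairs a row count with a column-wise sum so that a
 * combiner/reducer can merge partial sums and finally divide by the
 * total number of rows. Name and layout are placeholders.
 */
public class IntVectorTupleWritable implements Writable {

  private int rowCount;
  private final VectorWritable sum = new VectorWritable();

  public IntVectorTupleWritable() {
  }

  public IntVectorTupleWritable(int rowCount, Vector columnSum) {
    this.rowCount = rowCount;
    this.sum.set(columnSum);
  }

  public int getRowCount() {
    return rowCount;
  }

  public Vector getColumnSum() {
    return sum.get();
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(rowCount);   // number of rows accumulated so far
    sum.write(out);           // running column-wise sum
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    rowCount = in.readInt();
    sum.readFields(in);
  }
}

If something along these lines looks reasonable I'll flesh it out, otherwise I'm happy to reuse whatever the preferred pattern is. I've also put a rough sketch of the mapper-side summation (Dmitriy's point 2) at the bottom of this mail, below the quoted thread.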
Thanks again!

On Dec 11, 2011, at 11:59 PM, Lance Norskog <goks...@gmail.com> wrote:

> There is NullWritable as the key between mapper and reducer, and as the
> first element of the pairs saved in a SequenceFile. As the mapper->reducer
> key, it works.
>
> In Mahout, SequenceFile vectors and matrices are stored as
> <IntWritable, VectorWritable> pairs. Even though this job is in the middle
> of another job, it should follow the convention.
>
> You do need to use only one reducer, and so combiners may be worthwhile.
>
> The person using this job knows the right vector to use. It may be that it
> gets a lot of sparse vectors but will become a dense vector. Or a vector
> that writes to a database. Or something else. In fact, I may just want to
> turn a vector from dense to sparse, and I could achieve that with this job.
>
> On Sun, Dec 11, 2011 at 7:58 PM, Raphael Cendrillon
> <cendrillon1...@gmail.com> wrote:
>>
>>> On 2011-12-12 02:10:01, Dmitriy Lyubimov wrote:
>>>> Hm. I hope I did not misread the code or miss something.
>>>>
>>>> 1 -- I am not sure this will actually work as intended unless the # of
>>>> reducers is coerced to 1, of which I see no mention in the code.
>>>> 2 -- Mappers do nothing, passing all the row pressure on to the sort,
>>>> which is absolutely not necessary, even if you use combiners. This is
>>>> going to be especially the case if you coerce 1 reducer and no combiners.
>>>> IMO the mean computation should be pushed up to the mappers to avoid the
>>>> sort pressures of map reduce. Then the reduction becomes largely symbolic
>>>> (but you do need to pass the # of rows each mapper has seen on to the
>>>> reducer, in order for that operation to apply correctly).
>>>> 3 -- I am not sure -- is NullWritable legit as a key? In my experience the
>>>> sequence file reader cannot instantiate it, because NullWritable is a
>>>> singleton and its creation is prohibited by making the constructor
>>>> private.
>>
>> Thanks Dmitriy.
>>
>> Regarding 1, if I understand correctly the number of reducers depends on
>> the number of unique keys. Since all keys are set to the same value (null),
>> all of the mapper outputs should arrive at the same reducer. This seems to
>> work in the unit test, but I may be missing something?
>>
>> Regarding 2, that makes a lot of sense. I'm wondering how many rows should
>> be processed per mapper? I guess there is a trade-off between scalability
>> (processing more rows within a single map job means that each row must have
>> fewer columns) and speed. Is there someplace in the SSVD code where the
>> matrix is split into slices of rows that I could use as a reference?
>>
>> Regarding 3, I believe NullWritable is OK. It's used pretty extensively in
>> TimesSquaredJob in DistributedRowMatrix. However, if you feel there is some
>> disadvantage to this, I could replace "NullWritable.get()" with "new
>> IntWritable(1)" (that is, set all of the keys to 1). Would that be more
>> suitable?
>>
>>
>> - Raphael
>>
>>
>> -----------------------------------------------------------
>> This is an automatically generated e-mail. To reply, visit:
>> https://reviews.apache.org/r/3147/#review3838
>> -----------------------------------------------------------
>>
>>
>> On 2011-12-12 00:30:24, Raphael Cendrillon wrote:
>>>
>>> -----------------------------------------------------------
>>> This is an automatically generated e-mail. To reply, visit:
>>> https://reviews.apache.org/r/3147/
>>> -----------------------------------------------------------
>>>
>>> (Updated 2011-12-12 00:30:24)
>>>
>>>
>>> Review request for mahout.
>>>
>>>
>>> Summary
>>> -------
>>>
>>> Here's a patch with a simple job to calculate the row mean (column-wise
>>> mean). One outstanding issue is the combiner; this requires a writable
>>> class, IntVectorTupleWritable, where the Int stores the number of rows and
>>> the Vector stores the column-wise sum.
>>>
>>>
>>> This addresses bug MAHOUT-923.
>>>     https://issues.apache.org/jira/browse/MAHOUT-923
>>>
>>>
>>> Diffs
>>> -----
>>>
>>>   /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213095
>>>   /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION
>>>   /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213095
>>>
>>> Diff: https://reviews.apache.org/r/3147/diff
>>>
>>>
>>> Testing
>>> -------
>>>
>>> JUnit test
>>>
>>>
>>> Thanks,
>>>
>>> Raphael
>>>
>>
>
> --
> Lance Norskog
> goks...@gmail.com
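P.S. To make sure I'm reading Dmitriy's point 2 correctly, here is roughly the flow I now have in mind: each mapper accumulates a local column-wise sum and row count, emits a single (count, sum) record from cleanup(), and a single reducer merges those records and divides by the total row count. The class names below (RowMeanMapper, RowMeanReducer, and the IntVectorTupleWritable sketched above) are placeholders, not existing Mahout code.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

/**
 * Sketch only: keeps a per-mapper column-wise sum and row count, and emits a
 * single (rowCount, sum) record from cleanup(), so almost nothing goes
 * through the sort.
 */
public class RowMeanMapper
    extends Mapper<IntWritable, VectorWritable, IntWritable, IntVectorTupleWritable> {

  private static final IntWritable ONLY_KEY = new IntWritable(1);

  private Vector runningSum;
  private int rowCount;

  @Override
  protected void map(IntWritable rowIndex, VectorWritable row, Context ctx) {
    // Hadoop reuses the VectorWritable instance, so clone the first row.
    runningSum = runningSum == null ? row.get().clone() : runningSum.plus(row.get());
    rowCount++;
  }

  @Override
  protected void cleanup(Context ctx) throws IOException, InterruptedException {
    if (runningSum != null) {
      // One record per map task: (#rows seen, column-wise partial sum).
      ctx.write(ONLY_KEY, new IntVectorTupleWritable(rowCount, runningSum));
    }
  }
}

And the reducer side, assuming the number of reducers is forced to 1:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

/**
 * Sketch only: merges the per-mapper partial sums and divides by the total
 * number of rows to produce the mean row.
 */
public class RowMeanReducer
    extends Reducer<IntWritable, IntVectorTupleWritable, IntWritable, VectorWritable> {

  @Override
  protected void reduce(IntWritable key, Iterable<IntVectorTupleWritable> partials, Context ctx)
      throws IOException, InterruptedException {
    Vector totalSum = null;
    int totalRows = 0;
    for (IntVectorTupleWritable partial : partials) {
      totalRows += partial.getRowCount();
      // Clone the first partial because Hadoop reuses the value instance.
      totalSum = totalSum == null
          ? partial.getColumnSum().clone()
          : totalSum.plus(partial.getColumnSum());
    }
    if (totalSum != null) {
      ctx.write(key, new VectorWritable(totalSum.divide(totalRows)));
    }
  }
}

With only one record emitted per map task, a combiner becomes largely redundant, although the same merging logic could be dropped in as one if it turns out to help. Does this match what you had in mind?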