-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3092/
-----------------------------------------------------------
(Updated 2011-12-16 19:09:13.382909)
Review request for mahout, Ted Dunning, lancenorskog, and Grant Ingersoll.
Changes
-------
Modified to accept any Writable as the value (instead of just VectorWritable).
This still requires the generic class PairWritable to be extended for each
value class of interest, so that the extended class can be passed into
setMapOutputValueClass(). I'm not sure this is the best approach; any
suggestions would be appreciated!
Summary
-------
Early support for randomizing input in the SplitInput class. This is only a
first pass, but I've posted it to check whether I'm on the right track. A couple
of comments:
- Currently the code scans the entire file looking for the line corresponding
to each random index. This scan has to be repeated for every line, which is
slow and somewhat ugly.
- The permutation indices are stored in an array. This could lead to scaling
issues if the number of input lines is large. The same problem may also exist
with ridx in the existing code. One option is to use a linear feedback shift
register to generate the permutation sequence on the fly.
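To make the linear feedback shift register idea concrete, here is a minimal
sketch (not part of the posted diff). It assumes a Galois LFSR driven by a fixed
table of maximal-length feedback masks for widths 2 to 24 bits. The class name
LfsrPermutation, the seed parameter, and the 24-bit limit are hypothetical
choices made for illustration only. A maximal-length LFSR of width k steps
through every value in 1..2^k-1 exactly once per cycle, so skipping values above
n leaves a permutation of 0..n-1 while keeping a single int of state.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Sketch: visits every index in [0, n) exactly once using O(1) memory. */
class LfsrPermutation implements Iterator<Integer> {
  // Assumed table of Galois feedback masks for maximal-length LFSRs,
  // one per register width from 2 to 24 bits (array index = width - 2).
  private static final int[] MASKS = {
      0x3, 0x6, 0xC, 0x14, 0x30, 0x60, 0xB8, 0x110, 0x240, 0x500, 0xE08,
      0x1C80, 0x3802, 0x6000, 0xD008, 0x12000, 0x20400, 0x72000, 0x90000,
      0x140000, 0x300000, 0x420000, 0xE10000
  };
  private static final int MAX_N = (1 << 24) - 1;

  private final int n;     // number of indices to permute
  private final int mask;  // feedback mask for the chosen width
  private int state;       // current register value, always in 1..2^width-1
  private int emitted;     // how many indices have been returned so far

  LfsrPermutation(int n, int seed) {
    if (n < 0 || n > MAX_N) {
      throw new IllegalArgumentException("n must be in [0, " + MAX_N + "]");
    }
    // Smallest width whose cycle (2^width - 1 values) covers 1..n.
    int width = 2;
    while ((1 << width) - 1 < n) {
      width++;
    }
    this.n = n;
    this.mask = MASKS[width - 2];
    // Map the seed to a non-zero start state; zero would lock the register.
    this.state = Math.floorMod(seed, (1 << width) - 1) + 1;
  }

  @Override
  public boolean hasNext() {
    return emitted < n;
  }

  @Override
  public Integer next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    // Step the register until it lands on a value inside 1..n.
    do {
      int lsb = state & 1;
      state >>>= 1;
      if (lsb != 0) {
        state ^= mask;
      }
    } while (state > n);
    emitted++;
    return state - 1;  // shift 1..n down to 0..n-1
  }
}
```

Iterating with new LfsrPermutation(numLines, seed) would then yield each line
index once without materialising the index array. Note that the seed only
changes where the fixed cycle starts, so the order is scrambled rather than
uniformly random; whether that is good enough for SplitInput would need
checking.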
Any suggestions would be very welcome!
This addresses bug MAHOUT-904.
https://issues.apache.org/jira/browse/MAHOUT-904
Diffs (updated)
-----
/trunk/integration/src/main/java/org/apache/mahout/utils/RandomPermuteJob.java
PRE-CREATION
/trunk/integration/src/main/java/org/apache/mahout/utils/PairWritable.java
PRE-CREATION
/trunk/integration/src/main/java/org/apache/mahout/utils/IntVectorWritable.java
PRE-CREATION
/trunk/integration/src/test/java/org/apache/mahout/utils/TestRandomPermuteJob.java
PRE-CREATION
Diff: https://reviews.apache.org/r/3092/diff
Testing
-------
Thanks,
Raphael