DistributedRowMatrix should live in SequenceFile<Writable,VectorWritable> 
instead of SequenceFile<IntWritable,VectorWritable>
-----------------------------------------------------------------------------------------------------------------------------

                 Key: MAHOUT-322
                 URL: https://issues.apache.org/jira/browse/MAHOUT-322
             Project: Mahout
          Issue Type: Improvement
          Components: Math
    Affects Versions: 0.3
            Reporter: Danny Leshem
            Priority: Minor
             Fix For: 0.3


The class documentation for org.apache.mahout.math.hadoop.DistributedRowMatrix 
states that the matrix lives in SequenceFile<WritableComparable, 
VectorWritable>. The implementation, however, assumes that a 
SequenceFile<IntWritable, VectorWritable> is passed.

Currently, usage of this class inside Mahout is limited to Jake Mannix's SVD 
package, mainly to perform PCA on a massive document corpus. Given such a corpus, 
it makes sense not to limit the user by forcing the document "key" to be an 
integer. Instead, users should be able to use Text keys (document name or id) 
or keys made of any other arbitrary class. One may even argue that forcing a 
WritableComparable key is too limiting, and a simple Writable key should be 
assumed.

In fact, it would be best if DistributedRowMatrix did not read the SequenceFile 
key at all, to allow user-specific classes (unknown to Mahout) to be used as 
opaque keys even when their libraries are not available at runtime. 
Currently DistributedRowMatrix calls "reader.next(i, v)", but the reader has 
methods to query just the value, avoiding key deserialization altogether.
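
To illustrate the "opaque key" idea, here is a minimal sketch in plain Java. 
It does not use Hadoop or Mahout, and the record layout is a toy 
length-prefixed format, not the real SequenceFile layout. The class and method 
names (OpaqueKeyRows, encodeRow, concat, readValuesOnly) are hypothetical and 
exist only for this example. The point is that a reader which knows the key 
length can step over the key bytes and decode only the row values, so the key 
class never needs to be on the classpath.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Toy record format (an assumption for this sketch, NOT SequenceFile):
 *   [int keyLength][key bytes][int valueCount][double values...]
 */
final class OpaqueKeyRows {

  /** Encodes one row; the key is stored as uninterpreted bytes. */
  static byte[] encodeRow(byte[] key, double[] row) {
    ByteBuffer buf = ByteBuffer.allocate(4 + key.length + 4 + 8 * row.length);
    buf.putInt(key.length).put(key).putInt(row.length);
    for (double d : row) {
      buf.putDouble(d);
    }
    return buf.array();
  }

  /** Concatenates encoded rows into one byte stream. */
  static byte[] concat(byte[]... parts) {
    int total = 0;
    for (byte[] p : parts) {
      total += p.length;
    }
    ByteBuffer buf = ByteBuffer.allocate(total);
    for (byte[] p : parts) {
      buf.put(p);
    }
    return buf.array();
  }

  /** Decodes every row's values, skipping each key without parsing it. */
  static List<double[]> readValuesOnly(byte[] data) {
    ByteBuffer buf = ByteBuffer.wrap(data);
    List<double[]> rows = new ArrayList<>();
    while (buf.hasRemaining()) {
      int keyLength = buf.getInt();
      // Step over the key bytes; no key class is needed to do this.
      buf.position(buf.position() + keyLength);
      int count = buf.getInt();
      double[] row = new double[count];
      for (int i = 0; i < count; i++) {
        row[i] = buf.getDouble();
      }
      rows.add(row);
    }
    return rows;
  }

  public static void main(String[] args) {
    // One text-like key and one arbitrary binary key, mixed in one stream.
    byte[] a = encodeRow("doc-a".getBytes(StandardCharsets.UTF_8), new double[] {1.0, 2.0});
    byte[] b = encodeRow(new byte[] {0, 7, 42}, new double[] {3.5});
    for (double[] row : readValuesOnly(concat(a, b))) {
      System.out.println(Arrays.toString(row));
    }
  }
}
```

Whether the same effect can be had from the SequenceFile reader with a single 
call is not shown here; the sketch only demonstrates why a value-only read 
removes the dependency on the key class.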

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
