[ https://issues.apache.org/jira/browse/MAHOUT-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841382#action_12841382 ]
Jake Mannix commented on MAHOUT-322:
------------------------------------

Meaning what, Robin? We can certainly come up with lots of ways to start with a SequenceFile<Writable,VectorWritable> and end up with a SequenceFile<IntWritable,VectorWritable> (I currently do it by simply iterating over the sequence file and assigning increasing integers - it's not scalable, but it's plenty fast enough for most purposes, and you only have to do it once per matrix - you store the output IntWritable,VectorWritable matrix and IntWritable,Writable dictionary files separately and they can be reused).

> DistributedRowMatrix should live in SequenceFile<Writable,VectorWritable> instead of SequenceFile<IntWritable,VectorWritable>
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-322
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-322
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.3
>            Reporter: Danny Leshem
>            Priority: Minor
>             Fix For: 0.3
>
>
> Class documentation for org.apache.mahout.math.hadoop.DistributedRowMatrix
> states that the matrix lives in SequenceFile<WritableComparable,
> VectorWritable>. The implementation, however, assumes that a
> SequenceFile<IntWritable, VectorWritable> is passed.
> Currently, usage of this class inside Mahout is limited to Jake Mannix's SVD
> package, mainly to perform PCA on a massive document corpus. Given such a
> corpus, it makes sense not to limit the user by forcing the document "key" to
> be an integer. Instead, users should be able to use Text keys (document name or
> id) or keys of any other arbitrary class. One may even argue that
> forcing a WritableComparable key is too limiting, and that a simple Writable key
> should be assumed.
> In fact, it would be best if DistributedRowMatrix did not read the
> SequenceFile key at all, to allow user-specific classes (unknown to Mahout)
> to be used as opaque keys even when their libraries are not available at
> runtime. Currently DistributedRowMatrix calls "reader.next(i, v)"... but
> reader has methods to query just the value, avoiding key deserialization
> altogether.
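
For illustration, the one-time re-keying pass described in the comment (iterate over the rows once, assign increasing integers, and keep a separate dictionary from each integer back to the original key) might look roughly like the sketch below. This is not Mahout or Hadoop code: plain Java collections stand in for the SequenceFiles, double[] stands in for VectorWritable, and the names RowReindexer, Result and reindex are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** In-memory sketch of the re-keying pass; all names here are hypothetical. */
public class RowReindexer {

    /** Output of the pass: int-keyed rows plus an int -> original-key dictionary. */
    public static final class Result<K> {
        // Stand-in for the SequenceFile<IntWritable,VectorWritable> matrix.
        public final Map<Integer, double[]> matrix = new LinkedHashMap<>();
        // Stand-in for the SequenceFile<IntWritable,Writable> dictionary.
        public final Map<Integer, K> dictionary = new LinkedHashMap<>();
    }

    /** Single sequential pass: the n-th row seen gets integer key n (from 0). */
    public static <K> Result<K> reindex(Iterable<? extends Map.Entry<K, double[]>> rows) {
        Result<K> out = new Result<>();
        int next = 0;
        for (Map.Entry<K, double[]> row : rows) {
            out.matrix.put(next, row.getValue());
            out.dictionary.put(next, row.getKey());
            next++;
        }
        return out;
    }

    public static void main(String[] args) {
        // Rows keyed by an arbitrary type (String here, as a stand-in for Text).
        List<Map.Entry<String, double[]>> rows = new ArrayList<>();
        rows.add(new AbstractMap.SimpleEntry<>("doc-a", new double[] {1.0, 0.0}));
        rows.add(new AbstractMap.SimpleEntry<>("doc-b", new double[] {0.0, 2.0}));

        Result<String> r = reindex(rows);
        System.out.println(r.dictionary); // prints {0=doc-a, 1=doc-b}
    }
}
```

Because the pass is a single sequential loop with one counter, it does not parallelize, which matches the "not scalable" caveat in the comment; the two outputs are kept separate so that the int-keyed matrix can be reused without the dictionary.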