[
https://issues.apache.org/jira/browse/MAHOUT-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993194#comment-12993194
]
Ted Dunning commented on MAHOUT-322:
------------------------------------
I like Item 1 (a lot).
Item 2 is implied by Item 1, so I am on board with that.
Item 3 can be done lazily, except that the SVD really ought to work. This is a
very natural workflow, and having doc-ids on the rows at the end would really
be handy.
> DistributedRowMatrix should live in SequenceFile<Writable,VectorWritable>
> instead of SequenceFile<IntWritable,VectorWritable>
> -----------------------------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-322
> URL: https://issues.apache.org/jira/browse/MAHOUT-322
> Project: Mahout
> Issue Type: Improvement
> Components: Math
> Affects Versions: 0.3
> Reporter: Danny Leshem
> Assignee: Jake Mannix
> Priority: Minor
>
> Class documentation for org.apache.mahout.math.hadoop.DistributedRowMatrix
> states that the matrix lives in SequenceFile<WritableComparable,
> VectorWritable>. Implementation, however, assumes SequenceFile<IntWritable,
> VectorWritable> is passed.
> Currently, usage of this class inside Mahout is limited to Jake Mannix's SVD
> package, mainly to perform PCA on a massive document corpus. Given such a
> corpus, it makes sense not to limit the user by forcing the document "key" to
> be an integer. Instead, users should be able to use Text keys (document name or
> id) or keys made of any other arbitrary class. One may even argue that
> forcing a WritableComparable key is too limiting, and a simple Writable key
> should be assumed.
> In fact, it would be best if DistributedRowMatrix did not read the
> SequenceFile key at all, to allow user-specific classes (unknown to Mahout)
> to be used as opaque keys even when their libraries are not available at
> runtime. Currently DistributedRowMatrix calls "reader.next(i, v)"... but
> reader has methods to query just the value, avoiding key deserialization
> altogether.
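
To illustrate the opaque-key idea in the description above, here is a minimal, self-contained sketch. It does not use Hadoop's SequenceFile API or the real SequenceFile format. It assumes a hypothetical length-prefixed record layout (key length, key bytes, value length, row doubles), and the names OpaqueKeyRows, writeRow and readRowsIgnoringKeys are made up for this example. The point is only that a reader can step over the key bytes without ever instantiating the key class, so the key type can be anything the user likes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a keyed row file; NOT Hadoop's SequenceFile format.
public class OpaqueKeyRows {

    // Writes one record as: keyLength, keyBytes, valueLength, row doubles.
    public static void writeRow(DataOutputStream out, byte[] keyBytes, double[] row)
            throws IOException {
        out.writeInt(keyBytes.length);
        out.write(keyBytes);
        out.writeInt(row.length);
        for (double d : row) {
            out.writeDouble(d);
        }
    }

    // Reads every row vector, stepping over the key bytes without interpreting them.
    public static List<double[]> readRowsIgnoringKeys(byte[] data) throws IOException {
        List<double[]> rows = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (in.available() > 0) {
                int keyLength = in.readInt();
                // The key stays opaque: no key class is loaded or deserialized.
                if (in.skipBytes(keyLength) != keyLength) {
                    throw new EOFException("truncated key");
                }
                int n = in.readInt();
                double[] row = new double[n];
                for (int i = 0; i < n; i++) {
                    row[i] = in.readDouble();
                }
                rows.add(row);
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        // One key is a document name, the other is arbitrary bytes of an unknown type.
        writeRow(out, "doc-a".getBytes(StandardCharsets.UTF_8), new double[] {1.0, 2.0});
        writeRow(out, new byte[] {7, 7, 7}, new double[] {3.5});
        out.flush();

        for (double[] row : readRowsIgnoringKeys(buf.toByteArray())) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The trade-off matches the comment above: a reader that never looks at keys cannot put doc-ids back on the rows at the end of the SVD, so a real implementation would probably still need to carry the raw key bytes along somewhere.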
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira