[
https://issues.apache.org/jira/browse/MAHOUT-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993294#comment-12993294
]
Dmitriy Lyubimov commented on MAHOUT-322:
-----------------------------------------
I can look at items 1 and 2 for the DistributedRowMatrix class, sure (although I
suspect it's just a javadoc change). But I will be pretty slow and probably won't
get to it until the weekend after next. Given that the scope is probably just a
javadoc (and philosophical) change for the most part, there's probably someone
who can do it sooner.
I can definitely review MAHOUT-593 to reflect this.
> DistributedRowMatrix should live in SequenceFile<Writable,VectorWritable>
> instead of SequenceFile<IntWritable,VectorWritable>
> -----------------------------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-322
> URL: https://issues.apache.org/jira/browse/MAHOUT-322
> Project: Mahout
> Issue Type: Improvement
> Components: Math
> Affects Versions: 0.3
> Reporter: Danny Leshem
> Assignee: Jake Mannix
> Priority: Minor
>
> Class documentation for org.apache.mahout.math.hadoop.DistributedRowMatrix
> states that the matrix lives in SequenceFile<WritableComparable,
> VectorWritable>. Implementation, however, assumes SequenceFile<IntWritable,
> VectorWritable> is passed.
> Currently, usage of this class inside Mahout is limited to Jake Mannix's SVD
> package, mainly to perform PCA on a massive document corpus. Given such a
> corpus, it makes sense not to limit the user by forcing the document "key" to
> be an integer. Instead, users should be able to use Text keys (document name or
> id) or keys made of any other arbitrary class. One may even argue that
> forcing a WritableComparable key is too limiting, and a simple Writable key
> should be assumed.
> In fact, it would be best if DistributedRowMatrix did not read the
> SequenceFile key at all, to allow user-specific classes (unknown to Mahout)
> to be used as opaque keys even when their libraries are not available at
> runtime. Currently DistributedRowMatrix calls "reader.next(i, v)"... but
> reader has methods to query just the value, avoiding key deserialization
> altogether.
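
To make the "opaque key" idea in the description concrete, here is a minimal sketch in plain Java. It does not use Hadoop and does not reproduce the SequenceFile format. The record layout (length-prefixed key bytes followed by a length-prefixed row of doubles) and the names OpaqueKeyRows, writeRow, readRowsIgnoringKeys and skipFully are all hypothetical, chosen only for illustration. The point it shows is that a reader can step over the key bytes without ever deserializing them, so the key class does not need to be on the classpath. Identifying each row by its position in the file is a further assumption of this sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a key/value row file; NOT Hadoop's SequenceFile format.
class OpaqueKeyRows {

    // Appends one record: [key length][key bytes][row length][row values].
    static void writeRow(DataOutputStream out, byte[] keyBytes, double[] row) throws IOException {
        out.writeInt(keyBytes.length);
        out.write(keyBytes);
        out.writeInt(row.length);
        for (double value : row) {
            out.writeDouble(value);
        }
    }

    // Reads every row vector while stepping over the key bytes unparsed.
    // Rows are returned in file order, so the list index acts as the row id.
    static List<double[]> readRowsIgnoringKeys(byte[] data) throws IOException {
        List<double[]> rows = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (in.available() > 0) {
                int keyLength = in.readInt();
                skipFully(in, keyLength); // the key is never deserialized
                double[] row = new double[in.readInt()];
                for (int i = 0; i < row.length; i++) {
                    row[i] = in.readDouble();
                }
                rows.add(row);
            }
        }
        return rows;
    }

    // skipBytes may skip fewer bytes than requested, so loop until done.
    static void skipFully(DataInputStream in, int count) throws IOException {
        int skipped = 0;
        while (skipped < count) {
            int n = in.skipBytes(count - skipped);
            if (n <= 0) {
                throw new EOFException("record truncated inside key");
            }
            skipped += n;
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            // One text-like key and one key of arbitrary bytes.
            writeRow(out, "doc-a".getBytes(StandardCharsets.UTF_8), new double[] {1.0, 2.0});
            writeRow(out, new byte[] {7, 7, 7}, new double[] {3.0, 4.0});
        }
        List<double[]> rows = readRowsIgnoringKeys(buffer.toByteArray());
        for (int i = 0; i < rows.size(); i++) {
            System.out.println("row " + i + " = " + Arrays.toString(rows.get(i)));
        }
    }
}
```

The trade-off in this sketch is that once the key is skipped, the only row identity left is file order, so any caller that needs the original document key would have to recover it separately.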