[ 
https://issues.apache.org/jira/browse/MAHOUT-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12842213#action_12842213
 ] 

Jake Mannix commented on MAHOUT-322:
------------------------------------

Danny's original point, that SVD (for example) should be able to run directly 
on a SequenceFile<Text,VectorWritable>, is valid and in fact already taken care 
of.  DistributedRowMatrix assumes SequenceFile<IntWritable,VectorWritable> only 
when building Iterator instances; it makes no such assumption when handed to a 
DistributedLanczosSolver.  The solver will happily find the right singular 
vectors of a matrix whose keys are of any WritableComparable class.  As Danny 
mentions, it's easy to generalize this to Writable, and just as easy to change 
the iterator to use a synthetic integer counter instead of the keys, so that it 
can walk over rows with other Writable keys.
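
To make the synthetic-counter idea concrete, here is a rough sketch in plain 
Java.  It is not the Mahout code: SyntheticIndexRows and IndexedRow are 
hypothetical names, a double[] stands in for VectorWritable, and a plain 
Map.Entry iterator stands in for the SequenceFile reader.  The only point is 
that the row index comes from a running counter and the key is never looked at.

```java
import java.util.Iterator;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Mahout.
final class SyntheticIndexRows {

  /** One matrix row tagged with a synthetic, position-based index. */
  static final class IndexedRow {
    final int index;
    final double[] vector; // stand-in for VectorWritable

    IndexedRow(int index, double[] vector) {
      this.index = index;
      this.vector = vector;
    }
  }

  /** Wraps (key, vector) records, ignoring the keys and numbering rows 0, 1, 2, ... */
  static <K> Iterator<IndexedRow> iterate(
      final Iterator<? extends Map.Entry<K, double[]>> records) {
    return new Iterator<IndexedRow>() {
      private int counter = 0;

      @Override
      public boolean hasNext() {
        return records.hasNext();
      }

      @Override
      public IndexedRow next() {
        // Only the value is used; the key may be of any type.
        double[] vector = records.next().getValue();
        return new IndexedRow(counter++, vector);
      }
    };
  }
}
```

Note that the indices produced this way reflect the order of the records in 
the file, not anything about the original keys.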

Such a SequenceFile<Writable,VectorWritable> would not support transpose(), 
times(Vector) or times(DistributedRowMatrix), but error checking in the mappers 
for those jobs could throw a descriptive exception telling the user why, while 
iterator() and timesSquared() should still work fine.
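
The kind of check I have in mind might look roughly like the following.  This 
is a sketch under assumptions, not existing Mahout code: KeyChecks and 
requireIntKey are hypothetical names, and Integer stands in for IntWritable so 
that the example runs without Hadoop on the classpath.

```java
// Hypothetical helper; a real mapper would test for IntWritable instead of Integer.
final class KeyChecks {

  /** Returns the row index, or fails with a message naming the operation and the key type. */
  static int requireIntKey(Object key, String operation) {
    if (!(key instanceof Integer)) {
      String found = (key == null) ? "null" : key.getClass().getName();
      throw new IllegalArgumentException(
          operation + " needs integer row keys, but found a key of type " + found);
    }
    return (Integer) key;
  }
}
```

A mapper for transpose() would call something like this once per record, so 
the job fails with a readable message instead of a ClassCastException.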

> DistributedRowMatrix should live in SequenceFile<Writable,VectorWritable> 
> instead of SequenceFile<IntWritable,VectorWritable>
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-322
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-322
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.3
>            Reporter: Danny Leshem
>            Priority: Minor
>             Fix For: 0.3
>
>
> Class documentation for org.apache.mahout.math.hadoop.DistributedRowMatrix 
> states that the matrix lives in SequenceFile<WritableComparable, 
> VectorWritable>. Implementation, however, assumes SequenceFile<IntWritable, 
> VectorWritable> is passed.
> Currently, usage of this class inside Mahout is limited to Jake Mannix's SVD 
> package, mainly to perform PCA on a massive document corpus. Given such a 
> corpus, it makes sense not to limit the user by forcing the document "key" to 
> be an integer. Instead, users should be able to use Text keys (document name or 
> id) or keys made of any other arbitrary class. One may even argue that 
> forcing a WritableComparable key is too limiting, and a simple Writable key 
> should be assumed.
> In fact, it would be best if DistributedRowMatrix did not read the 
> SequenceFile key at all, to allow user-specific classes (unknown to Mahout) 
> to be used as opaque keys even when their libraries are not available at 
> runtime. Currently DistributedRowMatrix calls "reader.next(i, v)"... but 
> reader has methods to query just the value, avoiding key deserialization 
> altogether.
