[jira] Commented: (MAHOUT-322) DistributedRowMatrix should live in SequenceFile<Writable,VectorWritable> instead of SequenceFile<IntWritable,VectorWritable>

Jake Mannix (JIRA) Thu, 04 Mar 2010 08:37:51 -0800

    [ 
https://issues.apache.org/jira/browse/MAHOUT-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841371#action_12841371
 ]


Jake Mannix commented on MAHOUT-322:
------------------------------------

The implementation of some of the original methods of DistributedRowMatrix did 
not assume integer keys - in particular, 
DistributedRowMatrix.timesSquared(Vector) (used for SVD) needs to know nothing 
about the keys, only the values, and it certainly could be any Writable.
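
To illustrate why the keys do not matter here, below is a minimal in-memory 
sketch, assuming timesSquared(v) computes (A^T A) v as a sum over rows of 
(row . v) * row.  The class TimesSquaredSketch and the Map-of-arrays layout are 
hypothetical stand-ins for illustration only, not Mahout's Hadoop 
implementation.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: rows are held in memory, keyed by an arbitrary type K.
class TimesSquaredSketch {

    // Accumulates sum over rows of (row . v) * row, which equals (A^T A) v.
    // Only rows.values() is read, so the key type K is never inspected.
    static <K> double[] timesSquared(Map<K, double[]> rows, double[] v) {
        double[] result = new double[v.length];
        for (double[] row : rows.values()) {
            double dot = 0.0;
            for (int j = 0; j < v.length; j++) {
                dot += row[j] * v[j];
            }
            for (int j = 0; j < v.length; j++) {
                result[j] += dot * row[j];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Text-like keys work, because they are ignored.
        Map<String, double[]> rows = new LinkedHashMap<>();
        rows.put("doc-a", new double[] {1, 2});
        rows.put("doc-b", new double[] {3, 4});
        System.out.println(Arrays.toString(timesSquared(rows, new double[] {1, 1})));
    }
}
```

Run as-is, this prints [24.0, 34.0].  The result is indexed by column only, so 
nothing about the row keys survives into the output.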

The problem comes when you try to extend this to other methods: transpose(), 
times(Vector), and times(Matrix) *all* require that you have keys for the rows 
which match up to some other thing's keys.  If you do transpose() and expect to 
return some form of VectorIterable back, then your original row-keys have to 
become the column *indexes* of the result.  If you do times(Vector) and expect 
to get a Vector back, your original row-keys must turn into the indexes of the 
result vector, and so on.
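
A rough in-memory sketch of that dependency, assuming integer row keys.  The 
class RowKeySketch and its method signatures are hypothetical and only mirror 
the shape of the operations, not the actual DistributedRowMatrix code.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a row matrix held as a map from int row key to row values.
class RowKeySketch {

    // A * v: each row key becomes the index of one entry in the result vector.
    static double[] times(Map<Integer, double[]> rows, int numRows, double[] v) {
        double[] result = new double[numRows];
        for (Map.Entry<Integer, double[]> e : rows.entrySet()) {
            double[] row = e.getValue();
            double dot = 0.0;
            for (int j = 0; j < v.length; j++) {
                dot += row[j] * v[j];
            }
            result[e.getKey()] = dot; // only possible because the key is an int
        }
        return result;
    }

    // Transpose: each row key becomes a column index in every output row.
    static Map<Integer, double[]> transpose(Map<Integer, double[]> rows,
                                            int numRows, int numCols) {
        Map<Integer, double[]> t = new TreeMap<>();
        for (int j = 0; j < numCols; j++) {
            t.put(j, new double[numRows]);
        }
        for (Map.Entry<Integer, double[]> e : rows.entrySet()) {
            double[] row = e.getValue();
            for (int j = 0; j < numCols; j++) {
                t.get(j)[e.getKey()] = row[j];
            }
        }
        return t;
    }

    public static void main(String[] args) {
        Map<Integer, double[]> rows = new TreeMap<>();
        rows.put(0, new double[] {1, 2});
        rows.put(1, new double[] {3, 4});
        System.out.println(Arrays.toString(times(rows, 2, new double[] {1, 1})));
        System.out.println(Arrays.toString(transpose(rows, 2, 2).get(0)));
    }
}
```

With an opaque Writable key in place of Integer, neither result[e.getKey()] nor 
t.get(j)[e.getKey()] has an obvious meaning without some key-to-index mapping.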

<quote>
In fact, it would be best if DistributedRowMatrix did not read the SequenceFile 
key at all, to allow user-specific classes (unknown to Mahout) to be used as 
opaque keys even when their libraries are not available in runtime. Currently 
DistributedRowMatrix calls "reader.next(i, v)"... but reader has methods to 
query just the value, avoiding key deserialization altogether.
</quote>

This is just for doing iteration.  Iteration is a non-scalable operation 
(you're pulling data from HDFS back to wherever you are calling this from).  
The "meat" of a DistributedRowMatrix is in the Hadoop jobs which are run when 
you call timesSquared(Vector), times(Vector), times(Matrix), etc. (soon enough, 
transmutation methods like assign(UnaryFunction f) and the like will be added, 
which map over the vectors).

These methods really do require a choice to be made about the keys for the 
rows.  This having been said, there is a notion in Mahout's matrix library of 
column and row bindings, which are Map<String, Integer>.  Generalizing this to 
allow generic object keys for the row and column indexes for a 
DistributedRowMatrix is something we can consider.  I would want to see what 
the use case is, however.  Having the keys for rows be objects is one thing, but 
doing this all the time for the keys for the Vector indexes will seriously slow 
down inner loops, due to the translation time from object to int (via a 
multitude of hashCode() calls), and treating the rows and columns on equal 
footing is pretty much required.
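
As a sketch of what a binding-style approach might look like: labels are 
translated to int indexes once, up front, and the inner loop then touches only 
ints.  BindingSketch, bind() and dot() are hypothetical names for illustration 
and do not correspond to an existing Mahout API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of label-to-index bindings in the Map<String, Integer> style.
class BindingSketch {

    // Assigns a dense int index to each distinct label, in first-seen order.
    static Map<String, Integer> bind(String... labels) {
        Map<String, Integer> bindings = new HashMap<>();
        for (String label : labels) {
            bindings.putIfAbsent(label, bindings.size());
        }
        return bindings;
    }

    // Inner loop works on plain int indexes: no hashCode() calls in here.
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            sum += a[j] * b[j];
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> rowBindings = bind("doc-a", "doc-b");
        double[][] rows = {{1, 2}, {3, 4}};

        // One hash lookup per row label, outside the numeric loop.
        int rowIndex = rowBindings.get("doc-b");
        System.out.println(dot(rows[rowIndex], new double[] {1, 1}));
    }
}
```

The cost concern above is about the opposite arrangement, where every access 
to a Vector index inside dot() would first go through a lookup like 
rowBindings.get(...).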


> DistributedRowMatrix should live in SequenceFile<Writable,VectorWritable> 
> instead of SequenceFile<IntWritable,VectorWritable>
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-322
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-322
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.3
>            Reporter: Danny Leshem
>            Priority: Minor
>             Fix For: 0.3
>
>
> Class documentation for org.apache.mahout.math.hadoop.DistributedRowMatrix 
> states that the matrix lives in SequenceFile<WritableComparable, 
> VectorWritable>. Implementation, however, assumes SequenceFile<IntWritable, 
> VectorWritable> is passed.
> Currently, usage of this class inside Mahout is limited to Jake Mannix's SVD 
> package, mainly to perform PCA on a massive document corpus. Given such 
> corpus, it makes sense to not limit the user by forcing the document "key" to 
> be integer. Instead, users should be able to use Text keys (document name or 
> id) or keys made of any other arbitrary class. One may even argue that 
> forcing a WritableComparable key is too limiting, and a simple Writable key 
> should be assumed.
> In fact, it would be best if DistributedRowMatrix did not read the 
> SequenceFile key at all, to allow user-specific classes (unknown to Mahout) 
> to be used as opaque keys even when their libraries are not available in 
> runtime. Currently DistributedRowMatrix calls "reader.next(i, v)"... but 
> reader has methods to query just the value, avoiding key deserialization 
> altogether.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
