[ https://issues.apache.org/jira/browse/MAHOUT-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841371#action_12841371 ]
Jake Mannix commented on MAHOUT-322:
------------------------------------

The implementation of some of the original methods of DistributedRowMatrix did not assume integer keys. In particular, DistributedRowMatrix.timesSquared(Vector) (used for SVD) needs to know nothing about the keys, only the values, so the keys could certainly be any Writable.

The problem comes when you try to extend this to other methods: transpose(), times(Vector), and times(Matrix) *all* require that the keys for the rows match up with some other thing's keys. If you do transpose() and expect to get some form of VectorIterable back, then your original row keys have to become the column *indexes* of the result. If you do times(Vector) and expect to get a Vector back, your original row keys must turn into the indexes of the result vector, and so on.

<quote>
In fact, it would be best if DistributedRowMatrix did not read the SequenceFile key at all, to allow user-specific classes (unknown to Mahout) to be used as opaque keys even when their libraries are not available in runtime. Currently DistributedRowMatrix calls "reader.next(i, v)"... but reader has methods to query just the value, avoiding key deserialization altogether.
<quote>

This is just for doing iteration. Iteration is a non-scalable operation (you're pulling data from HDFS back to wherever you are calling this from). The "meat" of a DistributedRowMatrix is in the Hadoop jobs which are run when you call timesSquared(Vector), times(Vector), times(Matrix), etc. (soon enough, transmutation methods like assign(UnaryFunction f) and the like will be added, which map over the vectors). These methods really do require a choice to be made about the keys for the rows.

This having been said, there is a notion in Mahout's matrix library of column and row bindings, which are Map<String, Integer>. Generalizing this to allow generic object keys for the row and column indexes of a DistributedRowMatrix is something we can consider.
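To make the binding idea concrete, here is a minimal plain-Java sketch. It is not Mahout code: the class RowBindingSketch and its methods bindRows and times are hypothetical names invented for illustration, and rows are held in an in-memory Map rather than a SequenceFile. It only shows why a times(Vector)-style result needs every row key to resolve to an int index, and where the object-to-int translation cost would sit.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical sketch (not Mahout's API): rows are keyed by arbitrary labels,
 * and a Map<String, Integer> "row binding" translates labels to int indexes.
 */
class RowBindingSketch {

    /** Assign each distinct row label the next free int index, in first-seen order. */
    public static Map<String, Integer> bindRows(Iterable<String> labels) {
        Map<String, Integer> bindings = new LinkedHashMap<>();
        for (String label : labels) {
            bindings.putIfAbsent(label, bindings.size());
        }
        return bindings;
    }

    /**
     * Compute y = A * x for rows keyed by label. The result is a dense double[],
     * so every row label has to resolve to an int position in y.
     * Assumes each row has the same length as x.
     */
    public static double[] times(Map<String, double[]> rows,
                                 Map<String, Integer> bindings,
                                 double[] x) {
        double[] y = new double[bindings.size()];
        for (Map.Entry<String, double[]> row : rows.entrySet()) {
            // One hash lookup per row: the object-to-int translation happens here.
            Integer index = bindings.get(row.getKey());
            if (index == null) {
                throw new IllegalArgumentException("unbound row key: " + row.getKey());
            }
            double[] values = row.getValue();
            double dot = 0.0;
            for (int j = 0; j < values.length; j++) {
                dot += values[j] * x[j]; // the inner loop works on int indexes only
            }
            y[index] = dot;
        }
        return y;
    }

    public static void main(String[] args) {
        Map<String, double[]> rows = new LinkedHashMap<>();
        rows.put("doc-a", new double[] {1.0, 2.0});
        rows.put("doc-b", new double[] {3.0, 4.0});
        Map<String, Integer> bindings = bindRows(rows.keySet());
        double[] y = times(rows, bindings, new double[] {1.0, 1.0});
        System.out.println(bindings + " -> " + Arrays.toString(y));
    }
}
```

In this sketch the lookup happens once per row, which is cheap. If the column indexes inside each vector also went through such a map, the lookup would move into the inner loop.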
I would want to see what the use case is, however. Having the row keys be objects is one thing, but doing this all the time for the keys of the Vector indexes will seriously slow down inner loops, due to the time spent translating from object to int (via a multitude of hashCode() calls), and treating the rows and columns on an equal footing is pretty much required.

> DistributedRowMatrix should live in SequenceFile<Writable,VectorWritable>
> instead of SequenceFile<IntWritable,VectorWritable>
> -----------------------------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-322
> URL: https://issues.apache.org/jira/browse/MAHOUT-322
> Project: Mahout
> Issue Type: Improvement
> Components: Math
> Affects Versions: 0.3
> Reporter: Danny Leshem
> Priority: Minor
> Fix For: 0.3
>
>
> Class documentation for org.apache.mahout.math.hadoop.DistributedRowMatrix
> states that the matrix lives in SequenceFile<WritableComparable,
> VectorWritable>. Implementation, however, assumes SequenceFile<IntWritable,
> VectorWritable> is passed.
> Currently, usage of this class inside Mahout is limited to Jake Mannix's SVD
> package, mainly to perform PCA on a massive document corpus. Given such
> corpus, it makes sense to not limit the user by forcing the document "key" to
> be integer. Instead, users should be able to use Text keys (document name or
> id) or keys made of any other arbitrary class. One may even argue that
> forcing a WritableComparable key is too limiting, and a simple Writable key
> should be assumed.
> In fact, it would be best if DistributedRowMatrix did not read the
> SequenceFile key at all, to allow user-specific classes (unknown to Mahout)
> to be used as opaque keys even when their libraries are not available in
> runtime. Currently DistributedRowMatrix calls "reader.next(i, v)"... but
> reader has methods to query just the value, avoiding key deserialization
> altogether.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.