[ 
https://issues.apache.org/jira/browse/MAHOUT-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992937#comment-12992937
 ] 

Sean Owen commented on MAHOUT-322:
----------------------------------

To clarify: there is nothing wrong, from a Java generics perspective, with 
"SequenceFile.Reader<? extends Writable,? extends Writable>". But that's not 
really the issue here.

Obviously, however, when creating a SequenceFile.Writer to write data, the
caller has to specify concrete key/value classes at some point, or else the
writer doesn't know how to write the data.

SequenceFile happens to write the key/value class name to the file. So, the 
reader doesn't have to know the key/value class ahead of time when creating 
SequenceFile.Reader on the other side.

This is not generally true for Writables. A Writable serializes just some
sequence of bytes. The reader has to know what it's reading ahead of time to
instantiate a Writable that can make sense of the bytes.
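
To make that concrete, here is a minimal sketch in plain Java. It does not use
Hadoop: the Writable interface below is a local stand-in that only mirrors the
shape of org.apache.hadoop.io.Writable (write(DataOutput) and
readFields(DataInput)), and IntBox, toBytes and fromBytes are hypothetical
names made up for the example. The point is that the bytes carry no type
information, so the reader has to pick the class itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class PlainWritableDemo {

    /** Local stand-in mirroring the shape of Hadoop's Writable (assumption). */
    interface Writable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** Hypothetical Writable that holds a single int. */
    static class IntBox implements Writable {
        int value;
        IntBox() {}
        IntBox(int value) { this.value = value; }
        public void write(DataOutput out) throws IOException { out.writeInt(value); }
        public void readFields(DataInput in) throws IOException { value = in.readInt(); }
    }

    /** Serializes a Writable; nothing about its class is recorded. */
    static byte[] toBytes(Writable w) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        w.write(new DataOutputStream(buf));
        return buf.toByteArray();
    }

    /** The caller must already hold an instance of the right class. */
    static void fromBytes(byte[] bytes, Writable target) throws IOException {
        target.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = toBytes(new IntBox(42));
        IntBox decoded = new IntBox();   // reader chooses the class up front
        fromBytes(bytes, decoded);
        System.out.println(bytes.length + " bytes, value " + decoded.value);
        // prints "4 bytes, value 42"
    }
}
```

Nothing in those 4 bytes says "IntBox"; a reader that guessed a different
class would either misread them or fail.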

The exception is a Writable that does the same thing that SequenceFile does:
it writes the value class to disk too. That's Ted's trick. (This is what
Vector used to do too, but it introduced way too much overhead. It just shows
that it's easy to write stuff that works but spends gigabytes "invisibly".
It's not for all use cases.)
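
A rough sketch of that idea, again in plain Java with no Hadoop dependency.
This is not Ted's code nor the old Vector code; the Writable interface is a
local stand-in, and IntBox, writeTagged and readTagged are hypothetical names.
The class name is written in front of every record, and the reader uses it to
instantiate the right class by reflection. The printed sizes show where the
per-record overhead comes from.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class SelfDescribingDemo {

    /** Local stand-in mirroring the shape of Hadoop's Writable (assumption). */
    interface Writable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** Hypothetical Writable; the no-arg constructor is needed for reflection. */
    public static class IntBox implements Writable {
        int value;
        public IntBox() {}
        public IntBox(int value) { this.value = value; }
        public void write(DataOutput out) throws IOException { out.writeInt(value); }
        public void readFields(DataInput in) throws IOException { value = in.readInt(); }
    }

    /** Writes the concrete class name first, then the payload. */
    static byte[] writeTagged(Writable w) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF(w.getClass().getName());
        w.write(out);
        out.flush();
        return buf.toByteArray();
    }

    /** Reads the class name, instantiates that class, then reads the payload. */
    static Writable readTagged(byte[] bytes)
            throws IOException, ReflectiveOperationException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        String className = in.readUTF();
        Writable w = (Writable) Class.forName(className)
                .getDeclaredConstructor().newInstance();
        w.readFields(in);
        return w;
    }

    public static void main(String[] args) throws Exception {
        byte[] tagged = writeTagged(new IntBox(42));
        Writable back = readTagged(tagged);   // no class supplied by the caller
        System.out.println("decoded as " + back.getClass().getName());
        System.out.println("payload 4 bytes, tagged record " + tagged.length + " bytes");
    }
}
```

The 4-byte payload ends up several times larger once the class name is
attached, and that cost is paid again for every record. SequenceFile avoids
this by recording the key/value class names once per file.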


But yes, I agree none of this is the core issue. The core issue is whether the
key in matrix serialization files is used. It is, right now, within the
project; see MatrixSlice. That could be undone, perhaps. Dmitriy, I'm not sure
what your ultimate opinion is on that.

> DistributedRowMatrix should live in SequenceFile<Writable,VectorWritable> 
> instead of SequenceFile<IntWritable,VectorWritable>
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-322
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-322
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.3
>            Reporter: Danny Leshem
>            Assignee: Jake Mannix
>            Priority: Minor
>
> Class documentation for org.apache.mahout.math.hadoop.DistributedRowMatrix 
> states that the matrix lives in SequenceFile<WritableComparable, 
> VectorWritable>. Implementation, however, assumes SequenceFile<IntWritable, 
> VectorWritable> is passed.
> Currently, usage of this class inside Mahout is limited to Jake Mannix's SVD 
> package, mainly to perform PCA on a massive document corpus. Given such 
> corpus, it makes sense to not limit the user by forcing the document "key" to 
> be integer. Instead, users should be able to use Text keys (document name or 
> id) or keys made of any other arbitrary class. One may even argue that 
> forcing a WritableComparable key is too limiting, and a simple Writable key 
> should be assumed.
> In fact, it would be best if DistributedRowMatrix did not read the 
> SequenceFile key at all, to allow user-specific classes (unknown to Mahout) 
> to be used as opaque keys even when their libraries are not available at 
> runtime. Currently DistributedRowMatrix calls "reader.next(i, v)"... but 
> reader has methods to query just the value, avoiding key deserialization 
> altogether.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
