[jira] Commented: (MAHOUT-322) DistributedRowMatrix should live in SequenceFile<Writable,VectorWritable> instead of SequenceFile<IntWritable,VectorWritable>

Dmitriy Lyubimov (JIRA) Wed, 09 Feb 2011 17:45:21 -0800

    [ 
https://issues.apache.org/jira/browse/MAHOUT-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992832#comment-12992832
 ]


Dmitriy Lyubimov commented on MAHOUT-322:
-----------------------------------------

I think that technically one can't have sequence files whose key or value 
type is abstract. I experimented with it enough times and never got any other 
outcome, but it is possible I am wrong.

If there were indeed such a thing as a sequence file saved with 
WritableComparable as its key class, one wouldn't be able to read it, since 
the sequence file reader can't instantiate a key object from an abstract 
class or interface. (It is possible to say "I don't care what the key is" by 
using SequenceFile.Reader<WritableComparable>, but that's not the same 
thing.) So I assume the issue description really means the ability of 
algorithms to accept arbitrary keys, not the actual file having 
WritableComparable as the key class in its header (which might imply assorted 
concrete types in the same file; I don't think that's supported).
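
To illustrate the instantiation point, here is a minimal sketch in plain Java. 
It assumes the reader builds an empty key object reflectively from the class 
recorded in the file header. WritableComparableLike, IntKey and 
KeyInstantiationDemo are hypothetical stand-ins, not Hadoop classes, and this 
is not the actual SequenceFile.Reader code.

```java
// Hypothetical stand-in for an abstract key type such as WritableComparable.
interface WritableComparableLike {}

// Hypothetical concrete key type with the no-arg constructor a reader needs.
final class IntKey implements WritableComparableLike {
    int value;
    IntKey() {}
}

final class KeyInstantiationDemo {
    /** Mimics a reader creating an empty key from the class named in a file header. */
    static Object newKey(Class<?> keyClass) throws ReflectiveOperationException {
        // An interface has no constructor at all; an abstract class has one
        // but newInstance() rejects it. Both surface as reflective failures.
        return keyClass.getDeclaredConstructor().newInstance();
    }

    /** Returns true if a key object of the given class can be created reflectively. */
    static boolean canInstantiate(Class<?> keyClass) {
        try {
            newKey(keyClass);
            return true;
        } catch (ReflectiveOperationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canInstantiate(IntKey.class));                 // concrete key type
        System.out.println(canInstantiate(WritableComparableLike.class)); // abstract key type
    }
}
```

The concrete class can be created and the interface cannot, which is why a 
header naming an abstract key type would leave the reader with nothing to 
deserialize into.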

Anyway, I agree that most of the BLAS and not-so-BLAS stuff is (or should be) 
key type agnostic. However, that is not the same as saying these algorithms 
should ignore the keys (I don't ignore them). The way I solved that problem in 
Stochastic SVD was exactly this: I don't care what type the row labels are. 
The labels just get copied over into the keys of the corresponding rows in the 
U output, and U is guaranteed to have the same key class name saved as the 
input A had. That's it. I tested it with both IntWritable keys and Text keys 
(which I think is what seq2sparse outputs).
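
A rough sketch of that convention in plain Java, with the key type left as a 
generic parameter. LabelPassThrough and scaleRows are hypothetical names, and 
row scaling only stands in for a real computation; this is not the Stochastic 
SVD code and it does not use the Hadoop API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class LabelPassThrough {
    /**
     * Scales every row by the given factor. The algorithm never inspects the
     * key: it copies each row's label to the matching output row, so the
     * output has the same key type as the input.
     */
    static <K> List<Map.Entry<K, double[]>> scaleRows(
            List<Map.Entry<K, double[]>> rows, double factor) {
        List<Map.Entry<K, double[]>> out = new ArrayList<>();
        for (Map.Entry<K, double[]> row : rows) {
            double[] values = row.getValue();
            double[] scaled = new double[values.length];
            for (int i = 0; i < values.length; i++) {
                scaled[i] = values[i] * factor;
            }
            // The label is carried over untouched.
            out.add(new SimpleEntry<>(row.getKey(), scaled));
        }
        return out;
    }
}
```

The same method works unchanged for integer labels or string labels, since 
nothing in it depends on what K is.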

I think there will always be such a paradigm: you have a bunch of algorithms 
that care about the label and those that don't. In that sense, it's nothing 
more than a convention. I don't support coercing DRM as a format into having 
any particular key type, but I don't see why a given algorithm couldn't 
require a particular key type if it has a good reason to.


> DistributedRowMatrix should live in SequenceFile<Writable,VectorWritable> 
> instead of SequenceFile<IntWritable,VectorWritable>
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-322
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-322
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.3
>            Reporter: Danny Leshem
>            Assignee: Jake Mannix
>            Priority: Minor
>
> Class documentation for org.apache.mahout.math.hadoop.DistributedRowMatrix 
> states that the matrix lives in SequenceFile<WritableComparable, 
> VectorWritable>. Implementation, however, assumes SequenceFile<IntWritable, 
> VectorWritable> is passed.
> Currently, usage of this class inside Mahout is limited to Jake Mannix's SVD 
> package, mainly to perform PCA on a massive document corpus. Given such 
> corpus, it makes sense to not limit the user by forcing the document "key" to 
> be integer. Instead, users should be able to use Text keys (document name or 
> id) or keys made of any other arbitrary class. One may even argue that 
> forcing a WritableComparable key is too limiting, and a simple Writable key 
> should be assumed.
> In fact, it would be best if DistributedRowMatrix did not read the 
> SequenceFile key at all, to allow user-specific classes (unknown to Mahout) 
> to be used as opaque keys even when their libraries are not available in 
> runtime. Currently DistributedRowMatrix calls "reader.next(i, v)"... but 
> reader has methods to query just the value, avoiding key deserialization 
> altogether.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
