[jira] Commented: (MAHOUT-322) DistributedRowMatrix should live in SequenceFile<Writable,VectorWritable> instead of SequenceFile<IntWritable,VectorWritable>

Dmitriy Lyubimov (JIRA) Thu, 10 Feb 2011 11:40:23 -0800

    [ 
https://issues.apache.org/jira/browse/MAHOUT-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993185#comment-12993185
 ]


Dmitriy Lyubimov commented on MAHOUT-322:
-----------------------------------------


{quote}Dmitriy I'm not sure what your ultimate opinion is on that.{quote}
Sorry, Sean. Let me clarify what I am thinking by example.
{quote}But yes the docs could be adjusted to reflect reality.{quote}

*1) General DRM format definition.*
If you mean the correction should be "DRM format is one or more sequence files 
of (IntWritable, VectorWritable) pairs", I probably don't support this. A 
counterexample is the seq2sparse -> Stochastic SVD path, which is how we use it 
for LSI: seq2sparse produces Text labels for keys (I need to check again, but I 
am pretty sure that's what it is). So I'd favor saying "DRM files are sequence 
files of (? extends WritableComparable, VectorWritable) pairs", i.e. more or 
less what the class documentation already says, per the issue description. Then 
we add: "Some algorithms may impose particular restrictions on what the 
sequence key may be". 
*2) Concrete algorithm docs.*
Concrete algorithm docs should define how they work with the keys and what they 
require the keys to be. E.g. I imagine DRM.transpose() may require the keys to 
be ints, since it needs to convert row indices into positional indices in the 
rows of the transposed matrix. So it should probably say "The transpose op 
requires row keys to be IntWritable. The output rows will also have IntWritable 
keys". The implementation would throw an error if the key is not what it 
expects. Similarly, the SVD contract should say "SVD copies row keys into the 
keys of the U matrix but does not require them to be any particular type other 
than extending WritableComparable. SVD outputs IntWritable as the key of the V 
output". That's the contract I use in the LSI pipeline for stochastic SVD. 
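
The two contracts in point 2 can be sketched in plain Java. This is only an 
illustration under stated assumptions: a java.util.Map stands in for the 
(key, vector) pairs of a DRM file, and KeyContractSketch, transpose and rowDot 
are hypothetical names, not Mahout or Hadoop APIs. rowDot is a simple 
key-preserving row operation, not an SVD; it only shows keys being copied to 
the output untouched, the way the proposed SVD contract copies them into U.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-ins: a Map models the (key, vector) pairs of a DRM file.
// None of these names are Mahout APIs.
final class KeyContractSketch {

    // Transpose contract: row keys must be Integer, and output rows are
    // Integer-keyed too, because row keys become column positions.
    static Map<Integer, Map<Integer, Double>> transpose(Map<?, Map<Integer, Double>> rows) {
        Map<Integer, Map<Integer, Double>> transposed = new TreeMap<>();
        for (Map.Entry<?, Map<Integer, Double>> row : rows.entrySet()) {
            Object key = row.getKey();
            if (!(key instanceof Integer)) {
                // Fail fast instead of silently coercing the key.
                throw new IllegalArgumentException(
                    "transpose requires Integer row keys, got: " + key);
            }
            Integer rowIndex = (Integer) key;
            for (Map.Entry<Integer, Double> cell : row.getValue().entrySet()) {
                transposed
                    .computeIfAbsent(cell.getKey(), col -> new TreeMap<>())
                    .put(rowIndex, cell.getValue());
            }
        }
        return transposed;
    }

    // Key-agnostic contract: the key type is never inspected, and each key is
    // copied to the output unchanged.
    static <K> Map<K, Double> rowDot(Map<K, Map<Integer, Double>> rows, double[] direction) {
        Map<K, Double> out = new LinkedHashMap<>();
        for (Map.Entry<K, Map<Integer, Double>> row : rows.entrySet()) {
            double sum = 0.0;
            for (Map.Entry<Integer, Double> cell : row.getValue().entrySet()) {
                sum += cell.getValue() * direction[cell.getKey()];
            }
            out.put(row.getKey(), sum);
        }
        return out;
    }
}
```

With Text-like (String) keys, rowDot works as is, while transpose rejects the 
input up front, which is the behavior the docs would then have to state.
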
*3) Implementations need to be fixed if they do key coercion without a good 
reason.* 
I suspect Lanczos SVD might be such a case. There's no theoretical reason why 
the SVD contract I showed above can't be supported. It may be inconvenient for 
the implementation to support it, but fundamentally there's no reason not to be 
key-agnostic there.
I am not sure whether MatrixSlice has a good reason for what it does. Either 
way, it should probably document that for the algorithms that depend on it. I 
can look at it in more detail if you want me to.

In short, I am lobbying for my LSI use case :) If you coerce every matrix 
exchange to have IntWritable keys, my LSI pipeline would require extra steps 
to convert file names used as keys into some hash keys.
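
To make that extra step concrete, here is a minimal sketch of what such a 
conversion pass might look like. It is an assumption-laden illustration, not 
code from the pipeline: KeyIndexSketch and its members are hypothetical names, 
plain Java collections stand in for sequence files, and it assigns sequential 
int indices with a side dictionary instead of hashing, since hashes of file 
names can collide.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: re-keys label-keyed rows with int indices and keeps
// the dictionary needed to map results back to the original labels.
final class KeyIndexSketch {
    // labels.get(i) is the original key of the row now stored under index i.
    final List<String> labels = new ArrayList<>();
    final Map<Integer, double[]> rows = new LinkedHashMap<>();

    static KeyIndexSketch fromLabeled(Map<String, double[]> labeledRows) {
        KeyIndexSketch out = new KeyIndexSketch();
        for (Map.Entry<String, double[]> entry : labeledRows.entrySet()) {
            int index = out.labels.size(); // next unused sequential index
            out.labels.add(entry.getKey());
            out.rows.put(index, entry.getValue());
        }
        return out;
    }

    // Reverse lookup, needed after the algorithm to recover document names.
    String labelOf(int rowIndex) {
        return labels.get(rowIndex);
    }
}
```

Both the forward pass and the reverse lookup are work that a key-agnostic 
contract would make unnecessary.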

> DistributedRowMatrix should live in SequenceFile<Writable,VectorWritable> 
> instead of SequenceFile<IntWritable,VectorWritable>
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-322
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-322
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.3
>            Reporter: Danny Leshem
>            Assignee: Jake Mannix
>            Priority: Minor
>
> Class documentation for org.apache.mahout.math.hadoop.DistributedRowMatrix 
> states that the matrix lives in SequenceFile<WritableComparable, 
> VectorWritable>. Implementation, however, assumes SequenceFile<IntWritable, 
> VectorWritable> is passed.
> Currently, usage of this class inside Mahout is limited to Jake Mannix's SVD 
> package, mainly to perform PCA on a massive document corpus. Given such 
> corpus, it makes sense to not limit the user by forcing the document "key" to 
> be integer. Instead, users should be able to use Text keys (document name or 
> id) or keys made of any other arbitrary class. One may even argue that 
> forcing a WritableComparable key is too limiting, and a simple Writable key 
> should be assumed.
> In fact, it would be best if DistributedRowMatrix did not read the 
> SequenceFile key at all, to allow user-specific classes (unknown to Mahout) 
> to be used as opaque keys even when their libraries are not available in 
> runtime. Currently DistributedRowMatrix calls "reader.next(i, v)"... but 
> reader has methods to query just the value, avoiding key deserialization 
> altogether.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
