[ 
https://issues.apache.org/jira/browse/MAHOUT-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961241#comment-13961241
 ] 

Pat Ferrel edited comment on MAHOUT-1507 at 4/5/14 9:22 PM:
------------------------------------------------------------

It is rather inelegant, but one way to tackle this might be a general, arbitrarily 
scalable way to create a bi-directional index that a user could employ when 
creating input. 

When I do this I have a single machine build an in-memory bi-directional map 
(Guava's HashBiMap), then write it out as a tab-delimited text file on HDFS. I do 
this for anything that needs a Mahout ID: rows and/or columns.
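
Roughly, the pattern looks like the sketch below. This is not Mahout API; the class and method names are made up for illustration, and it assumes Guava on the classpath and string external IDs. The inverse() view is what makes one structure usable in both directions.

{code:java}
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;

import com.google.common.collect.BiMap;
import com.google.common.collect.HashBiMap;

/** Hypothetical helper, not part of Mahout: external string IDs <-> Mahout ints. */
public class IdDictionary {

  // Guava's HashBiMap keeps both directions unique and gives an inverse() view.
  private final BiMap<String, Integer> dictionary = HashBiMap.create();

  /** Assigns sequential Mahout ints to external IDs as they are first seen. */
  public int mahoutIdFor(String externalId) {
    Integer id = dictionary.get(externalId);
    if (id == null) {
      id = dictionary.size();          // next unused int, 0-based and contiguous
      dictionary.put(externalId, id);
    }
    return id;
  }

  /** Reverse lookup: a Mahout int back to the external ID. */
  public String externalIdFor(int mahoutId) {
    return dictionary.inverse().get(mahoutId);
  }

  /** Persists the map as tab-delimited text; in practice the target would be HDFS. */
  public void write(String path) throws IOException {
    try (BufferedWriter out =
             Files.newBufferedWriter(Paths.get(path), StandardCharsets.UTF_8)) {
      for (Map.Entry<String, Integer> e : dictionary.entrySet()) {
        out.write(e.getKey() + "\t" + e.getValue());
        out.newLine();
      }
    }
  }
}
{code}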

Then, when I create Mahout input in MapReduce, every node must build this 
in-memory HashBiMap once for access by any job running on that node. This is not 
arbitrarily scalable and is also rather crude. 

When I get output from Mahout I do the same translation in reverse and write to a 
sequence file, text file, or database with the external IDs. 
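
On the output side the same file is read back and the inverse view used. Again a hedged sketch with made-up class names, not an existing Mahout utility:

{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import com.google.common.collect.BiMap;
import com.google.common.collect.HashBiMap;

/** Hypothetical helper: loads the dictionary and translates Mahout output IDs. */
public class IdDictionaryReader {

  /** Reads the tab-delimited dictionary (externalId \t mahoutId) back into a BiMap. */
  public static BiMap<String, Integer> read(String path) throws IOException {
    BiMap<String, Integer> dictionary = HashBiMap.create();
    List<String> lines = Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8);
    for (String line : lines) {
      String[] fields = line.split("\t");
      dictionary.put(fields[0], Integer.valueOf(fields[1]));
    }
    return dictionary;
  }

  /** Translates a Mahout row/column int found in the output to the external ID. */
  public static String translate(BiMap<String, Integer> dictionary, int mahoutId) {
    return dictionary.inverse().get(mahoutId);
  }
}
{code}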

There are MapReduce ways to create indexes that I've used with Pig, but I haven't 
had the chance to do this in Java. Joining the index with the Mahout IDs in the 
output can also be made fully scalable, as sketched below.
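
A fully scalable version of that join can be a plain reduce-side join. Below is a rough sketch against the Hadoop new (mapreduce) API, assuming both inputs are tab-delimited text, the dictionary as externalId \t mahoutId and the Mahout output as mahoutId \t value; all class names and field layouts are illustrative, not anything that ships with Mahout.

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Illustrative reduce-side join: dictionary (externalId \t mahoutId) with output (mahoutId \t value). */
public class JoinExternalIds {

  /** Keys each dictionary line by the Mahout int, tagging the value with "D:". */
  public static class DictionaryMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      ctx.write(new IntWritable(Integer.parseInt(fields[1])), new Text("D:" + fields[0]));
    }
  }

  /** Keys each Mahout output line by the Mahout int, tagging the value with "V:". */
  public static class OutputMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t", 2);
      ctx.write(new IntWritable(Integer.parseInt(fields[0])), new Text("V:" + fields[1]));
    }
  }

  /** Emits externalId \t value for every joined pair. */
  public static class JoinReducer extends Reducer<IntWritable, Text, Text, Text> {
    @Override
    protected void reduce(IntWritable mahoutId, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      String externalId = null;
      List<String> outputValues = new ArrayList<String>();
      for (Text v : values) {
        String s = v.toString();
        if (s.startsWith("D:")) {
          externalId = s.substring(2);
        } else {
          outputValues.add(s.substring(2));
        }
      }
      if (externalId != null) {
        for (String value : outputValues) {
          ctx.write(new Text(externalId), new Text(value));
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "join external ids");
    job.setJarByClass(JoinExternalIds.class);
    MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, DictionaryMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, OutputMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
{code}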

For full column and row keys, maybe a scalable version of this in-and-out 
translation is sufficient, especially if combined with Vectors that maintain 
some external row key. 

Maybe certain algorithms could support this in some special way. RSJ, for 
instance, needs only column keys for a self-join; a cross RSJ would be more 
problematic. 
 
We often get questions on the user list about what Mahout uses as IDs. Some 
people mistakenly think they can supply their own, so this seems to be a rather 
fundamental problem to solve. Attempts to make this easier were made early on in 
Mahout with PropertyVectors and NamedVectors; I'm just being the squeaky wheel to 
make sure we don't lose track of this usability issue.



> Support External/Foreign Keys/IDs for Vectors and Matrices
> ----------------------------------------------------------
>
>                 Key: MAHOUT-1507
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1507
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.9
>         Environment: Spark Scala
>            Reporter: Pat Ferrel
>              Labels: spark
>             Fix For: 1.0
>
>
> All users of Mahout have data which is addressed by keys or IDs of their own 
> devising. In order to use much of Mahout they must translate these IDs into 
> Mahout IDs, then run their jobs and translate back again when retrieving the 
> output. If the ID space is very large this is a difficult problem for users 
> to solve at scale.
> For many Mahout operations this would not be necessary if these external keys 
> could be maintained for vectors and dimensions, or for rows and columns of a 
> DRM.
> The reason I bring this up now is that much groundwork is being laid for 
> Mahout's future on Spark so getting this notion in early could be 
> fundamentally important and used to build on.
> If external IDs for rows and columns were maintained then RSJ, DRM Transpose 
> (and other DRM ops), vector extraction, clustering, and recommenders would 
> need no ID translation steps, a big user win.
> A partial solution might be to support external row IDs alone somewhat like 
> the NamedVector and PropertyVector in the Mahout hadoop code.
> On Apr 3, 2014, at 11:00 AM, Pat Ferrel <[email protected]> wrote:
> Perhaps this is best phrased as a feature request.
> On Apr 2, 2014, at 2:55 PM, Dmitriy Lyubimov <[email protected]> wrote:
> PS.
> sequence file keys also have special meaning if they are Ints. E.g. the A'
> physical operator requires keys to be ints, in which case it interprets
> them as row indexes that become column indexes. This of course isn't always
> the case, e.g. (Aexpr).t %*% Aexpr doesn't require int indices because in
> reality the optimizer will never choose actual transposition as a physical step
> in such pipeline. This interpretation is consistent with interpretation of
> long-existing Hadoop-side DistributedRowMatrix#transpose.
> On Wed, Apr 2, 2014 at 2:45 PM, Dmitriy Lyubimov <[email protected]> wrote:
> On Wed, Apr 2, 2014 at 1:56 PM, Pat Ferrel <[email protected]> wrote:
> On Apr 2, 2014, at 1:39 PM, Dmitriy Lyubimov <[email protected]> wrote:
> I think this duality, names and keys, is not very healthy really, and just
> creates additional hassle. Spark drm takes care of keys automatically
> throughout, but propagating names from named vectors is solely an algorithm
> concern as it stands.
> Not sure what you mean.
> Not what you think, it looks like.
> I mean that Mahout DRM structure is a bag of (key -> Vector) pairs. When
> persisted, key goes to the key of a sequence file. In particular, it means
> that there is a case of Bag[ key -> NamedVector]. Which means, external
> anchor could be saved to either key or name of a row. In practice it causes
> compatibility mess, e.g. we saw those numerous cases where e.g. seq2sparse
> saves external keys (file paths) into the key, whereas e.g. clustering
> algorithms are not seeing them because they expect them to be the name part
> of the vector. I am just saying we have two ways to name the rows, and it
> is generally not a healthy choice for the aforementioned reason.
> In my experience Names and Properties are primarily used to store
> external keys, which are quite healthy.
> Users never have data with Mahout keys, they must constantly go back and
> forth. This is exactly what the R data frame does, no? I'm not so concerned
> with being able to address an element by the external key
> drmB["pat"]["iPad'] like a HashMap. But it would sure be nice to have the
> external ids follow the data through any calculation that makes sense.
> I am with you on this.
> This would mean clustering, recommendations, transpose, RSJ would require
> no id transforming steps. This would make dealing with Mahout much easier.
> Data frames is a little bit a different thing, right now we work just with
> matrices. Although, yes, our in-core matrices support row and column names
> (just like in R) and distributed matrices support row keys only.  what i
> mean is that algebraic expression e.g.
> Aexpr %*% Bexpr will automatically propagate _keys_ from Aexpr as implied
> above, but not necessarily named vectors, because internally algorithms
> blockify things into matrix blocks, and i am far from sure that Mahout
> in-core stuff works correctly with named vectors as part of a matrix block
> in all situations. I may be wrong. I always relied on sequence file keys to
> identify data points.
> Note that sequence file keys are bigger than just a name, it is anything
> Writable. I.e. you could save a data structure there, as long as you have a
> Writable for it.
> On Apr 2, 2014 1:08 PM, "Pat Ferrel" <[email protected]> wrote:
> Are the Spark efforts supporting all Mahout Vector types? Named,
> Property
> Vectors? It occurred to me that data frames in R is a related but more
> general solution. If all rows and columns of a DRM and their
> corresponding
> Vectors (row or column vectors) were to support arbitrary properties
> attached to them in such a way that they are preserved during
> transpose,
> Vector extraction, and any other operations that make sense there
> would be
> a huge benefit for users.
> One of the constant problems with input to Mahout is translation of
> IDs.
> External to Mahout going in, Mahout to external coming out. Most of
> this
> would be unneeded if Mahout supported data frames, some would be
> avoided by
> supporting named or property vectors universally.
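
To make the NamedVector partial solution mentioned in the quoted description above concrete, here is a minimal, hedged sketch against the Mahout 0.9 math and Hadoop classes; the path, cardinality, and example IDs are made up, and the point is only that the external key can ride along as the vector's name while the Mahout int stays in the sequence file key:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

/** Sketch: the Mahout int lives in the sequence file key, the external key in the vector name. */
public class WriteNamedRow {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(args[0]);  // one part file of a DRM, path supplied by the caller

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, IntWritable.class, VectorWritable.class);
    try {
      Vector row = new RandomAccessSparseVector(1000);       // made-up cardinality
      row.setQuick(42, 1.0);                                 // made-up element
      NamedVector named = new NamedVector(row, "user-123");  // external key rides along
      writer.append(new IntWritable(0), new VectorWritable(named));
    } finally {
      writer.close();
    }
  }
}
{code}

As the thread above notes, whether that name survives blockification and is read consistently by every algorithm is exactly the open question: some code puts the external key in the sequence file key, some expects it in the vector name.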



--
This message was sent by Atlassian JIRA
(v6.2#6252)
