I think that we should go with a labeling layer for these sorts of applications and not mess with the underlying matrix representation.
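To make the labeling-layer idea concrete, here is a rough sketch in Java of what I have in mind. It is only an illustration under my own assumptions: `LongLabelIndex`, `indexOf`, and `idOf` are hypothetical names, not existing Mahout classes or methods. External long IDs get dense int indices on first use, and the reverse mapping is kept alongside, so the vectors and matrices keep their int indexes and there are no hash collisions to reason about.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Mahout): maps external long IDs to dense
// int indices and back, so vector and matrix classes can keep int indexes.
final class LongLabelIndex {
  private final Map<Long, Integer> idToIndex = new HashMap<>();
  private final List<Long> indexToId = new ArrayList<>();

  // Returns the int index for an ID, assigning the next free index on first use.
  int indexOf(long id) {
    Integer existing = idToIndex.get(id);
    if (existing != null) {
      return existing;
    }
    int index = indexToId.size();
    idToIndex.put(id, index);
    indexToId.add(id);
    return index;
  }

  // Reverse lookup: recovers the original long ID for a previously assigned index.
  // Throws IndexOutOfBoundsException if the index was never assigned.
  long idOf(int index) {
    return indexToId.get(index);
  }

  // Number of distinct IDs seen so far.
  int size() {
    return indexToId.size();
  }
}
```

A caller would use `indexOf(itemId)` when writing into a Vector and `idOf(index)` when reading results back out. The mapping still has to be stored or serialized separately, and it is limited to about 2 billion distinct labels, which is the trade-off for leaving the matrix representation alone.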

On Tue, Dec 8, 2009 at 12:50 PM, Jake Mannix <jake.man...@gmail.com> wrote:

> For columns of a row-based matrix, I'm down with hashing or whatever. For
> the rows on such matrices, inverting this is sometimes necessary (as Sean's
> case shows). I'd hate to have an api with long row indexes and int column
> indices though, that would be unacceptable.
>
> -jake
>
> On Tue, Dec 8, 2009 at 11:10 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > Systems like Vowpal Wabbit already support billions (and more) features,
> > but they do it with the hashing trick and deal with possible collisions by
> > multiple hashing. They claim support for as many as 10^12 features.
> >
> > As long as it is possible to avoid the overhead, I would be +0. If the
> > overhead applies to all tasks then I would be -1.
> >
> > Scalability is quite possible without this.
> >
> > On Tue, Dec 8, 2009 at 3:08 AM, Grant Ingersoll <gsing...@apache.org> wrote:
> >
> > > How hard would it be to transparently support both? Could we have one
> > > implementation for "smaller" problems and one for larger?
> > >
> > > At any rate, +1 to making this be available for really large scale.
> > >
> > > -Grant
> > >
> > > On Dec 8, 2009, at 3:16 AM, Sean Owen wrote:
> > >
> > > > I'm sure it's not hard. It makes (sparse) vectors consume that much
> > > > more memory though.
> > > >
> > > > This change would certainly help my case, but I already have a bit of
> > > > a workaround: I hash longs into ints and store the reverse mapping.
> > > > There is possibility of collision but the consequence is small in the
> > > > context of collaborative filtering.
> > > >
> > > > I suppose if I'm the only use case that would benefit at the moment,
> > > > maybe not worth it, but if you can think of other reasons, let's
> > > > change.
> > > >
> > > > On Tue, Dec 8, 2009 at 5:48 AM, Jake Mannix <jake.man...@gmail.com> wrote:
> > > >> This brings up a point about our linear primitives: are 32bit integers
> > > >> big enough for our index range for vectors and matrices? Especially for
> > > >> matrices, having billions of rows is completely possible, even if it is
> > > >> on the large side.
> > > >>
> > > >> If we want to be about "scalable" machine learning, we really don't want
> > > >> to seal ourselves in to "only" 2 billion x 2 billion matrices in the long
> > > >> run, do we?
> > > >>
> > > >> How hard would it be to promote our ints to longs?
> > > >>
> > > >> -jake
> > > >>
> > > >> On Sat, Dec 5, 2009 at 4:48 AM, Sean Owen <sro...@gmail.com> wrote:
> > > >>
> > > >>> I'm trying to use Vectors to represent a vector of user preferences.
> > > >>> All is well since items are numeric and can be used as indexes into a
> > > >>> Vector -- almost. I have longs, and of course indexes are ints.
> > > >>>
> > > >>> I could fold the long IDs into ints without too much worry about the
> > > >>> effects of collision. However I still need to remember the original
> > > >>> item IDs for each index. I could do it with labels, but I can't
> > > >>> retrieve the label for an index (and the other mapping isn't
> > > >>> serialized anyway?).
> > > >>>
> > > >>> So I guess I must separately store this mapping? Just making sure I'm
> > > >>> not missing something.
> > >
> > > --------------------------
> > > Grant Ingersoll
> > > http://www.lucidimagination.com/
> > >
> > > Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> > > Solr/Lucene:
> > > http://www.lucidimagination.com/search
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve

--
Ted Dunning, CTO
DeepDyve