I think that we should go with a labeling layer for these sorts of
applications and not mess with the underlying matrix representation.
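A minimal sketch of what such a labeling layer could look like, assuming a simple dictionary from long IDs to dense int indexes (the class and method names here are hypothetical, not existing Mahout API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical labeling layer: maps arbitrary long IDs to dense int
 * indexes (and back), so the underlying vector/matrix classes can keep
 * plain int indexing untouched.
 */
public class IdIndex {
  private final Map<Long, Integer> idToIndex = new HashMap<>();
  private final List<Long> indexToId = new ArrayList<>();

  /** Returns the existing index for id, or assigns the next dense one. */
  public int indexOf(long id) {
    Integer idx = idToIndex.get(id);
    if (idx == null) {
      idx = indexToId.size();
      idToIndex.put(id, idx);
      indexToId.add(id);
    }
    return idx;
  }

  /** Inverse mapping: recovers the original long ID for an index. */
  public long idAt(int index) {
    return indexToId.get(index);
  }

  public int size() {
    return indexToId.size();
  }
}
```

Since indexes are assigned densely, this also sidesteps collisions entirely, at the cost of holding the dictionary in memory.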

On Tue, Dec 8, 2009 at 12:50 PM, Jake Mannix <jake.man...@gmail.com> wrote:

> For columns of a row-based matrix, I'm down with hashing or whatever.  For
> the rows on such matrices, inverting this is sometimes necessary (as Sean's
> case shows).  I'd hate to have an api with long row indexes and int column
> indices though, that would be unacceptable.
>
>  -jake
>
> On Tue, Dec 8, 2009 at 11:10 AM, Ted Dunning <ted.dunn...@gmail.com>
> wrote:
>
> > Systems like Vowpal Wabbit already support billions (and more) features,
> > but they do it with the hashing trick and deal with possible collisions
> > by multiple hashing.  They claim support for as many as 10^12 features.
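For illustration, the hashing trick with multiple hashing that Ted mentions could be sketched roughly like this (a toy version, not Vowpal Wabbit's actual implementation; the class name and hash function are made up for the example):

```java
import java.nio.charset.StandardCharsets;

/**
 * Toy sketch of the hashing trick with multiple hashing: each feature is
 * hashed k times into a fixed int index space, and its weight is spread
 * over the k slots, so a collision on any one hash only corrupts 1/k of
 * the signal. No long/string-to-int dictionary is ever stored.
 */
public class HashedVector {
  private final double[] weights;
  private final int numHashes;

  public HashedVector(int dimension, int numHashes) {
    this.weights = new double[dimension];
    this.numHashes = numHashes;
  }

  // Simple seeded string hash; a real system would use e.g. MurmurHash.
  private int hash(String feature, int seed) {
    int h = seed;
    for (byte b : feature.getBytes(StandardCharsets.UTF_8)) {
      h = 31 * h + (b & 0xff);
    }
    return Math.floorMod(h, weights.length);
  }

  /** Spread the value across numHashes slots. */
  public void add(String feature, double value) {
    for (int k = 0; k < numHashes; k++) {
      weights[hash(feature, k + 1)] += value / numHashes;
    }
  }

  /** Sum the feature's slots back up (approximate under collisions). */
  public double get(String feature) {
    double sum = 0.0;
    for (int k = 0; k < numHashes; k++) {
      sum += weights[hash(feature, k + 1)];
    }
    return sum;
  }
}
```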
> >
> > As long as it is possible to avoid the overhead, I would be +0.  If the
> > overhead applies to all tasks then I would be -1.
> >
> > Scalability is quite possible without this.
> >
> > On Tue, Dec 8, 2009 at 3:08 AM, Grant Ingersoll <gsing...@apache.org>
> > wrote:
> >
> > > How hard would it be to transparently support both?  Could we have one
> > > implementation for "smaller" problems and one for larger?
> > >
> > > At any rate, +1 to making this be available for really large scale.
> > >
> > > -Grant
> > >
> > > On Dec 8, 2009, at 3:16 AM, Sean Owen wrote:
> > >
> > > > I'm sure it's not hard. It makes (sparse) vectors consume that much
> > > > more memory though.
> > > >
> > > > This change would certainly help my case, but I already have a bit of
> > > > a workaround: I hash longs into ints and store the reverse mapping.
> > > > There is possibility of collision but the consequence is small in the
> > > > context of collaborative filtering.
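Sean's workaround might look something like the following sketch (hypothetical names; it folds a long into an int the same way Long.hashCode() does and keeps the reverse map, accepting that colliding IDs overwrite each other):

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of hashing longs into ints while storing the reverse mapping.
 * Distinct longs can collide on the same int; here the last writer wins,
 * which is a small consequence in a collaborative-filtering setting.
 */
public class LongToIntHash {
  private final Map<Integer, Long> reverse = new HashMap<>();

  /** Same mixing as Long.hashCode(): XOR the high and low 32 bits. */
  public int toIndex(long id) {
    int idx = (int) (id ^ (id >>> 32));
    reverse.put(idx, id);
    return idx;
  }

  /** Recovers the (last) long ID stored under this index, or null. */
  public Long toId(int index) {
    return reverse.get(index);
  }
}
```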
> > > >
> > > > I suppose if I'm the only use case that would benefit at the moment,
> > > > maybe not worth it, but if you can think of other reasons, let's
> > > > change.
> > > >
> > > > On Tue, Dec 8, 2009 at 5:48 AM, Sean Owen <sro...@gmail.com>
> > > > wrote:
> > > >> This brings up a point about our linear primitives: are 32bit
> > > >> integers big enough for our index range for vectors and matrices?
> > > >> Especially for matrices, having billions of rows is completely
> > > >> possible, even if it is on the large side.
> > > >>
> > > >> If we want to be about "scalable" machine learning, we really don't
> > > >> want to seal ourselves in to "only" 2 billion x 2 billion matrices
> > > >> in the long run, do we?
> > > >>
> > > >> How hard would it be to promote our ints to longs?
> > > >>
> > > >>  -jake
> > > >>
> > > >> On Sat, Dec 5, 2009 at 4:48 AM, Sean Owen <sro...@gmail.com> wrote:
> > > >>
> > > >>> I'm trying to use Vectors to represent a vector of user
> > > >>> preferences. All is well since items are numeric and can be used
> > > >>> as indexes into a Vector -- almost. I have longs, and of course
> > > >>> indexes are ints.
> > > >>>
> > > >>> I could fold the long IDs into ints without too much worry about
> > > >>> the effects of collision. However I still need to remember the
> > > >>> original item IDs for each index. I could do it with labels, but I
> > > >>> can't retrieve the label for an index (and the other mapping isn't
> > > >>> serialized anyway?).
> > > >>>
> > > >>> So I guess I must separately store this mapping? Just making sure
> > > >>> I'm not missing something.
> > > >>>
> > > >>
> > >
> > > --------------------------
> > > Grant Ingersoll
> > > http://www.lucidimagination.com/
> > >
> > > Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> > > using Solr/Lucene: http://www.lucidimagination.com/search
> > >
> > >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
>



-- 
Ted Dunning, CTO
DeepDyve
