Long keys are super useful for rows in a matrix (ids for documents), and
basically free in terms of memory (only one per document). For symmetry,
though, we really would need them in the columns too (keying on e.g.
termId), which is a not-insubstantial cost, but possibly worth it.

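Our vectors would be (16 * numNonZeroEntries) bytes in footprint (an 8-byte
long key plus an 8-byte double value per entry).  That's pretty hefty, but
not too much more than the 12 bytes per entry (4-byte int key plus 8-byte
double) that int keys cost.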
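
To make that arithmetic concrete, here is a rough sketch. It counts only
the key and value payload and ignores object headers and hash-table slack,
which is an assumption on my part; the class and method names are made up
for illustration and are not part of the Vector API.

```java
// Back-of-the-envelope footprint of a sparse vector, counting only the
// key and value payload (no object headers, no hash-table slack).
final class SparseFootprint {
    static final int INT_KEY_BYTES = Integer.BYTES;      // 4
    static final int LONG_KEY_BYTES = Long.BYTES;        // 8
    static final int DOUBLE_VALUE_BYTES = Double.BYTES;  // 8

    /** Payload bytes for a vector whose keys are keyBytes wide. */
    static long payloadBytes(long numNonZeroEntries, int keyBytes) {
        return numNonZeroEntries * (keyBytes + DOUBLE_VALUE_BYTES);
    }

    public static void main(String[] args) {
        long n = 1_000_000L; // hypothetical number of non-zero entries
        long withInt = payloadBytes(n, INT_KEY_BYTES);
        long withLong = payloadBytes(n, LONG_KEY_BYTES);
        System.out.println("int keys:  " + withInt + " bytes");
        System.out.println("long keys: " + withLong + " bytes");
        System.out.println("ratio:     " + (double) withLong / withInt);
    }
}
```

So under these assumptions long keys cost about a third more payload per
non-zero entry, not double.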

There are arguments that most of the time, we don't need double values
either.  Sometimes, we don't need values at all (boolean data), but we
could certainly have special-purpose Vectors which carry no values and yet
return 1d when the key is present.
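
A minimal sketch of what such a value-free vector might look like with long
keys. This is a hypothetical standalone class, not an existing Mahout
Vector implementation, and the sorted-array layout is just one possible
choice.

```java
import java.util.Arrays;

// Hypothetical value-free sparse vector: stores only the long keys that are
// present and reports 1.0 for those keys, 0.0 for everything else.
final class BooleanLongKeyVector {
    private final long[] keys; // sorted, distinct

    BooleanLongKeyVector(long... presentKeys) {
        // Deduplicate, then sort so that binary search works in get().
        this.keys = Arrays.stream(presentKeys).distinct().sorted().toArray();
    }

    /** Returns 1d if the key is present, 0d otherwise. */
    double get(long key) {
        return Arrays.binarySearch(keys, key) >= 0 ? 1d : 0d;
    }

    int numNonZeroEntries() {
        return keys.length;
    }

    /** Payload bytes: 8 per entry, since no values are stored. */
    long payloadBytes() {
        return (long) keys.length * Long.BYTES;
    }
}
```

With no values stored, a long-keyed entry costs 8 bytes of payload, which
is less than the 12 bytes of an int key plus a double.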

But changing over all of our keys to long is a pretty big change.  Is it
worth it?
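
On the collision side of that trade-off (the concern raised in the thread
quoted below), a rough way to put numbers on it is the usual birthday-bound
approximation. The sketch assumes uniformly distributed hashes and
non-negative indices, so 31 usable bits for int and 63 for long; the class
name is made up for illustration.

```java
// Birthday-bound estimate of the chance that at least two of n uniformly
// hashed keys land on the same index in a space of 2^bits indices.
final class HashCollisionEstimate {
    /** Approximates P(at least one collision) as 1 - exp(-n(n-1) / (2 * 2^bits)). */
    static double collisionProbability(double n, int bits) {
        double space = Math.pow(2.0, bits);
        double expectedCollidingPairs = n * (n - 1) / (2.0 * space);
        // -expm1(-x) computes 1 - e^(-x) without losing precision for tiny x.
        return -Math.expm1(-expectedCollidingPairs);
    }

    public static void main(String[] args) {
        double n = 1e6; // hypothetical number of distinct features
        System.out.println("31 bits: " + collisionProbability(n, 31));
        System.out.println("63 bits: " + collisionProbability(n, 63));
    }
}
```

This only estimates whether any collision occurs at all; whether a few
collisions actually hurt a given job is a separate question.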


On Wed, Jun 19, 2013 at 10:25 AM, Sean Owen <[email protected]> wrote:

> I use 64-bit keys for vector-like data structures, and indeed you may
> pay a cost in extra RAM, but it has a lot of benefits in simplicity
> mostly, and making the probability of hash collisions ignorable even
> at huge scale. I think it's worthwhile overall.
>
> On Wed, Jun 19, 2013 at 6:16 PM, Robin Anil <[email protected]> wrote:
> > <rant>
> > Which joker thought of removing uint from Java?
> > </rant>
> >
> > Dan, the cost of moving to 64 bit for the index is extra RAM usage. My
> > experiments show that 32 bits is enough to hash down billions of
> > features.
> > Do we ever need such Quadrillions of features? Can Machine learning truly
> > work at that scale. Think about these.
> >
> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >
> >
> > On Wed, Jun 19, 2013 at 5:16 AM, Dan Filimon <[email protected]> wrote:
> >
> >> Also, this is particularly problematic because indices can't be
> >> negative so only 2^31 elements are actually possible.
> >>
> >>
> >> On Wed, Jun 19, 2013 at 1:15 PM, Dan Filimon <[email protected]> wrote:
> >>
> >> > Hi everyone!
> >> >
> >> > The current Vector API only supports 32bit maximum indices for
> >> > Vectors.
> >> >
> >> > I feel that 64bits would be more appropriate especially because the
> >> > indices are likely to be hash values of other data and 32bit will
> >> > result in quite a few collisions.
> >> >
> >> > Also, for some jobs, notably ItemSimilarityJob, this restriction means
> >> > that we need a special id to index map where we'll collide anyway.
> >> >
> >> > What do you think about adding support for 64bit indices?
> >> > Is anyone at all interested?
> >> >
> >>
>



-- 

  -jake
