On Wed, Jun 19, 2013 at 12:09 PM, Dan Filimon
<[email protected]>wrote:

> I understand that it might not be worth it to have all vectors have 64bit
> indices all the time.
>
> My use case is very sparse vectors, that have IPs in them for example.
> A particular vector will likely have at most 20 IPs, but since Java doesn't
> have unsigned, even IPs need to be hashed (and they're 32bit! :).
>
> As for actual hashes, for the data I'm working with, I see 0.4% of the data
> being lost because of collisions. Granted, this is not much, but there's
> also not that much data...
> It may also be true that the hash function I'm using isn't the best (Java's
> hashCode() (on an unrelated note, any suggestions for a better one?).
>
> Sean, is this structure you're using available anywhere?
>
> What I'm proposing would not change all the Vectors (at least not at once).
> For instance, I'm thinking of a SequentialAccesSparseLongVector class. This
> would be straightforward, as we only need an OrderedLongDoubleMapping
> (which could be templatized like other math containers) and a long size.
> It could still support the same interface as Vector but add a few more
> functions... This would be my hack. :)
>

It would still implement Vector?  So you could pass in int indexes and get
back values?  And if you iterated over it, you'd iterate over Vector.Element
objects, which have int keys...  I'm not sure this would be as simple as you
imagine it would...


>
> As for a more long-term solution, perhaps also templatizing Vector to
> support ints and longs and have the primitive classes generated at compile
> time?
>

Yeah, we could pull up all the codegen from the mahout-collections up to
Vector.  That's a possibility.  Integrating that with the Mapreduce code
might be a little bit complex, but maybe not.


>
>
> On Wed, Jun 19, 2013 at 9:22 PM, Jake Mannix <[email protected]>
> wrote:
>
> > long keys are super useful for rows in a matrix (ids for documents), and
> > basically free in terms of memory (only one per document), but then for
> > symmetry we really do need them in the columns (keying on e.g. termId),
> > which is a not-insubstantial cost, but possibly worth it.
> >
> > Our vectors would be (16* numNonZeroEntries) bytes in footprint.  That's
> > pretty hefty, but not too much more than 12.
> >
> > There are arguments that most of the time, we don't need double values
> > either.  Sometimes, we don't need values at all (boolean data), but we
> > could certainly have special-purpose Vectors which carry no values and
> yet
> > return 1d for when the key is present.
> >
> > But changing over all of our keys to long is a pretty big change.  Is it
> > worth it?
> >
> >
> > On Wed, Jun 19, 2013 at 10:25 AM, Sean Owen <[email protected]> wrote:
> >
> > > I use 64-bit keys for vector-like data structures, and indeed you may
> > > pay a cost in extra RAM, but it has a lot of benefits in simplicity
> > > mostly, and making the probability of hash collisions ignorable even
> > > at huge scale. I think it's worthwhile overall.
> > >
> > > On Wed, Jun 19, 2013 at 6:16 PM, Robin Anil <[email protected]>
> > wrote:
> > > > <rant>
> > > > Which joker thought of removing uint from Java?
> > > > </rant>
> > > >
> > > > Dan, the cost of moving to 64 bit for the index is extra RAM usage.
> My
> > > > experiments show that 32 bits is enough to hash down billions of
> > > features.
> > > > Do we ever need such Quadrillions of features? Can Machine learning
> > truly
> > > > work at that scale. Think about these.
> > > >
> > > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> > > >
> > > >
> > > > On Wed, Jun 19, 2013 at 5:16 AM, Dan Filimon <
> > > [email protected]>wrote:
> > > >
> > > >> Also, this is particularly problematic because indices can't be
> > > negative so
> > > >> only 2^31 elements are actually possible.
> > > >>
> > > >>
> > > >> On Wed, Jun 19, 2013 at 1:15 PM, Dan Filimon <
> > > [email protected]
> > > >> >wrote:
> > > >>
> > > >> > Hi everyone!
> > > >> >
> > > >> > The current Vector API only supports 32bit maximum indices for
> > > Vectors.
> > > >> >
> > > >> > I feel that 64bits would be more appropriate especially because
> the
> > > >> > indices are likely to be hash values of other data and 32bit will
> > > result
> > > >> in
> > > >> > quite a few collisions.
> > > >> >
> > > >> > Also, for some jobs, notably ItemSimilarityJob, this restriction
> > means
> > > >> > that we need a special id to index map where we'll collide anyway.
> > > >> >
> > > >> > What do you think about adding support for 64bit indices?
> > > >> > Is anyone at all interested?
> > > >> >
> > > >>
> > >
> >
> >
> >
> > --
> >
> >   -jake
> >
>



-- 

  -jake

Reply via email to