Have you tried murmur3 hash. We have an impl in Mahout
On Jun 19, 2013 2:43 PM, "Jake Mannix" <[email protected]> wrote:

> On Wed, Jun 19, 2013 at 12:09 PM, Dan Filimon
> <[email protected]>wrote:
>
> > I understand that it might not be worth it to have all vectors have 64bit
> > indices all the time.
> >
> > My use case is very sparse vectors, that have IPs in them for example.
> > A particular vector will likely have at most 20 IPs, but since Java
> doesn't
> > have unsigned, even IPs need to be hashed (and they're 32bit! :).
> >
> > As for actual hashes, for the data I'm working with, I see 0.4% of the
> data
> > being lost because of collisions. Granted, this is not much, but there's
> > also not that much data...
> > It may also be true that the hash function I'm using isn't the best
> (Java's
> > hashCode() (on an unrelated note, any suggestions for a better one?).
> >
> > Sean, is this structure you're using available anywhere?
> >
> > What I'm proposing would not change all the Vectors (at least not at
> once).
> > For instance, I'm thinking of a SequentialAccesSparseLongVector class.
> This
> > would be straightforward, as we only need an OrderedLongDoubleMapping
> > (which could be templatized like other math containers) and a long size.
> > It could still support the same interface as Vector but add a few more
> > functions... This would be my hack. :)
> >
>
> It would still implement Vector?  So you could pass in int indexes and get
> back values?  And if you iterated over it, you'd iterate over
> Vector.Element
> objects, which have int keys...  I'm not sure this would be as simple as
> you
> imagine it would...
>
>
> >
> > As for a more long-term solution, perhaps also templatizing Vector to
> > support ints and longs and have the primitive classes generated at
> compile
> > time?
> >
>
> Yeah, we could pull up all the codegen from the mahout-collections up to
> Vector.  That's a possibility.  Integrating that with the Mapreduce code
> might be a little bit complex, but maybe not.
>
>
> >
> >
> > On Wed, Jun 19, 2013 at 9:22 PM, Jake Mannix <[email protected]>
> > wrote:
> >
> > > long keys are super useful for rows in a matrix (ids for documents),
> and
> > > basically free in terms of memory (only one per document), but then for
> > > symmetry we really do need them in the columns (keying on e.g. termId),
> > > which is a not-insubstantial cost, but possibly worth it.
> > >
> > > Our vectors would be (16* numNonZeroEntries) bytes in footprint.
>  That's
> > > pretty hefty, but not too much more than 12.
> > >
> > > There are arguments that most of the time, we don't need double values
> > > either.  Sometimes, we don't need values at all (boolean data), but we
> > > could certainly have special-purpose Vectors which carry no values and
> > yet
> > > return 1d for when the key is present.
> > >
> > > But changing over all of our keys to long is a pretty big change.  Is
> it
> > > worth it?
> > >
> > >
> > > On Wed, Jun 19, 2013 at 10:25 AM, Sean Owen <[email protected]> wrote:
> > >
> > > > I use 64-bit keys for vector-like data structures, and indeed you may
> > > > pay a cost in extra RAM, but it has a lot of benefits in simplicity
> > > > mostly, and making the probability of hash collisions ignorable even
> > > > at huge scale. I think it's worthwhile overall.
> > > >
> > > > On Wed, Jun 19, 2013 at 6:16 PM, Robin Anil <[email protected]>
> > > wrote:
> > > > > <rant>
> > > > > Which joker thought of removing uint from Java?
> > > > > </rant>
> > > > >
> > > > > Dan, the cost of moving to 64 bit for the index is extra RAM usage.
> > My
> > > > > experiments show that 32 bits is enough to hash down billions of
> > > > features.
> > > > > Do we ever need such Quadrillions of features? Can Machine learning
> > > truly
> > > > > work at that scale. Think about these.
> > > > >
> > > > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> > > > >
> > > > >
> > > > > On Wed, Jun 19, 2013 at 5:16 AM, Dan Filimon <
> > > > [email protected]>wrote:
> > > > >
> > > > >> Also, this is particularly problematic because indices can't be
> > > > negative so
> > > > >> only 2^31 elements are actually possible.
> > > > >>
> > > > >>
> > > > >> On Wed, Jun 19, 2013 at 1:15 PM, Dan Filimon <
> > > > [email protected]
> > > > >> >wrote:
> > > > >>
> > > > >> > Hi everyone!
> > > > >> >
> > > > >> > The current Vector API only supports 32bit maximum indices for
> > > > Vectors.
> > > > >> >
> > > > >> > I feel that 64bits would be more appropriate especially because
> > the
> > > > >> > indices are likely to be hash values of other data and 32bit
> will
> > > > result
> > > > >> in
> > > > >> > quite a few collisions.
> > > > >> >
> > > > >> > Also, for some jobs, notably ItemSimilarityJob, this restriction
> > > means
> > > > >> > that we need a special id to index map where we'll collide
> anyway.
> > > > >> >
> > > > >> > What do you think about adding support for 64bit indices?
> > > > >> > Is anyone at all interested?
> > > > >> >
> > > > >>
> > > >
> > >
> > >
> > >
> > > --
> > >
> > >   -jake
> > >
> >
>
>
>
> --
>
>   -jake
>

Reply via email to