Have you tried murmur3 hash. We have an impl in Mahout On Jun 19, 2013 2:43 PM, "Jake Mannix" <[email protected]> wrote:
> On Wed, Jun 19, 2013 at 12:09 PM, Dan Filimon > <[email protected]>wrote: > > > I understand that it might not be worth it to have all vectors have 64bit > > indices all the time. > > > > My use case is very sparse vectors, that have IPs in them for example. > > A particular vector will likely have at most 20 IPs, but since Java > doesn't > > have unsigned, even IPs need to be hashed (and they're 32bit! :). > > > > As for actual hashes, for the data I'm working with, I see 0.4% of the > data > > being lost because of collisions. Granted, this is not much, but there's > > also not that much data... > > It may also be true that the hash function I'm using isn't the best > (Java's > > hashCode() (on an unrelated note, any suggestions for a better one?). > > > > Sean, is this structure you're using available anywhere? > > > > What I'm proposing would not change all the Vectors (at least not at > once). > > For instance, I'm thinking of a SequentialAccesSparseLongVector class. > This > > would be straightforward, as we only need an OrderedLongDoubleMapping > > (which could be templatized like other math containers) and a long size. > > It could still support the same interface as Vector but add a few more > > functions... This would be my hack. :) > > > > It would still implement Vector? So you could pass in int indexes and get > back values? And if you iterated over it, you'd iterate over > Vector.Element > objects, which have int keys... I'm not sure this would be as simple as > you > imagine it would... > > > > > > As for a more long-term solution, perhaps also templatizing Vector to > > support ints and longs and have the primitive classes generated at > compile > > time? > > > > Yeah, we could pull up all the codegen from the mahout-collections up to > Vector. That's a possibility. Integrating that with the Mapreduce code > might be a little bit complex, but maybe not. > > > > > > > > On Wed, Jun 19, 2013 at 9:22 PM, Jake Mannix <[email protected]> > > wrote: > > > > > long keys are super useful for rows in a matrix (ids for documents), > and > > > basically free in terms of memory (only one per document), but then for > > > symmetry we really do need them in the columns (keying on e.g. termId), > > > which is a not-insubstantial cost, but possibly worth it. > > > > > > Our vectors would be (16* numNonZeroEntries) bytes in footprint. > That's > > > pretty hefty, but not too much more than 12. > > > > > > There are arguments that most of the time, we don't need double values > > > either. Sometimes, we don't need values at all (boolean data), but we > > > could certainly have special-purpose Vectors which carry no values and > > yet > > > return 1d for when the key is present. > > > > > > But changing over all of our keys to long is a pretty big change. Is > it > > > worth it? > > > > > > > > > On Wed, Jun 19, 2013 at 10:25 AM, Sean Owen <[email protected]> wrote: > > > > > > > I use 64-bit keys for vector-like data structures, and indeed you may > > > > pay a cost in extra RAM, but it has a lot of benefits in simplicity > > > > mostly, and making the probability of hash collisions ignorable even > > > > at huge scale. I think it's worthwhile overall. > > > > > > > > On Wed, Jun 19, 2013 at 6:16 PM, Robin Anil <[email protected]> > > > wrote: > > > > > <rant> > > > > > Which joker thought of removing uint from Java? > > > > > </rant> > > > > > > > > > > Dan, the cost of moving to 64 bit for the index is extra RAM usage. > > My > > > > > experiments show that 32 bits is enough to hash down billions of > > > > features. > > > > > Do we ever need such Quadrillions of features? Can Machine learning > > > truly > > > > > work at that scale. Think about these. > > > > > > > > > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. > > > > > > > > > > > > > > > On Wed, Jun 19, 2013 at 5:16 AM, Dan Filimon < > > > > [email protected]>wrote: > > > > > > > > > >> Also, this is particularly problematic because indices can't be > > > > negative so > > > > >> only 2^31 elements are actually possible. > > > > >> > > > > >> > > > > >> On Wed, Jun 19, 2013 at 1:15 PM, Dan Filimon < > > > > [email protected] > > > > >> >wrote: > > > > >> > > > > >> > Hi everyone! > > > > >> > > > > > >> > The current Vector API only supports 32bit maximum indices for > > > > Vectors. > > > > >> > > > > > >> > I feel that 64bits would be more appropriate especially because > > the > > > > >> > indices are likely to be hash values of other data and 32bit > will > > > > result > > > > >> in > > > > >> > quite a few collisions. > > > > >> > > > > > >> > Also, for some jobs, notably ItemSimilarityJob, this restriction > > > means > > > > >> > that we need a special id to index map where we'll collide > anyway. > > > > >> > > > > > >> > What do you think about adding support for 64bit indices? > > > > >> > Is anyone at all interested? > > > > >> > > > > > >> > > > > > > > > > > > > > > > > -- > > > > > > -jake > > > > > > > > > -- > > -jake >
