On Apr 24, 2019, at 4:45 PM, Kevin Bourrillion <kev...@google.com> wrote:
>
> On Wed, Apr 24, 2019 at 4:31 PM Stuart Marks <stuart.ma...@oracle.com> wrote:
>
>> However, I can see that some applications might want greater control over
>> the actual hash function, so that capability probably will need to be
>> provided through the API somehow.
> I would disagree with this strongly. Use cases that require a quality hash
> code really need to use a proper hashing library like (bias alert) Guava's.
> It can produce quality results because its API is designed for that.

Kevin, are you saying that people who are concerned about hash quality should
just stay away from Object.hashCode? Or are you saying that people who want
greater control over actual hash functions should use an external library like
Guava? (I'm guessing the former, but not sure here. I see the "strongly" but I
can't process it without some more clarity.)

> Object.hashCode() doesn't need to be a quality hash code! It only needs to
> be good enough for in-memory hash tables, which is about as low as one's
> standards can possibly get. One should certainly avoid easily constructable
> collisions... and then that's about all there is to worry about. And, you
> can't reasonably get much better than that anyway, because you will
> continually find yourself depending on Integer.hashCode() and
> String.hashCode() etc., which will never get better than basic.

This paragraph proves that Object.hashCode is pretty poor except for some
limited use cases. It's actually worse than you claim: hashCode was designed
when everybody ran Java on systems with less than a gigabyte of memory.
Nowadays you can build in-memory structures which are *guaranteed* to
experience collisions, because they contain more than 2^32 keys.

You point out that Integer and String have… "basic" hash codes; I would add
List to that, since it uses the same "basic" mixing function as String (the
iterated h = 31*h + x). That was appropriate in 1995 when 32-bit multiply was
slow, but now it's just an obsolete function seeking a sinecure for a
barely-working retirement.

These "basic" hash codes used to be reasonable, and now they need warning
labels. Something along the lines of, "these so-called hash codes were
reasonable in the 90's".

But the Object.hashCode API can't be deprecated or ignored, because it is
centrally located, and that makes it frankly useful, sometimes despite its
shortcomings. This is why I'm hoping to find a path to replace it with
something better, something equally useful (which includes ubiquitous) but
not so basic. Or moldy.

What excites me, in all this, is a possible convergence of a need (hashCode
is increasingly antiquated), an opportunity (fast mixing primitives for wide
values on stock hardware), some well-understood frameworks for hashing (like
Guava's funnels and mixers IIRC), and some upcoming new capabilities for
migrating old classes to new APIs (Valhalla bridges, coming soon I hope).

Out of this convergence I think we might be able to obtain some good things:

- A higher-precision hash code *in the JDK* that scales to all in-memory
  structures. (working title "longHashCode"; see the sketch below)
- A migration path that allows classes to lean on old hash codes as a crutch
  until they can be upgraded.
- Some easy ways to upgrade them (such as Liam's library, plus maybe some
  delegation stuff Brian is cooking up).
- Good upgrades for all commonly used JDK types.

And also a strong version of the randomization Stuart was talking about:

- A contract that allows the JVM to switch hashes, by modifying salts and
  even algorithms, depending on hardware available.
- (Optionally) A way to pass a dynamic salt to a hash code, for
  container-specific perturbations. But it might be OK to handle this as
  post-processing.
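To make the first two bullets a bit more concrete, here is a very rough
sketch. None of this is a real or proposed JDK API: the names (LongHashable,
mix64, the example Point classes) are invented for this mail, "longHashCode"
is just the working title from above, and the SplitMix64-style constants
stand in for whatever mixer (and salt) the JVM would really own.

    // Rough sketch only; hypothetical names, not a real JDK API.
    // A class picks up a 64-bit hash "for free" by leaning on its existing
    // 32-bit hashCode() until it overrides longHashCode() with something real.
    interface LongHashable {

        // Working title "longHashCode" from the list above.
        default long longHashCode() {
            // Crutch: widen and scramble the legacy 32-bit code.
            return mix64(this.hashCode());
        }

        // Placeholder finalizer (SplitMix64-style constants); in the real
        // thing the JVM would own this, salting or swapping it at will.
        static long mix64(long z) {
            z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
            z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
            return z ^ (z >>> 33);
        }
    }

    // A legacy class gets a usable (if unexciting) 64-bit hash immediately.
    // (equals() omitted for brevity in both examples.)
    final class LegacyPoint implements LongHashable {
        final int x, y;
        LegacyPoint(int x, int y) { this.x = x; this.y = y; }
        @Override public int hashCode() { return 31 * x + y; }  // the old 90's mix
    }

    // An upgraded class overrides longHashCode() with full 64-bit mixing and
    // derives the old 32-bit code from it, instead of the other way around.
    final class UpgradedPoint implements LongHashable {
        final int x, y;
        UpgradedPoint(int x, int y) { this.x = x; this.y = y; }
        @Override public int hashCode() { return (int) longHashCode(); }
        @Override public long longHashCode() {
            return LongHashable.mix64(((long) x << 32) ^ (y & 0xffffffffL));
        }
    }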
Ultimately it probably can't be a one-size-fits-all thing, so I think it
might be reasonable to build something like the funnel and mixer framework
into a JDK template function (when we have those) so that applications can
ask for customized stronger or weaker, wider or narrower hash functions,
maybe to cover data sets which are too large for memory and/or which need to
be reproducible across JVMs. Then the standard hash code ("longHashCode" or
the like) can be defined as a JVM-controlled instance of that framework. By
then, if we start to roll out 128-bit integers in Java (using value types),
we can roll out some larger hash codes at the same time, using the same
migration tricks.

Bonus point: If the JVM gets ownership of a pervasively used hash code, then
hardening that hash code, even at the expense of JVM performance, becomes an
allowed move. Kind of like compiling in side-channel remediations: you do it
sometimes because you need more DiD (defense in depth), but not always,
because of the performance hit.

HTH — John
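P.S. For concreteness, here is one very rough shape the funnel/mixer split
might take. Every name below is invented for this mail, and the real thing
would presumably be a specialized ("template") API with a JVM-owned default
mixer; this is only meant to show where the degrees of freedom would live.

    // Hypothetical sketch, not a real or proposed JDK API; all names invented.
    // A class only describes how to feed its state into a sink (the funnel);
    // the algorithm, width, and salt live in the mixer, which the JVM could
    // own and swap under the contract sketched above.
    interface HashSink {
        HashSink putInt(int v);
        HashSink putLong(long v);
        HashSink putString(CharSequence s);
    }

    interface Funnel<T> {
        void funnel(T value, HashSink into);
    }

    interface Mixer<R> {        // R is the hash result: Long, a 128-bit value, ...
        HashSink newSink();
        R finish(HashSink sink);
    }

    final class Hashing {
        private Hashing() {}

        // The generic entry point; with specialized generics this could be
        // flattened so the result type R is never boxed.
        static <T, R> R hash(T value, Funnel<T> funnel, Mixer<R> mixer) {
            HashSink sink = mixer.newSink();
            funnel.funnel(value, sink);
            return mixer.finish(sink);
        }
    }

    // Imagined usage (the Mixers factory below does not exist; it just shows
    // the application choosing the tradeoff): a JVM-owned salted 64-bit mixer
    // for in-memory maps, or a fixed reproducible one for cross-JVM data.
    //   long h = Hashing.hash(point, POINT_FUNNEL, Mixers.jvmDefault64());
    //   long stable = Hashing.hash(point, POINT_FUNNEL, Mixers.reproducible64());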