Avinash already commited his new order-preserving hash function and I missed it. It's in OrderPreservingHashPartitioner. It takes the approach that Todd and I discussed back in January: turn the string into a base-Char.MAX_VALUE number. (http://groups.google.com/group/cassandra-dev/browse_thread/thread/6bda8518466210e7/f53b79c19010a9ed). I chatted with Avinash a little on IM but we didn't finish, so I'm picking it up here.
There are two problems with this approach. One is that the hashes will only be order-preserving for a subset of unicode (all of UCS-2, but not all of UTF-16; see http://en.wikipedia.org/wiki/UTF-16/UCS-2). The other is that this only gives you a naive ordering by code point value, which for unicode is not what you want and even for ascii sometimes you want another collation. (see http://www.unicode.org/reports/tr10/ and http://java.sun.com/javase/6/docs/api/java/text/Collator.html). Say for instance you have inserted keys ['a', 'b', '--a', '--b'] and you are going to do range queries on them. The correct collation-aware sort is ['a', '--a', 'b', '--b']. But ordering by char value gives ['--a', '--b', 'a', 'b']. Switching to a more flexibile system like the one I wrote for CASSANDRA-3 will let use use Token<BigInteger> for random distribution or Token<String> for order-preserving, with user-defined collation. I don't see a way to get this kind of flexibility from an approach that insists on turning everything into BigInteger. -Jonathan On Mon, Mar 30, 2009 at 2:16 PM, Jonathan Ellis <[email protected]> wrote: > Avinash, > > You mentioned that you have a new order-preserving hash function that > you think will be more generally useful. Can you post it? > > thanks, > > -Jonathan >
