Say for instance you have inserted keys ['a', 'b', '--a', '--b'] and
you are going to do range queries on them.  The correct
collation-aware sort is ['a', '--a', 'b', '--b'].  But ordering by
char value gives ['--a', '--b', 'a', 'b'].

Attached is a test program illustrating this.  The assert will fail
because the hash ordering is not the same as the collator's.

-Jonathan

On Wed, Apr 1, 2009 at 3:47 PM, Avinash Lakshman
<[email protected]> wrote:
> In fact what would you want is hash to respect oorder based on how the
> strings are sorted. For the example you have it does. So am I missing
> something?
>
> Avinash
>
> On Wed, Apr 1, 2009 at 2:43 PM, Jonathan Ellis <[email protected]> wrote:
>
>> For that aspect no difference between a String ring based on compareTo
>> and a BigInteger one.  The only difference (and it is an important one
>> for the reasons I gave!) is how the compare works.  But for the p2p
>> aspect it does not matter.
>>
>> On Wed, Apr 1, 2009 at 3:40 PM, Avinash Lakshman
>> <[email protected]> wrote:
>> > Doing what you are suggesting scares the hell out of me for a couple of
>> > reasons -  All work in P2P be it random/OPHF does the token handling the
>> way
>> > it is setup. I cannot try something that has not been well explored in
>> > academia. I insist this must be doable. I am going to think about this
>> more.
>> >
>> > Avinash
>> >
>> > On Wed, Apr 1, 2009 at 2:31 PM, Jonathan Ellis <[email protected]>
>> wrote:
>> >
>> >> Avinash already commited his new order-preserving hash function and I
>> >> missed it.  It's in OrderPreservingHashPartitioner.  It takes the
>> >> approach that Todd and I discussed back in January: turn the string
>> >> into a base-Char.MAX_VALUE number.
>> >> (
>> >>
>> http://groups.google.com/group/cassandra-dev/browse_thread/thread/6bda8518466210e7/f53b79c19010a9ed
>> >> ).
>> >>  I chatted with Avinash a little on IM but we didn't finish, so I'm
>> >> picking it up here.
>> >>
>> >> There are two problems with this approach.  One is that the hashes
>> >> will only be order-preserving for a subset of unicode (all of UCS-2,
>> >> but not all of UTF-16; see http://en.wikipedia.org/wiki/UTF-16/UCS-2).
>> >>  The other is that this only gives you a naive ordering by code point
>> >> value, which for unicode is not what you want and even for ascii
>> >> sometimes you want another collation.  (see
>> >> http://www.unicode.org/reports/tr10/ and
>> >> http://java.sun.com/javase/6/docs/api/java/text/Collator.html).
>> >>
>> >> Say for instance you have inserted keys ['a', 'b', '--a', '--b'] and
>> >> you are going to do range queries on them.  The correct
>> >> collation-aware sort is ['a', '--a', 'b', '--b'].  But ordering by
>> >> char value gives ['--a', '--b', 'a', 'b'].
>> >>
>> >> Switching to a more flexibile system like the one I wrote for
>> >> CASSANDRA-3 will let use use Token<BigInteger> for random distribution
>> >> or Token<String> for order-preserving, with user-defined collation.  I
>> >> don't see a way to get this kind of flexibility from an approach that
>> >> insists on turning everything into BigInteger.
>> >>
>> >> -Jonathan
>> >>
>> >> On Mon, Mar 30, 2009 at 2:16 PM, Jonathan Ellis <[email protected]>
>> wrote:
>> >> > Avinash,
>> >> >
>> >> > You mentioned that you have a new order-preserving hash function that
>> >> > you think will be more generally useful.  Can you post it?
>> >> >
>> >> > thanks,
>> >> >
>> >> > -Jonathan
>> >> >
>> >>
>> >
>>
>
package org.apache.cassandra.service;

import java.math.BigInteger;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Locale;
import java.text.Collator;

import org.testng.annotations.Test;

public class OrderPreservingHashPartitionerTest
{
    @Test
    public void testHash()
    {
        OrderPreservingHashPartitioner p = new OrderPreservingHashPartitioner();
        String[] keys = {"a", "b", "--a", "--b"};
        HashMap<BigInteger, String> hashed = new HashMap<BigInteger, String>();
        for (String key : keys) {
            hashed.put(p.hash(key), key);
        }

        ArrayList<BigInteger> sortedHashes = new ArrayList<BigInteger>(hashed.keySet());
        BigInteger last = null;
        Collator c = Collator.getInstance(new Locale("en_US"));
        for (BigInteger hash : sortedHashes) {
            if (last != null) {
                assert c.compare(hashed.get(last), hashed.get(hash)) <= 0;
            }
            last = hash;
        }
    }
}

Reply via email to