Re: OPHF

Jonathan ellis Thu, 02 Apr 2009 03:41:25 -0700

Because that is the ordering the app needs. (but! Other apps willhave different needs. That is why it must be pluggable.) Yes, keysorting at sstable level needs to be consistent with what thepartioner uses. My patch does this.


-Jonathan

On Apr 1, 2009, at 10:43 PM, Avinash Lakshman <[email protected]> wrote:

So why do you want to sort with collation? Because if you sort thatway for
distribution then everything from sorting of keys on disk will have to
change. This is not something that I can see any apparent reasonfor. But I
will look at this.
Avinash
On Wed, Apr 1, 2009 at 3:34 PM, Jonathan Ellis <[email protected]>wrote:
So with a String ring you just sort the strings (conceptually) andyou
wrap around from strings[last] to strings[0].  The range is entirely
defined by what tokens you have.

It's nice to be able to test against a real application.  Mine won't
run without range queries...  (So it is still running against my
github code where I first wrote range queries, but that uses the OPHF
approach you have and so that is where I found the problems.)

-Jonathan

On Wed, Apr 1, 2009 at 4:27 PM, Avinash Lakshman
<[email protected]> wrote:
I will look into this. Another point about P2P rings is somethingabout
the
ability to reason about the system. In any hash fn we hve a notionof
range
and how the ring wraps around. For eg. in random hash such as MD5we say0-2^128 -1 and then the wrap around. Similarly for the hash fnused here
one
could define the same. Today we perform usual sort of strings andthe
hash
should respect that sort order. I guess you are saying the usualsort ofstring does not respect collation. I will look into this and Iinsist
this
should be very doable. Changes to the implementation of the coreclassesalbeit how simple they are very very scary unless they arecompletely
tested
too in a distributed setting. Over here we do not have detailed test
code,
but we test by directing a % of the site traffic to a test clusterbefore
we
sign off on anything.

Avinash

On Wed, Apr 1, 2009 at 3:00 PM, Jonathan Ellis <[email protected]>
wrote:
Say for instance you have inserted keys ['a', 'b', '--a', '--b']and
you are going to do range queries on them.  The correct
collation-aware sort is ['a', '--a', 'b', '--b'].  But ordering by
char value gives ['--a', '--b', 'a', 'b'].

Attached is a test program illustrating this.  The assert will fail
because the hash ordering is not the same as the collator's.

-Jonathan

On Wed, Apr 1, 2009 at 3:47 PM, Avinash Lakshman
<[email protected]> wrote:
In fact what would you want is hash to respect oorder based onhow thestrings are sorted. For the example you have it does. So am Imissing
something?

Avinash

On Wed, Apr 1, 2009 at 2:43 PM, Jonathan Ellis <[email protected]>
wrote:
For that aspect no difference between a String ring based on
compareTo
and a BigInteger one. The only difference (and it is animportant
one
for the reasons I gave!) is how the compare works. But for thep2p
aspect it does not matter.

On Wed, Apr 1, 2009 at 3:40 PM, Avinash Lakshman
<[email protected]> wrote:
Doing what you are suggesting scares the hell out of me for a
couple
of
reasons -  All work in P2P be it random/OPHF does the token
handling
the
way
it is setup. I cannot try something that has not been wellexplored
in
academia. I insist this must be doable. I am going to thinkabout
this
more.
Avinash
On Wed, Apr 1, 2009 at 2:31 PM, Jonathan Ellis <[email protected]>
wrote:
Avinash already commited his new order-preserving hash function
and I
missed it. It's in OrderPreservingHashPartitioner. It takesthe
approach that Todd and I discussed back in January: turn the
string
into a base-Char.MAX_VALUE number.
(
http://groups.google.com/group/cassandra-dev/browse_thread/thread/6bda8518466210e7/f53b79c19010a9ed
).
I chatted with Avinash a little on IM but we didn't finish, so
I'm
picking it up here.
There are two problems with this approach. One is that thehashes
will only be order-preserving for a subset of unicode (all of
UCS-2,
but not all of UTF-16; see
http://en.wikipedia.org/wiki/UTF-16/UCS-2
).
The other is that this only gives you a naive ordering by code
point
value, which for unicode is not what you want and even forascii
sometimes you want another collation.  (see
http://www.unicode.org/reports/tr10/ and
http://java.sun.com/javase/6/docs/api/java/text/Collator.html).
Say for instance you have inserted keys ['a', 'b', '--a', '--b']
and
you are going to do range queries on them.  The correct
collation-aware sort is ['a', '--a', 'b', '--b']. Butordering by
char value gives ['--a', '--b', 'a', 'b'].

Switching to a more flexibile system like the one I wrote for
CASSANDRA-3 will let use use Token<BigInteger> for random
distribution
or Token<String> for order-preserving, with user-defined
collation.
I
don't see a way to get this kind of flexibility from anapproach
that
insists on turning everything into BigInteger.

-Jonathan

On Mon, Mar 30, 2009 at 2:16 PM, Jonathan Ellis <
[email protected]>
wrote:
Avinash,
You mentioned that you have a new order-preserving hashfunction
that
you think will be more generally useful.  Can you post it?

thanks,

-Jonathan

Re: OPHF

Reply via email to