On Wed, Apr 6, 2011 at 13:56, Andrew Hutchings <[email protected]> wrote:
> Hi Abhishek,
>
> On Wed, 2011-04-06 at 12:38 +0530, Abhishek Singh wrote:
>
> > The first blueprint (Phase 1 of the project) [1] is basically concerned
> > with choosing the hashing algorithm for mapping clients to shards.
> > Here you have proposed to use a consistent hashing algorithm similar to
> > libketama for memcached. I would like to propose a modification in this
> > regard: vbucket-style partitioning, which still hashes keys (I could
> > re-use the ketama code inside libhashkit) but adds a static mapping of
> > vbuckets to shards. This algorithm is much more robust than the
> > probabilistically good ketama algorithm. Some of the advantages are
> > (a rough sketch follows this list):
> >     * A service request is never sent to the wrong server.
> >     * Servers refuse to service commands for which they aren't
> >       responsible.
> >     * Servers do not know about each other.
> >     * Data sets can be handed from one server to another atomically.
> >     * No temporal constraints are involved.
> >     * Absolute consistency is guaranteed.
> >     * Network overhead in the normal case is nil.
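A minimal Python sketch of the static-mapping idea (the table size and its
contents here are illustrative assumptions, not from any existing code):

    # A routing table, identical on every client and server.
    NUM_VBUCKETS = 64                                   # illustrative size
    vbucket_map = [i % 4 for i in range(NUM_VBUCKETS)]  # vbucket -> one of 4 servers

    def owner(vbucket):
        return vbucket_map[vbucket]

    # Server side: refuse commands for vbuckets this server isn't responsible for.
    def accepts(server_id, vbucket):
        return owner(vbucket) == server_id

Because every party ships the same static table, a request can never be
routed to, or serviced by, the wrong server.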
>
> Sounds great to me :)
>
> > The second blueprint (Phase 2 of the project) [2] asks for "re-shuffling
> > sharded data between servers, when servers are either added or
> > removed".
> > In my opinion this phase is already handled inside the first phase when
> > we use vbuckets: in the case of vbuckets we have a replication phase in
> > which vbuckets are re-mapped to newly added servers (a sketch follows
> > below).
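A rough sketch of that re-balancing step, re-using the illustrative
vbucket_map above (the share calculation and hand-over policy are
assumptions, not existing code):

    def rebalance(vbucket_map, new_server):
        """Re-assign an even share of vbuckets to a newly added server.

        Each vbucket whose owner changes must have its data handed over
        atomically before the new map is published to clients.
        """
        num_servers = max(vbucket_map) + 2       # old servers plus the new one
        share = len(vbucket_map) // num_servers  # vbuckets the new server takes
        new_map = list(vbucket_map)
        for i in range(share):
            # Illustrative policy: take every num_servers-th vbucket.
            new_map[i * num_servers] = new_server
        return new_map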
>
> Yes, I think you are correct there, as long as this is documented in a
> general use kind of way.
>
> > The third blueprint (Phase 3 of the project) [3] asks for "support
> > groups of servers so that each shard could contain a master and multiple
> > slaves for round-robin read performance and redundancy".
> > So the work in this phase concerns failure handling. There are
> > basically three ways to do master-slave replication in the case of
> > Drizzle (a round-robin read sketch follows the list):
> > * 1:n Replication: The first strategy (1:n) refers to a master
> > servicing multiple slaves concurrently [4].
> >     * Chained Replication: The second strategy (chained) refers to a
> >       single master servicing only a single slave, with that slave in
> >       turn having a further downstream slave of its own. This offers
> >       the advantage of having a single stream of mutation events
> >       coming out of a server, while still maintaining two copies of
> >       all records. It has the disadvantage of compounding
> >       replication latency as you traverse the chain [5].
> >     * Multi-Master Replication: out-of-the-box functionality
> >       available in Drizzle, though still in beta [6].
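Whichever topology we pick, the round-robin read side of this phase could
look roughly like the following (a sketch; ShardGroup and its methods are
hypothetical names, not an existing Drizzle API):

    import itertools

    class ShardGroup:
        """One shard: a master for writes, several slaves for reads."""
        def __init__(self, master, slaves):
            self.master = master
            self._read_cycle = itertools.cycle(slaves or [master])

        def write_host(self):
            return self.master

        def read_host(self):
            # Round-robin across the slaves; fall back to the master if none.
            return next(self._read_cycle)

    group = ShardGroup("master:4427", ["slave1:4427", "slave2:4427"])
    # group.read_host() alternates between slave1 and slave2.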
> > Andrew, I would like to ask one question in this regard. How exactly
> > would we get the key related to each client? Clients aren't going to
> > manually supply a key every time they execute a sharded query. This
> > key has to be generated from the user data automatically, with the
> > client then redirected to the appropriate shard based upon that key
> > using vbuckets.
>
> I'm pretty sure the only way this can work is if the client supplies the
> key manually. I don't think this is a big ask. If the client doesn't
> know the key for something, it will have to do a lookup on another
> table to find it, instead of using secondary indexes.
>
> > With some googling around I learned some of the basic techniques
> > used in this regard. They are:
> >     * Shard by the primary key value of a table: This is the most
> >       straightforward approach and the easiest to implement. Here we
> >       calculate the MD5 hash of the primary key and then allocate a
> >       shard based upon it. This is only effective when data is
> >       reasonably well distributed.
> >     * Shard by the modulus of a key value: The modulus function
> >       effectively distributes across your shards on a "round-robin"
> >       basis, creating a very even distribution of new key values.
> >     * Maintain a master shard index: This technique involves using a
> >       single master table that maps various values to specific
> >       shards. It is very flexible, and suits a large variety of
> >       application situations.
> > I would appreciate suggestions in this regard.
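For reference, the first two techniques are almost one-liners (a sketch,
with num_shards as an assumed parameter):

    import hashlib

    def shard_by_md5(primary_key, num_shards):
        # Technique 1: hash the primary key, then pick a shard from the hash.
        return int(hashlib.md5(str(primary_key).encode()).hexdigest(), 16) % num_shards

    def shard_by_modulus(key_value, num_shards):
        # Technique 2: modulus of the raw key ("round-robin" for sequential keys).
        return key_value % num_shards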
>
> I'm not a fan of the third solution for this version (maybe after GSoC
> we can look into that). For now, keep it simple so it can be built upon
> later.
>
> I was thinking of a combination of the first two approaches when I dreamt
> it up... so a modulus of the MD5. A modulus of the key value alone cannot
> always give a great distribution, for example if your key always jumps by
> some value X. I'll explain why:
>
> Imagine a situation where you have multiple clients writing to one
> sharded table. We can't use auto-increment because we need the key
> values to be predictable. So for an auto-number you would likely either
> have:
>
> 1. Clients pre-fetch a pre-defined batch of X keys from somewhere, and no
> other client can use that set of keys (this is what MySQL Cluster does).
> 2. Having a predictable increment jump (so with 32 clients, each client
> increments by 32, with client 1 starting at offset 1, client 2 at offset
> 2, etc...).
>
> For either of these I suspect a simple modulus of the value will not
> give a perfectly even distribution, whereas a modulus of the MD5 will
> give a much better distribution (and work better on strange data types).
Yeah, I definitely agree with you on this. The modulus of the MD5 will give
a much more even distribution.
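A quick Python sketch of that effect (the client count, shard count, and key
range are illustrative assumptions):

    import hashlib
    from collections import Counter

    # Option 2 above: one client's keys step by 32 (1, 33, 65, ...).
    keys = range(1, 100000, 32)

    raw = Counter(k % 8 for k in keys)                # modulus of the raw key
    md5 = Counter(int(hashlib.md5(str(k).encode()).hexdigest(), 16) % 8
                  for k in keys)                      # modulus of the MD5

    print(raw)   # all keys land on one shard: k % 8 is constant because
                 # the step (32) is a multiple of the shard count (8)
    print(md5)   # roughly even counts across all 8 shards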
In the case of vbucket-style partitioning, the modulus of the decimal value
of the MD5 hash of the client-supplied key is taken over the number of
vbuckets.
The image below gives a proper idea of this:
[image: vbucket.png]
h(k)  = the hashing function employed (e.g. MD5, CRC32).
vb[ ] = the array of vbuckets that a client consults in order to pick the
server that will serve its queries, out of a large number of servers.
s[ ]  = the set of servers.
So the h(k) => vb[index_of_vbucket] mapping involves the following
calculation:
index_of_vbucket = hex2dec( MD5_hash( key_supplied_by_client ) ) % vb.size()
So, probabilistically, this distribution is much more even.
As in the image above, the vb[vbucket_index] => s[server_index] mapping is a
many-to-one function; a sketch of the full two-level lookup follows.
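Putting both levels together in Python (the vbucket count and server list
are illustrative assumptions):

    import hashlib

    NUM_VBUCKETS = 1024
    servers = ["s0", "s1", "s2", "s3"]

    # Many-to-one vb[] => s[] mapping: each server owns an equal share of
    # vbuckets here, but any static assignment works.
    vb = [servers[i % len(servers)] for i in range(NUM_VBUCKETS)]

    def server_for(key):
        # index_of_vbucket = hex2dec(MD5(key)) % vb.size(), as above.
        index_of_vbucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_VBUCKETS
        return vb[index_of_vbucket]

    print(server_for("user:42"))   # the same key always routes to the same server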
>
> Kind Regards
> --
> Andrew Hutchings - LinuxJedi - http://www.linuxjedi.co.uk/
>
>