Hi All, There might be a simpler way to make the OPP achieve even, or close to even, loads.
The small change here is that the OPP has to use thresholds to distribute keys instead of centers. Every node x has a MIN and a MAX threshold, and a key k gets inserted into node x if MIN_x < k <= MAX_x. Adjacent nodes share their thresholds, so MAX_x = MIN_(x+1) for all x = 0 to n-2. If a key k is ever inserted with k < MIN_0, we set MIN_0 = k - 1. Similarly, if a key k is ever inserted with k > MAX_(n-1), we set MAX_(n-1) = k.

With this setup the thresholds can be recalculated very easily to redistribute the data evenly. After doing some thinking I came up with two algorithms, one I call minor redistribution and the other major redistribution; the goal of either one is an equal number of keys per node. The minor redistribution does not require a full scan and can recalculate the thresholds very quickly, but it is approximate. The major redistribution may require a full key scan (or a partial one, depending on the implementation) and can calculate the node thresholds exactly to achieve equal loads. Because of that full (or partial) key scan, the major redistribution takes longer to run.

Minor redistribution
----------------------------
Step 1: Calculate the desired load per node:

    L = total number of keys in the cluster / n

Step 2: Update the max thresholds of nodes 0 to n-2 to bring every node to the average load. Let n_x be a snapshot of the number of keys in node x, and let the node's average density be D_x = n_x / (MAX_x - MIN_x). Then for each x from 0 to n-2:

    if n_x > L then
        // The node is overloaded, so we pull its max threshold back into its own range.
        new MAX_x = MIN_x + L / D_x
        n_(x+1) += n_x - L
    else
        // The node is underfilled, so we push its max threshold into the next node's
        // range; use the next node's density for a better approximation.
        new MAX_x = MAX_x + (L - n_x) / D_(x+1)
        n_(x+1) -= L - n_x

After the new thresholds are calculated, the nodes should move the data accordingly. The approximation here is the assumption that keys are evenly distributed over each node's range, which I chose because it is the simplest assumption in my view. Since the data we have is already an incomplete data set (more keys are expected in the future), assuming any particular distribution will introduce errors, so we might as well use the simplest one.

Major redistribution
----------------------------
This is actually much simpler to do. We know that every node needs L keys (as calculated for the minor redistribution). Starting from the smallest key, hop forward L keys and set the first node's max threshold there; repeating this hop gives the max threshold of every node. That hopping of L keys to find each max threshold is where we might actually need a full key scan.
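To make the bookkeeping concrete, here is a rough, standalone sketch of the lookup rule and the two redistribution passes. This is only an illustration, not Cassandra code: the class and method names (ThresholdSketch, nodeForKey, minorRedistribution, majorRedistribution) are made up for the example, keys are plain doubles, and the key counts are kept as doubles because the minor pass is approximate anyway.

import java.util.Arrays;

// Standalone sketch of the threshold bookkeeping described above. Keys are plain
// doubles and counts are kept as doubles, purely to keep the illustration short.
public class ThresholdSketch {

    // A key k belongs to node x iff min[x] < k <= max[x], with max[x] == min[x+1].
    static int nodeForKey(double k, double[] min, double[] max) {
        int n = min.length;
        if (k <= min[0]) min[0] = k - 1;        // stretch the lower boundary
        if (k > max[n - 1]) max[n - 1] = k;     // stretch the upper boundary
        for (int x = 0; x < n; x++) {
            if (k > min[x] && k <= max[x]) return x;
        }
        throw new IllegalStateException("unreachable: thresholds are contiguous");
    }

    // Minor redistribution: approximate, no key scan. count[x] is a snapshot of the
    // number of keys on node x and is adjusted as the shared thresholds move.
    static void minorRedistribution(double[] min, double[] max, double[] count) {
        int n = min.length;
        double L = Arrays.stream(count).sum() / n;   // desired load per node
        for (int x = 0; x < n - 1; x++) {
            if (count[x] > L) {
                // Overloaded: pull max[x] back, assuming keys are uniform over the node's range.
                double density = count[x] / (max[x] - min[x]);
                max[x] = min[x] + L / density;
                count[x + 1] += count[x] - L;    // the next node inherits the surplus keys
            } else {
                // Underfilled: push max[x] forward, using the next node's density.
                double nextDensity = count[x + 1] / (max[x + 1] - min[x + 1]);
                max[x] = max[x] + (L - count[x]) / nextDensity;
                count[x + 1] -= L - count[x];    // the next node gives up the shortfall
            }
            count[x] = L;
            min[x + 1] = max[x];                 // keep the shared threshold consistent
        }
    }

    // Major redistribution: exact, needs the keys in sorted order (a full or partial scan).
    // Every L-th key becomes a node's max threshold; the last node absorbs the remainder.
    static void majorRedistribution(double[] sortedKeys, double[] min, double[] max) {
        int n = min.length;
        int L = sortedKeys.length / n;           // assumes far more keys than nodes
        for (int x = 0; x < n - 1; x++) {
            max[x] = sortedKeys[(x + 1) * L - 1];
            min[x + 1] = max[x];
        }
        max[n - 1] = sortedKeys[sortedKeys.length - 1];
    }

    public static void main(String[] args) {
        // Toy run of the minor pass: 3 nodes covering (0, 300], node 0 badly overloaded.
        double[] min = {0, 100, 200};
        double[] max = {100, 200, 300};
        double[] count = {900, 60, 40};
        minorRedistribution(min, max, count);
        System.out.println("new max thresholds: " + Arrays.toString(max));
    }
}

The main() above only exercises the minor pass; the major pass would instead be fed the keys in sorted order from the scan.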
Hopefully this helps, or maybe it tickles someone else's brain to produce a nicer idea.

Best,
Mohamed Ibrahim

On Thu, Aug 26, 2010 at 12:25 AM, J. Andrew Rogers <jar.mail...@gmail.com> wrote:

> Hi Jonathan,
>
> I've never seen a paper that discusses it as a primary topic, it is always in some other context. IIRC, the most recent discussions of it I have seen have been in join algorithm literature from somewhere in Asia. MPP analytical databases often implement some form of skew adaptivity but there is no standard way because the design tradeoffs are context dependent. DB2 also has a non-adaptive technique for dealing with skew that should be simple to implement on Cassandra and might provide an 80/20 option (more on that a little further on).
>
> Skew adaptivity is generally implemented with a mix of data structures along the lines of an adaptive quad-tree. The reason you only see this in analytical databases is that the data skew is unlikely to change much and/or have too much concurrent updating. If the distribution radically changes all of the time under high concurrency, it will create some mix of resource contention, lost selectivity, or runaway space consumption depending on implementation detail. The optimal mix of pain tends to be a compile-time option, so it isn't very flexible. Definitely not optimal for concurrent OLTP-ish workloads.
>
> Alternatively:
>
> IBM's DB2 has a couple different data organization options that essentially define partitionable skew invariants. The closer the real data distribution is to the skew invariant, the more access performance is like a distributed hash table. For many data models, the data distribution skew can be roughly predicted ahead of time and for those applications it is relatively efficient. You can take this pretty far; DB2 has libraries of skew invariants based on irregular Voronoi tesselation and I believe the most recent version of SQL Server has something similar. I think they even have tools for sampling representative data for the purposes of finding a good skew invariant.
>
> It is not much but I hope that helps.
>
> Data distribution skew handling only partly mitigates data access skew issues (e.g. temporal data) but it is better than nothing.
>
> --
> J. Andrew Rogers
>
>
> On Tue, Aug 24, 2010 at 10:30 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
> > What are some good papers to read for background?
> >
> > On Tue, Aug 24, 2010 at 12:26 PM, J. Andrew Rogers <jar.mail...@gmail.com> wrote:
> >> On Mon, Aug 23, 2010 at 8:36 PM, Hien. To Trong <hie...@vng.com.vn> wrote:
> >>> OrderPreservingPartitioner is efficient range queries but can cause unevently distributed data. Does anyone has an idea of a HybridPartitioner which takes advantages of both RandomPartitioner and OPP, or at least a partitioner trade off between them.
> >>
> >> What you are looking for is skew adaptive partitioning i.e. like a B+Tree except distributable.
> >>
> >> A couple different methods for doing something like this exist, but you rarely see them and they have their own (different) tradeoffs. To the best of my knowledge, implementation requires a fairly deep architectural commitment; it is more involved than simply defining a partitioning function and the "adaptive" aspect must be distribution friendly. It is an active area of research in the literature with no obvious and simple solutions that can be lashed onto a database engine "as is".
> >>
> >> --
> >> J. Andrew Rogers
> >
> >
> > --
> > Jonathan Ellis
> > Project Chair, Apache Cassandra
> > co-founder of Riptano, the source for professional Cassandra support
> > http://riptano.com