The lists are rejecting the email because of the big attachments. Send with links instead?
On Tue, 9 Sep 2014, Zhang, Jian wrote:

> Yujie sent out the following email yesterday, but it seems it was missed.
> Resending it.
>
> =============
> Hi all,
>   Several months ago we ran into a read performance issue (17%
> degradation) while evaluating Ceph object storage performance with 10M
> objects (scaling from 10K objects to 1 million objects), and found the
> root cause to be unbalanced pg distribution among the osd disks, which
> leads to unbalanced data distribution. We investigated further and
> identified that CRUSH fails to map pgs evenly to each osd. Please refer
> to the attached pdf (crush_proposals) for details.
>
> Key Message:
>   As described in the attached pdf, we outlined possible optimization
> proposals for CRUSH and got some feedback from the community
> (http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/18979).
> Sage suggested we take the idea of "Change placement strategy only for
> the step of selecting devices from hosts" by adding a new bucket type
> called "linear" and applying a modulo-like hash function to this kind of
> bucket to achieve a balanced distribution. We followed this suggestion
> and designed an optimized CRUSH algorithm, with new hash methods and an
> adaptive module; please refer to the Design and Implementation part for
> details. We also wrote a POC for it (see the attached patch). As a
> result, we got more than 10% read performance improvement with the
> optimized CRUSH algorithm.
>
> Design and Implementation:
> 1. Problem Identification
> 1.1 Input key (pps) space of CRUSH is not uniform
>     Since the PG# on the nested devices of a host is not uniform even if
> we select the device using a simple modulo operation, we decided to
> change the algorithm that hashes a raw pg to a pps.
> 1.2 Algorithm of selecting items from buckets is not uniform
>     Once the input key space is uniform, the procedure of selecting
> devices from a host should be uniform as well. Since the current CRUSH
> algorithm uses Jenkins-hash-based strategies and fails to reach that
> goal, we decided to add a new bucket type and apply a new (modulo-based)
> hash algorithm to it.
> 2. Design
> 2.1 New pps hash algorithm
>     We designed the new pps hash algorithm based on the "Congruential
> pseudo-random number generator"
> (http://comjnl.oxfordjournals.org/content/10/1/74.full.pdf). It defines
> a bijection between the original sequence {0, ..., 2^N-1} and some
> permutation of it; in other words, given different keys between 0 and
> 2^N-1, the generator produces different integers within the same range
> {0, ..., 2^N-1}.
>     Assume there are np PGs in a pool. We can regard pgid (0 <= pgid <
> 2^n, np <= 2^n < 2*np) as the key, which is then hashed into a pps value
> between 0 and 2^n-1. Since the PG# in a pool is usually a power of two,
> in that case the generator simply shuffles the original pgid sequence,
> so the key space is a permutation of {0, ..., 2^n-1}, which achieves the
> best uniformity. Moreover, poolid can be used as a seed for the
> generator, producing different pps values for the same pgid under
> different poolids. Therefore, the pgid sequences of different pools are
> mapped to distinct pps sequences, getting rid of PG overlap.
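>     A minimal sketch of such a generator in C (illustrative only: the
> constants and the helper name are placeholders for this sketch, not
> necessarily what the attached patch uses):
>
>     #include <stdint.h>
>
>     /*
>      * Affine map pps = (a*pgid + c) mod 2^n. With an odd multiplier a
>      * this is a bijection on {0, ..., 2^n - 1}, so distinct pgids give
>      * distinct pps values in the same range. poolid perturbs the (odd)
>      * increment c, so each pool gets a different permutation.
>      */
>     static uint32_t pps_hash(uint32_t pgid, uint32_t poolid, unsigned n)
>     {
>         const uint32_t mask = (1u << n) - 1;            /* np <= 2^n < 2*np */
>         const uint32_t a = 2654435761u;                 /* odd multiplier */
>         const uint32_t c = (poolid * 2246822519u) | 1u; /* odd, per pool */
>
>         return (a * pgid + c) & mask;
>     }
>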
> 2.2 New bucket type, Linear
>     We introduce a new bucket type called "linear" and apply a new
> modulo-based hash algorithm to it. The pps values assigned to a given
> host are a pseudo-random subset of the original permutation and may
> deviate from uniformity, in which case applying the modulo operation
> directly to the integers in that subset does not produce a balanced
> distribution among the disks in the host. To decrease the deviation of
> the subset, we scale the key by a balance parameter 1/balance_param
> before applying the modulo operation.
>     For osd failure and recovery, this bucket type assumes that nested
> items will not be removed and no new items will be added, the same as
> the UNIFORM bucket. A linear host will therefore not introduce more data
> movement than a uniform bucket.
> 2.3 Adaptive Strategy
>     There is no single constant balance parameter that yields the best
> PG distribution in all cases, so we adjust the balance parameter
> adaptively during the preparation for creating a new pool, according to
> the cluster topology, PG# and replica#, in order to obtain the most
> uniform distribution:
>   1) Try different balance_params when preparing a new pool
>     - Iteratively call CRUSH(map, ruleno, x, balance_param) to get the
>       corresponding PG distribution for each balance_param
>     - Calculate the stdev of PG# among all osds
>     - Choose the balance_param with the minimal stdev
>   2) Add a member variable to the pool struct pg_pool_t to save the best
>      balance_param value
>   The adaptive procedure can be described as follows:
>
>   Input: cluster map, total PG number m, adaptive retry times n
>   Output: locally optimal balance parameter balance_param
>
>   min_pg_stdev = MAX;
>   balance_param = a; // initial value
>   for trial from 0 to n {
>       for pgid from 0 to m {
>           calculate pps using the new generator in 2.1;
>           for bucket b in cluster map // apply CRUSH algorithm
>               apply b's bucket hashing algorithm and get an osd list for pgid
>       }
>       calculate pg_stdev_a by gathering all osd lists; // stdev of PG
>                                                        // distribution among all osds
>       if pg_stdev_a < min_pg_stdev {
>           min_pg_stdev = pg_stdev_a;
>           balance_param = a;
>       }
>       adjust a to a new value;
>   }
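>     A compact C sketch of the linear choose step and the adaptive search
> (illustrative only: map_pg_to_osds is a hypothetical stand-in for a full
> CRUSH mapping pass, pps_hash is the sketch from 2.1, and the "adjust a"
> step is simplified to a++; see the attached patch for the real code):
>
>     #include <math.h>
>     #include <stdint.h>
>     #include <stdlib.h>
>     #include <string.h>
>
>     /* Linear bucket: scale the key by 1/balance_param, then modulo. */
>     static int linear_bucket_choose(uint32_t pps, uint32_t num_items,
>                                     uint32_t balance_param)
>     {
>         return (int)((pps / balance_param) % num_items);
>     }
>
>     /* Hypothetical stand-in for one CRUSH pass: maps a pg (via its pps)
>      * to its osds under balance_param -- calling linear_bucket_choose
>      * at the host level -- and increments those osds' counters. */
>     extern void map_pg_to_osds(uint32_t pps, uint32_t balance_param,
>                                int *pgs_per_osd);
>
>     static double pg_stdev(const int *pgs_per_osd, int num_osds)
>     {
>         double mean = 0.0, var = 0.0;
>         int i;
>
>         for (i = 0; i < num_osds; i++)
>             mean += pgs_per_osd[i];
>         mean /= num_osds;
>         for (i = 0; i < num_osds; i++)
>             var += (pgs_per_osd[i] - mean) * (pgs_per_osd[i] - mean);
>         return sqrt(var / num_osds);
>     }
>
>     /* Try `tries` candidates; keep the one with the minimal stdev. */
>     uint32_t choose_balance_param(uint32_t num_pgs, uint32_t poolid,
>                                   unsigned n_bits, int num_osds, int tries)
>     {
>         double min_pg_stdev = INFINITY;
>         uint32_t best = 1, a = 1;   /* initial candidate; must be > 0 */
>         int *pgs_per_osd = calloc(num_osds, sizeof(*pgs_per_osd));
>         int t;
>
>         for (t = 0; t < tries; t++, a++) {
>             uint32_t pgid;
>             double s;
>
>             memset(pgs_per_osd, 0, num_osds * sizeof(*pgs_per_osd));
>             for (pgid = 0; pgid < num_pgs; pgid++)
>                 map_pg_to_osds(pps_hash(pgid, poolid, n_bits), a,
>                                pgs_per_osd);
>             s = pg_stdev(pgs_per_osd, num_osds);
>             if (s < min_pg_stdev) {
>                 min_pg_stdev = s;
>                 best = a;
>             }
>         }
>         free(pgs_per_osd);
>         return best;
>     }
>
>   The best value found is then saved in pg_pool_t (step 2 above) so that
> later mappings for the pool reuse it.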
>
> Evaluation:
>   We evaluated the performance of the optimized and the current CRUSH
> algorithms in a cluster of 4 hosts, each with 10x1T disks attached. We
> designed two test cases: the first creates a pool with 2048 pgs and 2
> replicas, prepares 100 million 128KB objects, and then evaluates read
> performance on these objects; the second creates a pool with 2048 pgs
> and 3 replicas, prepares 1 million 10MB objects, and then evaluates
> read performance on these objects.
>   We compared the PG & data distribution and the read performance of
> the two CRUSH algorithms, with the following results:
> 1. PG and data distribution is more balanced with the optimized CRUSH
>    algorithm
>   a) For 2048 PGs with 3 replicas, the stdev of PG# across the 40 osds
>      decreases from 12.13 to 5.17; for 2048 PGs with 2 replicas, it
>      decreases from 10.09 to 6.50
>   b) For 1 million 10MB objects with 3 replicas, the stdev of disk use%
>      across the 40 osds decreases from 0.068 to 0.026; for 100 million
>      128KB objects with 2 replicas, it decreases from 0.067 to 0.042
> 2. Performance at large scale improves because data distribution is
>    more balanced
>   a) More than 10% read performance improvement for both 128K and 10M
>      objects
>   b) Write performance is not impacted
>   Detailed performance data can be found in the attached pdf
> (crush_optimization).
>
> We also created a pull request: https://github.com/ceph/ceph/pull/2402
>
> Thanks
> Jian
>
>
> -----Original Message-----
> From: Loic Dachary [mailto:[email protected]]
> Sent: Tuesday, September 09, 2014 9:36 PM
> To: Zhang, Jian; [email protected]
> Subject: Re: FW: CRUSH optimization for unbalanced pg distribution
>
> On 20/03/2014 04:54, Zhang, Jian wrote:
> > Forwarding per Sage's suggestion.
>
> Very interesting discussion :-) For the record the corresponding pull
> request is https://github.com/ceph/ceph/pull/2402
>
> > -----Original Message-----
> > From: Sage Weil [mailto:[email protected]]
> > Sent: Wednesday, March 19, 2014 11:29 PM
> > To: Mark Nelson
> > Cc: Zhang, Jian; Duan, Jiangang; He, Yujie
> > Subject: Re: CRUSH optimization for unbalanced pg distribution
> >
> > On Wed, 19 Mar 2014, Mark Nelson wrote:
> >> On 03/19/2014 03:24 AM, Zhang, Jian wrote:
> >>> For more detailed data, please refer to the *Testing results* part.
> >>>
> >>> *Optimization proposals:*
> >>>
> >>> After we dived into the source code of CRUSH and the related papers,
> >>> we proposed two possible optimizations:
> >>>
> >>> 1. Add different hash algorithms as an alternative to the Jenkins
> >>> hash, e.g. an algorithm that produces evenly distributed values when
> >>> the range of the input values (pg#) is relatively small, and add a
> >>> new bucket type at the same time if necessary.
> >
> > This *might* work, but I don't have a strong intuition about it. The
> > modeling we've done so far has essentially assumed a statistically
> > uniform distribution, which has some inherent imbalance for low
> > values of n (num pgs in our case). I have generally assumed we can't
> > do better than "random" and still keep the other properties we want
> > (independent, deterministic placement), but it may be possible.
> >
> >>> 2. Find a better replica placement strategy to replace the current
> >>> retry logic of crush_choose_firstn, which may cause CRUSH to behave
> >>> badly.
> >>>
> >>> Reading the code, we found there are several retry-count thresholds:
> >>> choose_total_tries, choose_local_tries and
> >>> choose_local_fallback_tries. They are used to decide whether to do a
> >>> retry_bucket, a retry_descent, or an exhaustive bucket search via a
> >>> permutation. We are wondering if there is another retry strategy:
> >>>
> >>> a) Backtracking retry. The logic of crush_choose_firstn can only
> >>> issue a retry either from the initial bucket (retry_descent) or from
> >>> the current bucket (retry_bucket); how about retrying the
> >>> intervening buckets?
> >>>
> >>> b) Adjust the retry thresholds to other values. We did notice that
> >>> the 'optimal' crush tunables could be used for this, but we still
> >>> encounter unbalanced pg distribution when using the optimal
> >>> strategy. Please refer to item 4 of the Testing results part.
> >>>
> >>> c) Add a mechanism that adjusts the above thresholds adaptively.
> >>> Maybe we can record the retry counts of previous calls to CRUSH and
> >>> adjust the retry thresholds automatically according to that record.
> >
> > I suggest ignoring all of this retry logic.
> > The original version of CRUSH had the local retries to try to make
> > data move "less far", but when we went back a year ago and did a
> > statistical analysis of the distribution, we found that *all* of
> > these hacks degraded the quality of the placement, and by turning
> > them all off (setting the 'optimal' values, which zero them all out
> > except for total_retries) we got something that was indistinguishable
> > from a uniform distribution.
> >
> >>> 3. Add soft links for pg directories. During pg creation, we can
> >>> create soft links for the pgs if the pg# on the selected osd is more
> >>> than some threshold, say 10% above the desired average, to move
> >>> objects that would be stored in this pg to another osd. Balanced
> >>> disk utilization may be gained this way.
> >
> > I think you need to be careful, but yes, this is an option. There is
> > a similar exception mechanism in place that is used for other
> > purposes, and something similar could be done here. The main
> > challenge will be ensuring that the soft links/exceptions follow the
> > same overall policy that CRUSH does after the raw mapping is
> > performed. This is an option, but I would put it toward the bottom of
> > the list...
> >
> >>> 4. Change the placement strategy only for the step of selecting
> >>> devices from hosts. We found in our testing results that pg
> >>> distribution was balanced among hosts, which is reasonable since the
> >>> pg# of each host is above 1K (according to the current BKM that the
> >>> pg# per osd should be about 100). So how about we apply CRUSH only
> >>> on the internal buckets and find another simple but more balanced
> >>> method to choose the osd from the host?
> >
> > This idea has a lot of potential. For example:
> >
> > If you know the chassis can hold 12 disks, you can force the bucket
> > size to twelve and somehow prevent users from adjusting the structure
> > of the tree. Then you can use a simple mapping that is truly flat
> > (like a linear mapping, disk = x % num_disks) for that bucket/subtree.
> > The downside of course is that if you remove a disk *everything*
> > reshuffles, hence some sort of guardrails to prevent a user from
> > inadvertently doing that. If a disk *does* fail, you just need to
> > make sure the disk is marked "out" but not removed from the CRUSH
> > hierarchy, and the normal retry will kick in.
> >
> > Note that all this is really doing is increasing the size of the
> > "buckets" that we are (pseudo)randomly distributing over. It is still
> > a random/uniform distribution, but the N value is 12 times bigger
> > (for a 12 disk chassis) and as a result the variance is substantially
> > lower. (Roughly: with ~100 PGs per osd, uniform random placement
> > gives a per-osd stdev of about sqrt(100) = 10, i.e. 10%; per 12-disk
> > chassis it is sqrt(1200) ~= 35 out of 1200, i.e. under 3%, and a flat
> > modulo split within the chassis preserves that.)
> >
> > I would suggest making a new bucket type that is called 'linear' and
> > does a simple modulo, and trying this out. We will need a bunch of
> > additional safety checks to help users avoid doing silly things (like
> > adjusting the number of items in the linear buckets, which reshuffles
> > everything), but that wouldn't be needed for an initial analysis of
> > the performance impact.
> >
> > Do you mind if we shift this thread over to ceph-devel? I think there
> > are lots of people who would be interested in this discussion. We can
> > of course leave off your attachment if you prefer.
> >
> > Thanks!
> > sage
>
> --
> Loïc Dachary, Artisan Logiciel Libre
