Hi Edward,

- It will be interesting to see the comparison with the rebalance
  enhancements which have gone into 3.3 (to get a sense of urgency).

- Even though the proposed scheme is not compatible with the existing
  DHT (+AFR), I can see ways in which it can be backward compatible with
  existing volumes if care is taken in the implementation details.
Avati

On Mon, Apr 16, 2012 at 4:57 PM, Edward Shishkin <[email protected]> wrote:

> Hello GlusterFS developers.
>
> We have found that the current DHT translator is suboptimal: the number
> of files being moved during re-balancing (caused by small changes in
> the set of bricks) can be significantly reduced (see Appendix).
>
> To be precise, we can improve scalability: in the current DHT the
> amount of re-balancing work scales as O(M) (M is the total number of
> files in the compound volume), whereas after changing the hashing
> technique it will scale as O(M/N) (N is the number of bricks).
>
> In the document below we first consider simple tables (section 1) and
> estimate the minimal amount of rebalance work for them. Then we extend
> them with the techniques of virtual nodes (section 2) and replication
> (section 3), and show that this doesn't worsen scalability.
>
> Unfortunately it is impossible to perform the improvements without a
> format change, so it would be a new translator, which won't understand
> layouts created by the current DHT (and vice versa).
>
> We will be happy to see any feedback on this and, if everything is
> OK, to proceed with development in this direction (with implementation
> details, etc).
>
>
> Thanks,
> Edward.
>
>
> Legend:
>
> C - namespace;
> R - 64-bit ring;
> N - number of real bricks that compose the volume;
> M - number of files in the compound volume;
> S - number of virtual components of any real brick;
> R - size of preference set (replication level);
>
>
>
> 1. Simple DH tables based on consistent hashing
>
>
> We consider a 64-bit ring R, i.e. a regular polygon with 2^64 vertices
> 0, ..., 2^64 - 1. "Ring" means that for every vertex A there can be
> found vertices B, C of R such that C < A < B. Here "<" and "<=" denote
> relations between the respective angles (for any vertex Z we consider
> the angle between the rays O0 and OZ, where O is the center of the
> polygon).
>
> Then we consider a namespace C and any mapping phi: C -> R such that
> for every sequence {c1, c2, ...} of different names {phi(c1), phi(c2),
> ...} is a pseudo-random variable which has uniform distribution on R
> (see 0*).
>
>
> Suppose we have a volume composed of N bricks with unique names
> B_1, B_2, ..., B_N. During system initialization we construct for this
> compound volume N respective unique tokens phi(B_1), ..., phi(B_N)
> in the ring R, caching them, say, in an rb-tree to preserve ordering.
>
> When creating a directory (mkdir(2)) we create respective directories
> on all bricks B_1, B_2, ..., B_N.
>
> When creating a regular file (creat(2), etc) with a short name F,
> we create a respective file on only one brick B = B(F), which is
> determined in the following way: phi(B) is the minimal token in the
> ring such that phi(F) <= phi(B). This is where the notion of a ring
> comes into play: if there is no such token in [phi(F), 2^64 - 1], then
> we continue the search from vertex 0.
>
> A lookup operation for any regular file F resolves to lookup(F) on the
> respective brick B(F).
>
> Deleting a regular file F resolves to deleting the file from the brick
> B(F).
>
> Looking for a brick (i.e. the calculation F -> B(F)) is a well-scalable
> operation: log(N) actions are required.
>
> When adding a new brick X = B_(N+1) we place a new unique token phi(X)
> on the ring R and move a subset of files from B_j to X, where B_j is
> the brick with the smallest phi(B_j) such that phi(X) < phi(B_j).
> Namely, every file W of B_j with phi(W) <= phi(X) should be moved to X.
>
> Thus, the number of files to be moved during re-balancing is not
> larger than the number of files contained in one brick (B_j in our
> case).
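
As an aside, the lookup and add-brick procedure described above can be
sketched in a few lines of Python (illustrative only, not the proposed
translator code: md5 stands in for "any well-mixing 64-bit hash", and
the brick/file names are made up):

    import bisect
    import hashlib

    def phi(name):
        # map a name to a pseudo-random point on the 64-bit ring
        return int.from_bytes(hashlib.md5(name.encode()).digest()[:8], 'big')

    class Ring:
        def __init__(self, bricks):
            # (token, brick) pairs kept sorted; the proposal keeps them
            # in an rb-tree to preserve ordering
            self.tokens = sorted((phi(b), b) for b in bricks)

        def brick_of(self, filename):
            # B(F): the brick whose token is the smallest one >= phi(F),
            # wrapping around to the first token if none is found
            i = bisect.bisect_left(self.tokens, (phi(filename), ''))
            return self.tokens[i % len(self.tokens)][1]

    files = ['file%02d' % i for i in range(100)]
    old = Ring(['brick%d' % i for i in range(9)])
    new = Ring(['brick%d' % i for i in range(10)])   # add one brick
    moved = [f for f in files if new.brick_of(f) != old.brick_of(f)]
    print(len(moved), 'files move')                  # typically ~M/N of them

Only the files that now map to the new brick change location; everything
else stays where it was, which is exactly why the rebalance cost drops
to roughly M/N.
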
> When removing a brick Y = B_k, we first find in R the "closest" brick
> B_s, which has the minimal phi(B_s) such that phi(Y) < phi(B_s), and
> move all files from Y to B_s (no scans are required).
>
>
> Such a hashing technique is called "consistent hashing associated with
> the variable phi, which has uniform distribution". This is a
> relatively new technique suggested by Karger et al (1*). The main
> advantage of this technique is that small changes in the set of bricks
> result in a small amount of rebalancing work. Namely, adding/removing
> 1 brick results in moving only M/N files (2*) (M is the total number
> of files in the compound volume). This is M/(M/N) = N times better
> than with traditional hashing, where we need to move all M files
> (median value). In particular, if we have 100 bricks, then with
> traditional hashing we'll need to move 100 times more files than with
> the consistent one.
>
> To construct a uniform distribution phi we can use any well-mixing
> 64-bit hash, say FNV, etc.
>
> Comment 1. The technique of consistent hashing is used in Amazon's
> Dynamo (4*).
>
> Comment 2. There is a disadvantage: in this approach all files
>
> /foo
> /dir1/foo
> /dir1/dir2/foo
> ...
>
> will be accumulated on the same brick. However, it is possible to
> "salt" short file names with the gfid (or another id) of the
> respective directory before hashing, to avoid possible attacks.
>
>
>
> 2. Virtual nodes
>
>
> The theory above works well for a large number of bricks N. However,
> when N is too small (2-3 bricks), a uniform distribution can result in
> bad partitioning, so one brick will accumulate many more files than
> the other ones, which is not good. A graceful solution to this problem
> is the so-called technique of "virtual nodes": for every brick we
> place S tokens on the ring, where S is the number of "virtual
> components" of a brick. So, every brick is represented by S unique
> tokens on the ring (S >= 1, S is a parameter of the cluster
> translator).
>
> In the case of S > 1 the lookup-a-brick procedure above is unchanged:
> the difference is that we search in a larger set of tokens (N*S), and
> since log(N*S) == log(N) + log(S) == log(N) + const, this search also
> scales as log(N), while with a larger number of tokens the
> distribution of files becomes more balanced (in terms of the standard
> deviation; see (3*) and the Appendix for details). In particular,
> S=100 provides a deviation of ~10% of the mean.
>
> Adding a new brick with S > 1 works as above, with the difference that
> we steal files from S different old virtual bricks. Note that 2 or
> more of those virtual bricks can represent the same real brick,
> though. So adding a real brick with S virtual components requires
> (M/N)*S scans; however, the median number of files to be moved during
> re-balancing is the same (M/N) as in the case of S == 1.
>
> Removing a brick with S > 1 virtual components mostly looks like the
> case of S == 1: no scans are required. The difference is that we
> distribute the files of the brick to be removed among S virtual bricks
> (which correspond to <= S real bricks).
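
A small extension of the previous sketch shows the effect of virtual
nodes: each real brick contributes S tokens to the same ring, lookup is
unchanged, and the spread of files across bricks tends to even out as S
grows (again an illustrative Python sketch; the names and S values are
only examples):

    import bisect, hashlib, statistics

    def phi(name):
        return int.from_bytes(hashlib.md5(name.encode()).digest()[:8], 'big')

    def build_tokens(bricks, S):
        # S virtual tokens per real brick, e.g. "exp0#0" .. "exp0#6" for S=7
        return sorted((phi('%s#%d' % (b, v)), b)
                      for b in bricks for v in range(S))

    def brick_of(tokens, filename):
        i = bisect.bisect_left(tokens, (phi(filename), ''))
        return tokens[i % len(tokens)][1]

    bricks = ['exp%d' % i for i in range(9)]
    files = ['file%02d' % i for i in range(100)]
    for S in (1, 7, 20, 100):
        tokens = build_tokens(bricks, S)
        counts = dict.fromkeys(bricks, 0)
        for f in files:
            counts[brick_of(tokens, f)] += 1
        # standard deviation of files-per-brick; tends to shrink as S grows
        print(S, round(statistics.pstdev(counts.values()), 1))

With the token set growing from N to N*S the lookup stays logarithmic,
which matches the log(N*S) = log(N) + const argument above.
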
>
>
> 3. Replication and Recovery
>
>
> To achieve high availability and durability we replicate files on
> multiple bricks. In our case replication can be implemented as a set
> of operations on the same ring R, so we don't create a separate
> translator for replication.
>
> We store every file in its so-called preference set of real bricks.
> The size R of this set is a volume option. R is also called the
> replication level (R = 1 means no replication: every file is stored
> on only a single brick).
>
> For every file F its preference set is defined as the set of "closest"
> virtual bricks B_(k_1), ..., B_(k_R), which represent pairwise
> different real bricks, so that B_(k_1) = B(F), and
> phi(B_(k_1)) < phi(B_(k_2)) < ... < phi(B_(k_R)).
>
> We don't create 2 replicas of the same file on the same real brick,
> so R shouldn't be larger than N.
>
> If we enable replication (R > 1), regular file operations become more
> complicated: every such operation is performed on all respective
> files located on all bricks of the preference set.
>
> Operations on the set of bricks also become more complicated, but
> scalability doesn't suffer. When adding a new brick X = B_(N+1), we
>
> 0) place a unique token phi(X) on the ring R.
>
> 1) find the R closest tokens B_(m_1), ..., B_(m_R), which represent
>    different real bricks in the ring, so that B_(m_R) == X,
>    and phi(B_(m_1)) < ... < phi(B_(m_R)).
>
> 2) find the R+1 closest tokens B_(p_0), ..., B_(p_R), which represent
>    different real bricks in the ring, so that B_(p_0) == X,
>    and phi(B_(p_0)) < ... < phi(B_(p_R)).
>
> 3) The new brick X steals a portion of the files of B_(p_1), as
>    described in section 1 above.
>
> 4) The brick B_(p_R) no longer belongs to the preference set of the
>    stolen files, so we need to remove all the respective replicas
>    from B_(p_R).
>
> 5) X becomes a brick belonging to the preference sets of files stored
>    on the bricks B_(m_1), ..., B_(m_(R-1)), hence we should create
>    the respective replicas on X.
>
> So adding a new brick with replication level R > 1 results in
>
> a) moving a portion of the files of one brick (step (3) above);
> b) replicating files located on R-1 bricks (step (5) above);
> c) deleting replicas of a portion of the files of one brick (step (4)).
>
> (a), (b), (c) above can be considered as re-balancing of (R+1)*(M/N) =
> const*(M/N) files (when R == 1, (b) and (c) are absent, and we need
> to re-balance M/N files, as was shown in section 1 above).
>
> Similarly we can show that with replication level R removing one
> brick also leads to re-balancing of const*(M/N) files.
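
The preference-set computation itself is a small variation of the ring
lookup: walk clockwise from phi(F) and collect tokens until R pairwise
different real bricks have been seen. A minimal sketch, with the same
illustrative phi and made-up names as in the earlier snippets:

    import bisect, hashlib

    def phi(name):
        return int.from_bytes(hashlib.md5(name.encode()).digest()[:8], 'big')

    def preference_set(tokens, filename, R):
        # tokens: sorted list of (token, real_brick), possibly several
        # virtual tokens per real brick; returns the R distinct real
        # bricks that hold the replicas of `filename`, starting with B(F)
        start = bisect.bisect_left(tokens, (phi(filename), ''))
        chosen = []
        for off in range(len(tokens)):
            brick = tokens[(start + off) % len(tokens)][1]
            if brick not in chosen:
                chosen.append(brick)
                if len(chosen) == R:
                    break
        return chosen

    bricks = ['exp%d' % i for i in range(9)]
    tokens = sorted((phi('%s#%d' % (b, v)), b)
                    for b in bricks for v in range(7))
    print(preference_set(tokens, 'file00', 3))   # three distinct real bricks

Re-running this for every file before and after a brick is added would
yield exactly the (a)/(b)/(c) work items listed above: files whose first
brick changed, files that gained X somewhere in their set, and files
whose last brick dropped out of the set.
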
>
> If in our configuration L < R bricks don't respond for some reason,
> then all regular file operations are still defined; however, our
> system is marked as "unhealthy" (in some papers this state is called
> "sloppy quorum"), non-responding bricks are marked as "missed" in the
> ring, and file operations are performed on the other available bricks
> of the preference set. In such operations files on the available
> "non-primary" bricks are marked as "not synced with the missed
> replicas".
>
> In the state of sloppy quorum, operations like adding/removing a node
> can already be undefined. For example, when adding a brick we may need
> to steal files from a brick which doesn't respond.
>
> So we need to return the system to a "consistent" state, in which all
> operations are defined. It can be done in the following ways:
>
> 1) Make sure that all missed bricks are available again and perform
>    L1-recovery. L1-recovery means syncing all marked files with the
>    again-available bricks, so that the resulting consistent system
>    will have the same number N of bricks.
>
> 2) Add new empty bricks instead of the missed ones and perform
>    L2-recovery. It means filling the new empty bricks with files from
>    other bricks, so that the resulting consistent system will have the
>    same number N of bricks.
>
> 3) Remove the missed bricks from the ring and perform L3-recovery, so
>    that the resulting consistent system will have a smaller number of
>    nodes (N-L).
>
> Comment 1. For any missed brick we can specify a different type of
> recovery.
>
> Comment 2. When R == N, replication "clogs" the distribution: in this
> case our system will work like mirrors: every brick will contain
> the same set of files.
>
>
>
> APPENDIX
>
>
> ----------------------------------------------------------------------
>
> In 3 distributed hash tables with different hashing techniques
>
> . GlusterFS DHT translator (3.2.5)
> . 64-bit ring with phi based on md5, R=1 (no replication), S=7
> . 64-bit ring with phi based on md5, R=1 (no replication), S=20
>
> we run the same scenario:
>
> 1) Create 100 files ("file00", "file01", ..., "file99") in a volume
>    composed of 9 bricks:
>
>    "host:/root/exp0",
>    "host:/root/exp1",
>    ...
>    "host:/root/exp8".
>
> 2) Add one brick "host:/root/exp9";
> 3) re-balance.
>
>
> I. GlusterFS DHT translator (3.2.5)
>
> ----------------------------------------------------------------------
>
> before re-balancing:
>
> exp0: file15 file22 file34 file35 file51 file6 file68 file78
>       file8 file81 file89 file94 file95
> exp1: file10 file28 file3 file4 file43 file66 file75
> exp2: file40 file46 file47 file48 file50 file58 file86 file9
> exp3: file12 file13 file32 file37 file54 file55 file7 file71
>       file91
> exp4: file31 file38 file41 file42 file53 file62 file63 file69
>       file93 file97
> exp5: file11 file16 file17 file24 file25 file27 file29 file44
>       file56 file73 file74 file80 file87 file90
> exp6: file0 file1 file2 file33 file36 file49 file57 file59
>       file64 file77 file79 file84 file85 file88 file98
> exp7: file21 file26 file39 file52 file61 file70 file72 file83
>       file92 file99
> exp8: file14 file20 file23 file30 file45 file5 file60 file65
>       file67 file76 file82 file96
>
> after re-balancing:
>
> exp0: file11 file16 file17 file24 file25 file31 file44 file53
>       file62 file69 file73 file80 file87 file93 file97
> exp1: file0 file27 file29 file33 file36 file56 file57 file64
>       file74 file77 file84 file88 file90 file98
> exp2: file1 file2 file39 file49 file59 file72 file79 file85
>       file92
> exp3: file21 file26 file30 file45 file52 file60 file61 file65
>       file70 file83 file99
> exp4: file14 file20 file23 file5 file67 file76 file82 file96
> exp5: file15 file22 file34 file35 file51 file6 file68 file78
>       file8 file81 file89 file94 file95
> exp6: file10 file28 file4 file43 file66 file75
> exp7: file3 file40 file47 file58 file9
> exp8: file12 file13 file32 file37 file46 file48 file50 file7
>       file71 file86
> exp9: file38 file41 file42 file54 file55 file63 file91
>
>
> As a result, 98 files have been rebalanced (139 files scanned in total).
>
>
>
> II. 64-bit ring with phi based on md5.
>     Every brick has S=7 virtual components.
>
> ----------------------------------------------------------------------
>
> before re-balancing:
>
> exp0: file02 file18 file22 file42 file48 file58 file62 file70
> exp1: file01 file08 file15 file23 file33 file51 file64 file75
>       file82 file85 file86 file87 file95
> exp2: file00 file10 file11 file14 file25 file29 file40 file45
>       file63 file81 file91 file96
> exp3: file09 file16 file19 file21 file24 file28 file32 file35
>       file36 file44 file47 file50 file52 file57 file67 file73
>       file88 file98
> exp4: file27 file49 file53 file55 file69 file97
> exp5: file05 file20 file43
> exp6: file34 file68 file72 file74 file79
> exp7: file03 file04 file26 file39 file41 file54 file60 file71
>       file77 file78 file83 file84 file89 file93 file94
> exp8: file06 file07 file12 file17 file30 file31 file37 file38
>       file46 file56 file59 file61 file65 file66 file76 file80
>       file90 file92 file99
>
> after re-balancing:
> only the following files have been moved (to exp9):
>
> exp0: file70
> exp1: file82 file85
> exp2: file00 file14 file25 file40 file45 file91 file96
> exp3: file88
>
> As a result, 11 files have been rebalanced (51 files scanned in total).
>
>
>
> III. 64-bit ring with phi based on md5.
>      Every brick has S=20 virtual components.
>
> ----------------------------------------------------------------------
>
> before re-balancing:
>
> exp0: file00 file04 file22 file30 file40 file42 file70 file96
> exp1: file06 file08 file13 file15 file17 file19 file23 file32
>       file33 file78 file81 file86 file95
> exp2: file11 file14 file16 file24 file29 file57 file63 file67
>       file73 file76
> exp3: file02 file10 file12 file18 file21 file28 file35 file51
>       file56 file59 file80 file87
> exp4: file39 file41 file49 file53 file54 file62 file69 file77
>       file83 file84 file91
> exp5: file05 file20 file31 file43 file68 file74
> exp6: file34 file37 file38 file46 file48 file58 file66 file71
>       file75 file79 file88
> exp7: file03 file07 file26 file47 file60 file72 file89 file93
>       file94
> exp8: file01 file09 file25 file27 file36 file44 file45 file50
>       file52 file55 file59 file61 file64 file65 file82 file85
>       file90 file92 file97 file98 file99
>
> after re-balancing:
> only the following files have been moved (to exp9):
>
> exp0: file70
> exp6: file88
> exp7: file07 file93
> exp8: file45 file82 file85
>
> As a result, 7 files have been rebalanced (48 files scanned in total).
>
>
> ----------------------------------------------------------------------
>
> Table with results
>
> Re-balance results and standard deviations (sd) for the current
> Gluster DHT translator (3.2.5) and for 64-bit rings with S=7 and S=20.
>
> ----------------------------------------------------------------------
>                 DHT (Gluster 3.2.5)  64-bit ring, S=7  64-bit ring, S=20
> ----------------------------------------------------------------------
> files moved              98                  11                  7
> files scanned           139                  51                 48
> sd before                 2.8                 5.4                3.7
> sd after                  3.2                 5.3                3.2
> ----------------------------------------------------------------------
>
>
>
> ----------
>
> (0*) http://en.wikipedia.org/wiki/Uniform_distribution_%28discrete%29
>
> (1*) Karger, D.; Lehman, E.; Leighton, T.; Panigrahy, R.; Levine, M.;
>      Lewin, D. (1997).
"Consistent Hashing and Random Trees: Distributed Caching > Protocols > for Relieving Hot Spots on the World Wide Web" > (2*) M/N is a median value > (3*) http://weblogs.java.net/blog/**tomwhite/archive/2007/11/** > consistent_hash.html<http://weblogs.java.net/blog/tomwhite/archive/2007/11/consistent_hash.html> > (4*) > http://www.**allthingsdistributed.com/2007/**10/amazons_dynamo.html<http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html> > > ______________________________**_________________ > Gluster-devel mailing list > [email protected] > https://lists.nongnu.org/**mailman/listinfo/gluster-devel<https://lists.nongnu.org/mailman/listinfo/gluster-devel> >
_______________________________________________
Gluster-devel mailing list
[email protected]
https://lists.nongnu.org/mailman/listinfo/gluster-devel
