[ https://issues.apache.org/jira/browse/CASSANDRA-14267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381947#comment-16381947 ]

Kenneth Brotman edited comment on CASSANDRA-14267 at 3/1/18 4:30 PM:
---------------------------------------------------------------------

From the Dynamo paper itself:
[https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf]

See sections 4.2 through 4.9


was (Author: kenbrotman):
From the Dynamo paper itself:
[https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf]

4.2 Partitioning Algorithm
One of the key design requirements for Dynamo is that it must scale incrementally. This requires a mechanism to dynamically partition the data over the set of nodes (i.e., storage hosts) in the system. Dynamo’s partitioning scheme relies on consistent hashing to distribute the load across multiple storage hosts. In consistent hashing [10], the output range of a hash function is treated as a fixed circular space or “ring” (i.e., the largest hash value wraps around to the smallest hash value). Each node in the system is assigned a random value within this space which represents its “position” on the ring. Each data item identified by a key is assigned to a node by hashing the data item’s key to yield its position on the ring, and then walking the ring clockwise to find the first node with a position larger than the item’s position.

Thus, each node becomes responsible for the region of the ring between it and its predecessor node on the ring. The principal advantage of consistent hashing is that the departure or arrival of a node only affects its immediate neighbors; other nodes remain unaffected.
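The clockwise walk described above can be sketched in a few lines of Python. This is an illustrative toy, not Dynamo's implementation: the choice of MD5, the 8-byte truncation, and the node names are all arbitrary assumptions.

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring; node names and item keys share one hash space."""

    def __init__(self, nodes):
        # A node's "position" on the ring is just the hash of its name.
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value):
        # First 8 bytes of MD5, read as an integer position (arbitrary choice).
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

    def node_for(self, key):
        # Walk clockwise: the first node whose position is larger than the
        # key's position owns the key; wrap around at the end of the ring.
        positions = [pos for pos, _ in self.ring]
        i = bisect.bisect_right(positions, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("some-key")  # always the same node for the same key
```

Because only ring positions matter, adding or removing one node only moves the keys between that node and its predecessor; every other key assignment is unchanged.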

4.8 Membership and Failure Detection

4.8.1 Ring Membership
In Amazon’s environment node outages (due to failures and maintenance tasks) are often transient but may last for extended intervals. A node outage rarely signifies a permanent departure and therefore should not result in rebalancing of the partition assignment or repair of the unreachable replicas. Similarly, manual error could result in the unintentional startup of new Dynamo nodes. For these reasons, it was deemed appropriate to use an explicit mechanism to initiate the addition and removal of nodes from a Dynamo ring. An administrator uses a command line tool or a browser to connect to a Dynamo node and issue a membership change to join a node to a ring or remove a node from a ring. The node that serves the request writes the membership change and its time of issue to persistent store. The membership changes form a history because nodes can be removed and added back multiple times. A gossip-based protocol propagates membership changes and maintains an eventually consistent view of membership. Each node contacts a peer chosen at random every second and the two nodes efficiently reconcile their persisted membership change histories.

When a node starts for the first time, it chooses its set of tokens (virtual nodes in the consistent hash space) and maps nodes to their respective token sets. The mapping is persisted on disk and initially contains only the local node and token set. The mappings stored at different Dynamo nodes are reconciled during the same communication exchange that reconciles the membership change histories. Therefore, partitioning and placement information also propagates via the gossip-based protocol and each storage node is aware of the token ranges handled by its peers. This allows each node to forward a key’s read/write operations to the right set of nodes directly.

4.8.2 External Discovery
The mechanism described above could temporarily result in a logically partitioned Dynamo ring. For example, the administrator could contact node A to join A to the ring, then contact node B to join B to the ring. In this scenario, nodes A and B would each consider itself a member of the ring, yet neither would be immediately aware of the other. To prevent logical partitions, some Dynamo nodes play the role of seeds. Seeds are nodes that are discovered via an external mechanism and are known to all nodes. Because all nodes eventually reconcile their membership with a seed, logical partitions are highly unlikely. Seeds can be obtained either from static configuration or from a configuration service. Typically seeds are fully functional nodes in the Dynamo ring.

4.8.3 Failure Detection
Failure detection in Dynamo is used to avoid attempts to communicate with unreachable peers during get() and put() operations and when transferring partitions and hinted replicas. For the purpose of avoiding failed attempts at communication, a purely local notion of failure detection is entirely sufficient: node A may consider node B failed if node B does not respond to node A’s messages (even if B is responsive to node C's messages). In the presence of a steady rate of client requests generating internode communication in the Dynamo ring, a node A quickly discovers that a node B is unresponsive when B fails to respond to a message; node A then uses alternate nodes to service requests that map to B's partitions, and periodically retries B to check for the latter's recovery. In the absence of client requests to drive traffic between two nodes, neither node really needs to know whether the other is reachable and responsive.

Decentralized failure detection protocols use a simple gossip-style protocol that enables each node in the system to learn about the arrival (or departure) of other nodes. For detailed information on decentralized failure detectors and the parameters affecting their accuracy, the interested reader is referred to [8]. Early designs of Dynamo used a decentralized failure detector to maintain a globally consistent view of failure state. Later it was determined that the explicit node join and leave methods obviate the need for a global view of failure state. This is because nodes are notified of permanent node additions and removals by the explicit node join and leave methods, and temporary node failures are detected by the individual nodes when they fail to communicate with others (while forwarding requests).
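The persisted membership-change history and pairwise reconciliation of section 4.8.1 can be sketched as follows. This is a toy model only: the tuple layout, the in-memory stand-in for persistence, and the reconcile-by-set-union step are assumptions for illustration, not Dynamo's actual protocol.

```python
import itertools

_clock = itertools.count()  # stand-in for membership-change issue timestamps

class Member:
    """Toy node holding a persisted history of membership changes."""

    def __init__(self, name):
        self.name = name
        # (issue_time, node, "join"|"leave"); a node may appear several times
        # because it can be removed and added back.
        self.history = set()

    def issue_change(self, node, change):
        # In Dynamo, the node serving the request writes this to persistent store.
        self.history.add((next(_clock), node, change))

    def reconcile(self, peer):
        # One gossip exchange: both sides converge on the union of histories.
        merged = self.history | peer.history
        self.history = set(merged)
        peer.history = set(merged)

    def members(self):
        # Replay the changes in issue order to derive the current view.
        live = set()
        for _, node, change in sorted(self.history):
            live.add(node) if change == "join" else live.discard(node)
        return live

a, b = Member("A"), Member("B")
a.issue_change("A", "join")   # administrator joins A via node A
b.issue_change("B", "join")   # administrator joins B via node B
a.reconcile(b)                # one random-peer gossip round
# Both nodes now derive the same membership view: {"A", "B"}
```

Note that if A and B never happened to gossip with each other, each would keep believing it was alone on the ring; that is exactly the logical-partition problem that the seeds of section 4.8.2 prevent, since every node eventually reconciles with a seed.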

4.9 Adding/Removing Storage Nodes
When a new node (say X) is added into the system, it gets assigned a number of tokens that are randomly scattered on the ring. For every key range that is assigned to node X, there may be a number of nodes (less than or equal to N) that are currently in charge of handling keys that fall within its token range. Due to the allocation of key ranges to X, some existing nodes no longer have to store some of their keys, and these nodes transfer those keys to X. Let us consider a simple bootstrapping scenario where node X is added to the ring shown in Figure 2 between A and B. When X is added to the system, it is in charge of storing keys in the ranges (F, G], (G, A] and (A, X]. As a consequence, nodes B, C and D no longer have to store the keys in these respective ranges. Therefore, nodes B, C, and D will offer to, and upon confirmation from X, transfer the appropriate set of keys. When a node is removed from the system, the reallocation of keys happens in a reverse process.

Operational experience has shown that this approach distributes the load of key distribution uniformly across the storage nodes, which is important to meet the latency requirements and to ensure fast bootstrapping. Finally, by adding a confirmation round between the source and the destination, it is made sure that the destination node does not receive any duplicate transfers for a given key range.
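The transfer in section 4.9 can be visualized by diffing a key's replica set before and after the new node joins. The integer positions, node names, and N = 3 below are made-up values chosen to mirror the Figure 2 scenario; the `replicas` helper is an illustrative assumption, not Dynamo's API.

```python
import bisect

N = 3  # replication factor (assumed)

def replicas(ring, key_pos, n=N):
    """First n nodes clockwise from key_pos; ring is sorted (position, name) pairs."""
    positions = [pos for pos, _ in ring]
    i = bisect.bisect_right(positions, key_pos) % len(ring)
    return [ring[(i + k) % len(ring)][1] for k in range(n)]

# A ring loosely modeled on Figure 2: A..G, with X joining between A and B.
old_ring = [(10, "A"), (20, "B"), (30, "C"), (40, "D"),
            (50, "E"), (60, "F"), (70, "G")]
new_ring = sorted(old_ring + [(15, "X")])

# Consider a key in the range (A, X], e.g. at position 12.
before = set(replicas(old_ring, 12))   # {"B", "C", "D"}
after = set(replicas(new_ring, 12))    # {"X", "B", "C"}
gains = after - before                 # X must receive these keys...
drops = before - after                 # ...and D may stop storing them
```

The confirmation round the paper describes corresponds to X acknowledging the offer before D actually hands the range off, which is what prevents duplicate transfers of the same key range.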

> The Dynamo web page on the Apache Cassandra site is missing content
> -------------------------------------------------------------------
>
>                 Key: CASSANDRA-14267
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14267
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Documentation and Website
>            Reporter: Kenneth Brotman
>            Priority: Major
>
> Content for the following topics is needed for 
> [http://cassandra.apache.org/doc/latest/architecture/dynamo.html]
> Gossip
> Failure Detection
> Token Ring/Ranges
>  
> Please post content. I or someone else will take it from there.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
