[
https://issues.apache.org/jira/browse/CASSANDRA-14267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381947#comment-16381947
]
Kenneth Brotman edited comment on CASSANDRA-14267 at 3/1/18 1:15 PM:
---------------------------------------------------------------------
From the Dynamo paper itself:
>[https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf]
4.2 Partitioning Algorithm

One of the key design requirements for Dynamo is
that it must scale incrementally. This requires a mechanism to dynamically
partition the data over the set of nodes (i.e., storage hosts) in the system.
Dynamo’s partitioning scheme relies on consistent hashing to distribute the
load across multiple storage hosts. In consistent hashing [10], the output
range of a hash function is treated as a fixed circular space or “ring” (i.e.
the largest hash value wraps around to the smallest hash value). Each node in
the system is assigned a random value within this space which represents its
“position” on the ring. Each data item identified by a key is assigned to a
node by hashing the data item’s key to yield its position on the ring, and then
walking the ring clockwise to find the first node with a position larger than
the item’s position.
Thus, each node becomes responsible for the region in the ring between it and
its predecessor node on the ring. The principal advantage of consistent hashing
is that departure or arrival of a node only affects its immediate neighbors and
other nodes remain unaffected.
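The ring lookup described above can be sketched in Python. This is a toy illustration, not Dynamo's or Cassandra's actual implementation; the hash function, token counts, and class names are arbitrary choices, and it includes the virtual-node ("token") idea discussed later in section 4.8.1:

```python
import bisect
import hashlib

def ring_position(key: str) -> int:
    # Hash a key onto a fixed circular space (here, the MD5 output range).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, tokens_per_node: int = 8):
        self.tokens_per_node = tokens_per_node  # virtual nodes per physical node
        self.ring = []  # sorted list of (token, node) pairs

    def add_node(self, node: str) -> None:
        # Each node picks several positions ("tokens") on the ring.
        for i in range(self.tokens_per_node):
            token = ring_position(f"{node}:{i}")
            bisect.insort(self.ring, (token, node))

    def coordinator(self, key: str) -> str:
        # Walk the ring clockwise to the first node whose token is strictly
        # larger than the key's position; wrap past the largest token.
        pos = ring_position(key)
        idx = bisect.bisect(self.ring, (pos, chr(0x10FFFF)))
        return self.ring[idx % len(self.ring)][1]

ring = ConsistentHashRing()
for n in ("A", "B", "C"):
    ring.add_node(n)
owner = ring.coordinator("user:42")  # one of "A", "B", "C"
```

Note how the incremental-scaling property falls out of the structure: adding a node inserts new tokens into the sorted ring, which only moves keys that fall in the ranges immediately preceding those tokens; all other assignments are untouched.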
4.8 Membership and Failure Detection

4.8.1 Ring Membership

In Amazon's environment, node outages (due to failures and maintenance tasks) are often
environment node outages (due to failures and maintenance tasks) are often
transient but may last for extended intervals. A node outage rarely signifies
a permanent departure and therefore should not result in rebalancing of the
partition assignment or repair of the unreachable replicas. Similarly, manual
error could result in the unintentional startup of new Dynamo nodes. For
these reasons, it was deemed appropriate to use an explicit mechanism to
initiate the addition and removal of nodes from a Dynamo ring. An administrator
uses a command line tool or a browser to connect to a Dynamo node and issue a
membership change to join a node to a ring or remove a node from a ring. The
node that serves the request writes the membership change and its time of issue
to persistent store. The membership changes form a history because nodes can be
removed and added back multiple times. A gossip-based protocol propagates
membership changes and maintains an eventually consistent view of membership.
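That anti-entropy exchange can be sketched minimally in Python. This is a hypothetical illustration: real implementations ship a compact digest rather than full histories, but the union of persisted change logs is the essential idea:

```python
import random

# Each membership change is (issue_time, node, "join" | "leave");
# keeping the full history lets a node be removed and re-added repeatedly.
def reconcile(history_a: set, history_b: set) -> set:
    # Union of both persisted histories; both peers adopt the result.
    return history_a | history_b

def gossip_round(histories: dict) -> None:
    # Every node contacts one peer chosen at random, and the pair reconciles.
    nodes = list(histories)
    for node in nodes:
        peer = random.choice([n for n in nodes if n != node])
        merged = reconcile(histories[node], histories[peer])
        histories[node] = histories[peer] = merged

histories = {
    "A": {(1, "A", "join")},
    "B": {(2, "B", "join")},
    "C": {(3, "C", "join")},
}
for _ in range(5):
    gossip_round(histories)
# After a few rounds, all views agree (eventual consistency).
```

Because `reconcile` is a set union (commutative, associative, idempotent), the order and repetition of exchanges never matters, which is what makes the eventually consistent membership view safe.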
Each node contacts a peer chosen at random every second and the two nodes
efficiently reconcile their persisted membership change histories. When a
node starts for the first time, it chooses its set of tokens (virtual nodes in
the consistent hash space) and maps nodes to their respective token sets. The
mapping is persisted on disk and
initially contains only the local node and token set. The mappings stored at
different Dynamo nodes are reconciled during the same communication exchange
that reconciles the membership change histories. Therefore, partitioning and
placement information also propagates via the gossip-based protocol and each
storage node is aware of the token ranges handled by its peers. This allows
each node to forward a key’s read/write operations to the right set of nodes
directly.

4.8.2 External Discovery

The mechanism described above could
temporarily result in a logically partitioned Dynamo ring. For example, the
administrator could contact node A to join A to the ring, then contact node B
to join B to the ring. In this scenario, nodes A and B would each consider
itself a member of the ring, yet neither would be immediately aware of the
other. To prevent logical partitions, some Dynamo nodes play the role of
seeds. Seeds are nodes that are discovered via an external mechanism and are
known to all nodes. Because all nodes eventually reconcile their membership
with a seed, logical partitions are highly unlikely. Seeds can be obtained
either from static configuration or from a configuration service. Typically
seeds are fully functional nodes in the Dynamo ring.

4.8.3 Failure Detection
Failure detection in Dynamo is used to avoid attempts to communicate with
unreachable peers during get() and put() operations and when transferring
partitions and hinted replicas. For the purpose of avoiding failed attempts at
communication, a purely local notion of failure detection is entirely
sufficient: node A may consider node B failed if node B does not respond to
node A’s messages (even if B is responsive to node C's messages). In the
presence of a steady rate of client requests generating internode communication
in the Dynamo ring, a node A quickly discovers that a node B is unresponsive
when B fails to respond to a message; Node A then uses alternate nodes to
service requests that map to B's partitions; A periodically retries B to check
for the latter's recovery. In the absence of client requests to drive traffic
between two nodes, neither node really needs to know whether the other is
reachable and responsive. Decentralized failure detection protocols use a
simple gossip-style protocol that enables each node in the system to learn about
the arrival (or departure) of other nodes. For detailed information on
decentralized failure detectors and the parameters affecting their accuracy,
the interested reader is referred to [8]. Early designs of Dynamo used a
decentralized failure detector to maintain a globally consistent view of
failure state. Later it was determined that the explicit node join and leave
methods obviate the need for a global view of failure state. This is because
nodes are notified of permanent node additions and removals by the explicit
node join and leave methods and temporary node failures are detected by the
individual nodes when they fail to communicate with others (while forwarding
requests).
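The purely local failure detector described above amounts to per-peer timeout bookkeeping: B is "down" only from A's point of view, only while A's probes go unanswered. A hypothetical Python sketch (the class, API, and timeout value are invented for illustration):

```python
class LocalFailureDetector:
    """Purely local view: a peer is 'down' for this node only, based on
    whether it answered this node's recent messages."""

    def __init__(self, timeout_s: float = 5.0):
        self.timeout_s = timeout_s
        self.last_probe = {}   # peer -> time we last sent it a message
        self.last_reply = {}   # peer -> time of its last successful response

    def record_probe(self, peer: str, now: float) -> None:
        self.last_probe[peer] = now

    def record_reply(self, peer: str, now: float) -> None:
        self.last_reply[peer] = now

    def is_alive(self, peer: str, now: float) -> bool:
        # A peer we have never probed is assumed alive; otherwise it is
        # considered failed once a probe has gone unanswered too long.
        probed = self.last_probe.get(peer)
        if probed is None:
            return True
        replied = self.last_reply.get(peer, float("-inf"))
        return replied >= probed or now - probed < self.timeout_s

fd = LocalFailureDetector(timeout_s=5.0)
fd.record_probe("B", now=0.0)          # A sends B a message
assert fd.is_alive("B", now=1.0)       # too early to conclude anything
assert not fd.is_alive("B", now=10.0)  # no reply within the timeout: down, for A
fd.record_reply("B", now=11.0)         # B answered a later retry
assert fd.is_alive("B", now=12.0)      # A's view of B recovers
```

As the paper argues, no gossip or global agreement is needed for this: each node's verdict drives only its own routing decisions, and retries let the verdict recover on their own.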
> The Dynamo web page on the Apache Cassandra site is missing content
> -------------------------------------------------------------------
>
> Key: CASSANDRA-14267
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14267
> Project: Cassandra
> Issue Type: Improvement
> Components: Documentation and Website
> Reporter: Kenneth Brotman
> Priority: Major
>
> Content for the following topics is needed for
> [http://cassandra.apache.org/doc/latest/architecture/dynamo.html]
> Gossip
> Failure Detection
> Token Ring/Ranges
>
> Please post content. I or someone else will take it from there.
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]