Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.
The "ArchitectureInternals" page has been changed by JonathanEllis: http://wiki.apache.org/cassandra/ArchitectureInternals?action=diff&rev1=27&rev2=28 Comment: update read path * ConsistencyLevel determines how many replies to wait for. See !WriteResponseHandler.determineBlockFor. Interaction with pending ranges is a bit tricky; see https://issues.apache.org/jira/browse/CASSANDRA-833 * If the FailureDetector says that we don't have enough nodes alive to satisfy the ConsistencyLevel, we fail the request with !UnavailableException * If the FD gives us the okay but writes time out anyway because of a failure after the request is sent or because of an overload scenario, !StorageProxy will write a "hint" locally to replay the write when the replica(s) timing out recover. This is called HintedHandoff. Note that HH does not prevent inconsistency entirely; either unclean shutdown or hardware failure can prevent the coordinating node from writing or replaying the hint. ArchitectureAntiEntropy is responsible for restoring consistency more completely. + * Cross-datacenter writes are not sent directly to each replica; instead, they are sent to a single replica, with a Header in !MessageOut telling that replica to forward to the other ones in that datacenter * on the destination node, !RowMutationVerbHandler uses Table.Apply to hand the write first to the !CommitLog, then to the Memtable for the appropriate !ColumnFamily. * When a Memtable is full, it gets sorted and written out as an SSTable asynchronously by !ColumnFamilyStore.maybeSwitchMemtable (so named because multiple concurrent calls to it will only flush once) * "Fullness" is monitored by !MeteredFlusher; the goal is to flush quickly enough that we don't OOM as new writes arrive while we still have to hang on to the memory of the old memtable during flush @@ -26, +27 @@ = Read path = * !StorageProxy gets the endpoints (nodes) responsible for replicas of the keys from the !ReplicationStrategy as a function of the row key (the key of the row being read) - * This may be a !SliceFromReadCommand, a !SliceByNamesReadCommand, or a !RangeSliceReadCommand, depending + * This may be a !SliceFromReadCommand, a !SliceByNamesReadCommand, or a !RangeSliceCommand, depending on the query type. Secondary index queries are also part of !RangeSliceCommand. * !StorageProxy filters the endpoints to contain only those that are currently up/alive * !StorageProxy then sorts, by asking the endpoint snitch, the responsible nodes by "proximity". * The definition of "proximity" is up to the endpoint snitch * With a SimpleSnitch, proximity directly corresponds to proximity on the token ring. * With implementations based on AbstractNetworkTopologySnitch (such as PropertyFileSnitch), endpoints that are in the same rack are always considered "closer" than those that are not. Failing that, endpoints in the same data center are always considered "closer" than those that are not. - * The DynamicSnitch, typically enabled in the configuration, wraps whatever underlying snitch (such as SimpleSnitch and NetworkTopologySnitch) so as to dynamically adjust the perceived "closeness" of endpoints based on their recent performance. This is in an effort to try to avoid routing traffic to endpoints that are slow to respond. + * The DynamicSnitch, typically enabled in the configuration, wraps whatever underlying snitch (such as SimpleSnitch and PropertyFileSnitch) so as to dynamically adjust the perceived "closeness" of endpoints based on their recent performance. 
@@ -26, +27 @@

  = Read path =
  * !StorageProxy gets the endpoints (nodes) responsible for replicas of the keys from the !ReplicationStrategy as a function of the row key (the key of the row being read)
- * This may be a !SliceFromReadCommand, a !SliceByNamesReadCommand, or a !RangeSliceReadCommand, depending
+ * This may be a !SliceFromReadCommand, a !SliceByNamesReadCommand, or a !RangeSliceCommand, depending on the query type. Secondary index queries are also part of !RangeSliceCommand.
  * !StorageProxy filters the endpoints to contain only those that are currently up/alive
  * !StorageProxy then sorts, by asking the endpoint snitch, the responsible nodes by "proximity".
    * The definition of "proximity" is up to the endpoint snitch
      * With a SimpleSnitch, proximity directly corresponds to proximity on the token ring.
      * With implementations based on AbstractNetworkTopologySnitch (such as PropertyFileSnitch), endpoints that are in the same rack are always considered "closer" than those that are not. Failing that, endpoints in the same data center are always considered "closer" than those that are not.
-     * The DynamicSnitch, typically enabled in the configuration, wraps whatever underlying snitch (such as SimpleSnitch and NetworkTopologySnitch) so as to dynamically adjust the perceived "closeness" of endpoints based on their recent performance. This is in an effort to try to avoid routing traffic to endpoints that are slow to respond.
+     * The DynamicSnitch, typically enabled in the configuration, wraps whatever underlying snitch (such as SimpleSnitch and PropertyFileSnitch) so as to dynamically adjust the perceived "closeness" of endpoints based on their recent performance. This is an effort to try to avoid routing more traffic to endpoints that are slow to respond.
  * !StorageProxy then arranges for messages to be sent to nodes as required:
    * The closest node (as determined by proximity sorting as described above) will be sent a command to perform an actual data read (i.e., return data to the co-ordinating node).
    * As required by consistency level, additional nodes may be sent digest commands, asking them to perform the read locally but send back the digest only.
      * For example, at replication factor 3 a read at consistency level QUORUM would require one digest read in addition to the data read sent to the closest node. (See ReadCallback, instantiated by StorageProxy)
    * If read repair is enabled (probabilistically if read repair chance is somewhere between 0% and 100%), remaining nodes responsible for the row will be sent messages to compute the digest of the response. (Again, see ReadCallback, instantiated by StorageProxy)
- * On the data node, !ReadVerbHandler gets the data from CFS.getColumnFamily or CFS.getRangeSlice and sends it back as a !ReadResponse
+ * On the data node, !ReadVerbHandler gets the data from CFS.getColumnFamily, CFS.getRangeSlice, or CFS.search for single-row reads, seq scans, and index scans, respectively, and sends it back as a !ReadResponse
  * The row is located by doing a binary search on the index in SSTableReader.getPosition
  * For single-row requests, we use a !QueryFilter subclass to pick the data from the Memtable and SSTables that we are looking for. The Memtable read is straightforward. The SSTable read is a little different depending on which kind of request it is:
    * If we are reading a slice of columns, we use the row-level column index to find where to start reading, and deserialize block-at-a-time (where "block" is the group of columns covered by a single index entry) so we can handle the "reversed" case without reading vast amounts into memory
    * If we are reading a group of columns by name, we still use the column index to locate each column, but first we check the row-level bloom filter to see if we need to do anything at all
    * The column readers provide an Iterator interface, so the filter can easily stop when it's done, without reading more columns than necessary
    * Since we need to potentially merge columns from multiple SSTable versions, the reader iterators are combined through a !ReducingIterator, which takes an iterator of uncombined columns as input, and yields combined versions as output (a sketch of this merge follows after this list)
+   * Single-row reads use !CollationController to determine which sstables are relevant -- for instance, if we're requesting column X, and we've read a value for X from sstable A at time T1, then any sstables whose maximum timestamp is less than T1 can be ignored (a sketch of this check appears at the end of this message). In addition:
  * At any point if a message is destined for the local node, the appropriate piece of work (data read or digest read) is directly submitted to the appropriate local stage (see StageManager) rather than going through messaging over the network.
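As referenced in the !ReducingIterator bullet above, the merge step's job is: for each column name, keep the version with the newest timestamp across the memtable and all relevant sstables. Below is a standalone sketch under that assumption; the types are illustrative, not Cassandra's actual iterator classes, and it collects into a map rather than merging lazily over iterators.

import java.util.*;
import java.util.stream.*;

// Sketch only: reduces same-named columns from several sources to one column
// each, newest timestamp winning, analogous to what the ReducingIterator does
// over the per-sstable and memtable iterators.
public class ColumnMergeSketch {
    record Column(String name, String value, long timestamp) {}

    static Collection<Column> merge(List<List<Column>> sources) {
        return sources.stream()
                      .flatMap(List::stream)
                      .collect(Collectors.toMap(
                          Column::name,
                          c -> c,
                          (a, b) -> a.timestamp() >= b.timestamp() ? a : b, // newest wins
                          TreeMap::new))                                    // keep name ordering
                      .values();
    }

    public static void main(String[] args) {
        List<Column> sstableA = List.of(new Column("x", "stale", 1), new Column("y", "only-in-A", 1));
        List<Column> memtable = List.of(new Column("x", "fresh", 5));
        System.out.println(merge(List.of(sstableA, memtable)));
        // x resolves to the memtable value (timestamp 5); y comes from sstable A
    }
}

In the real code this happens lazily over iterators so the read can stop as soon as the filter is satisfied; the map here just keeps the example short.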

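Finally, the !CollationController bullet's timestamp rule can be shown in isolation. This is a standalone illustration under simplified assumptions (one requested column, a hypothetical Source type), not the real CollationController:

import java.util.*;

// Sketch only: illustrates the "skip sstables whose max timestamp is too old"
// rule for a single requested column. Source is a hypothetical stand-in for a
// memtable or sstable, not a Cassandra class.
public class TimestampSkipSketch {
    record Source(String name, long maxTimestamp, String value, long valueTimestamp) {}

    // Walk sources newest-first; once we hold a value newer than a source's
    // max timestamp, that source cannot contribute anything and is skipped.
    static String collateColumnX(List<Source> newestFirst) {
        String best = null;
        long bestTs = Long.MIN_VALUE;
        for (Source s : newestFirst) {
            if (s.maxTimestamp() < bestTs) {
                System.out.println("skipping " + s.name());
                continue; // could even break, since sources are sorted newest-first
            }
            if (s.value() != null && s.valueTimestamp() > bestTs) {
                best = s.value();
                bestTs = s.valueTimestamp();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Source> newestFirst = List.of(
            new Source("memtable",  900, "x=fresh", 700),
            new Source("sstable-A", 500, "x=stale", 400),  // skipped: 500 < 700
            new Source("sstable-B", 100, null, 0));        // skipped: 100 < 700
        System.out.println("result: " + collateColumnX(newestFirst)); // x=fresh
    }
}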