Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.

The "ArchitectureInternals_ZH" page has been changed by HubertChang.
http://wiki.apache.org/cassandra/ArchitectureInternals_ZH?action=diff&rev1=16&rev2=17

--------------------------------------------------

  ## page was copied from ArchitectureInternals
- = General =
-  * Configuration file is parsed by !DatabaseDescriptor (which also has all 
the default values, if any)
-  * Thrift generates an API interface in Cassandra.java; the implementation is 
!CassandraServer, and !CassandraDaemon ties it together.
-  * !CassandraServer turns Thrift requests into their internal equivalents; !StorageProxy does the actual work, and !CassandraServer converts the results back into Thrift responses
-  * !StorageService is kind of the internal counterpart to !CassandraDaemon.  
It handles turning raw gossip into the right internal state.
-  * !AbstractReplicationStrategy controls what nodes get secondary, tertiary, etc. replicas of each key range.  The primary replica is always determined by the token ring (in !TokenMetadata), but you can do a lot of variation with the others.  !RackUnaware just puts replicas on the next N-1 nodes in the ring.  !RackAware puts the first non-primary replica on the next node in the ring that is in ANOTHER data center than the primary, then the remaining replicas in the same data center as the primary.  (A sketch of this ring walk follows this list.)
-  * !MessagingService handles connection pooling and running internal commands on the appropriate stage (basically, a threaded !ExecutorService).  Stages are set up in !StageManager; currently there are read, write, and stream stages.  (Streaming is for when one node copies large sections of its SSTables to another, for bootstrap or relocation on the ring.)  The internal commands are defined in !StorageService; look for `registerVerbHandlers`.
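
A minimal sketch of the ring walk described above, assuming one token per node and a sorted token-to-node map (the real logic lives in !AbstractReplicationStrategy and !TokenMetadata; the class and method names here are illustrative only):

{{{#!java
import java.util.*;

// Hypothetical stand-in for TokenMetadata: tokens sorted ascending,
// each mapping to the node that owns that token.
class RingSketch {
    final TreeMap<Long, String> ring = new TreeMap<>();

    // RackUnaware-style placement: the primary plus the next N-1 nodes
    // encountered walking clockwise around the ring.
    List<String> rackUnawareReplicas(long keyToken, int n) {
        List<String> replicas = new ArrayList<>();
        // The first token >= the key's token owns the key; wrap if none.
        Long t = ring.ceilingKey(keyToken);
        if (t == null) t = ring.firstKey();
        while (replicas.size() < n && replicas.size() < ring.size()) {
            replicas.add(ring.get(t));
            t = ring.higherKey(t);
            if (t == null) t = ring.firstKey();  // wrap around the ring
        }
        return replicas;
    }
}
}}}

A !RackAware variant would additionally consult a node-to-datacenter mapping, forcing the first non-primary replica onto the next ring node in a different data center before filling the rest as above.
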
+ ##master-page:ArchitectureInternals
+ #format wiki
+ #language zh
+ == Overview ==
+  * Access to Cassandra is provided through service frameworks such as Thrift or Avro, which generate a certain amount of client-side and server-side interface code. For Cassandra, this generates Cassandra.java, which contains Cassandra.Iface; the client-side implementation of Cassandra.Iface is Cassandra.Client, and the server-side implementation is !CassandraServer. (Translator's addition.)
+  * Cassandra starts up inside !CassandraDaemon; at startup, the configuration file is parsed by !DatabaseDescriptor (which also supplies default values for settings where needed).
+  * Once started, !CassandraDaemon accepts network connections from clients; the client requests it receives are answered and executed by !CassandraServer.
+  * !CassandraServer translates each client request into its one-to-one internal equivalent. !StorageProxy does the actual work for these requests, and !CassandraServer returns the results to the client.
+  * !StorageService handles requests and responses between nodes within the Cassandra cluster; it can be seen as the cluster-internal counterpart of !CassandraDaemon. It turns the cluster's raw gossip into the corresponding internal state. (Translator's note: the methods of !StorageProxy ultimately call !StorageService to perform cluster-internal access.)
+  * !AbstractReplicationStrategy controls which nodes get the secondary, tertiary, etc. replicas of each key range. The node holding the primary replica is determined by the data's token and other variables. If the replication strategy is rack-unaware, the primary's replicas are stored on the next N-1 nodes along the token ring. If it is rack-aware, the first non-primary replica is stored in a different data center from the primary, and the remaining N-2 replicas (assuming N>2) stay in the same data center as the primary, on the next N-2 nodes in ring order.
+  * !MessagingService handles the internal connection pools and runs internal commands on the appropriate stage (via a multithreaded !ExecutorService). Stages are set up in !StageManager; currently there are a read stage, a write stage, and a streaming stage (streaming is what happens when large amounts of SSTable data must be moved, e.g. during bootstrap or when tokens are reassigned on the ring). (Translator's note: a "stage" here is neither a phase nor a theater stage; it is a concept with its own time and space semantics; see the Cassandra source code or the SEDA literature for details.) The internal commands that run on the stages are defined in !StorageService; look for `registerVerbHandlers`. (A sketch of this stage wiring follows this list.)
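
The stage wiring described in the last item can be pictured with a few lines of Java. This is only a sketch of the SEDA pattern, not !MessagingService's actual API; the `VerbHandler` interface, the stage names, and the pool sizes are assumptions made for illustration:

{{{#!java
import java.util.*;
import java.util.concurrent.*;

// Each stage is just a named thread pool; verbs map to handlers that are
// executed on whichever stage is appropriate for the work.
interface VerbHandler { void doVerb(byte[] message); }

class StageSketch {
    final Map<String, ExecutorService> stages = new HashMap<>();
    final Map<String, VerbHandler> verbHandlers = new HashMap<>();

    StageSketch() {
        // Mirrors the read, write, and streaming stages mentioned above.
        for (String name : List.of("READ", "WRITE", "STREAM"))
            stages.put(name, Executors.newFixedThreadPool(4));
    }

    // Analogous in spirit to StorageService's registerVerbHandlers.
    void registerVerbHandler(String verb, VerbHandler handler) {
        verbHandlers.put(verb, handler);
    }

    // An incoming message is queued on its stage instead of being run inline,
    // which is what decouples the stages from one another.
    void deliver(String verb, String stage, byte[] message) {
        stages.get(stage).submit(() -> verbHandlers.get(verb).doVerb(message));
    }
}
}}}
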
  
- = Write path =
-  * !StorageProxy gets the nodes responsible for replicas of the keys from the 
!ReplicationStrategy, then sends !RowMutation messages to them.
-    * If nodes are changing position on the ring, "pending ranges" are 
associated with their destinations in !TokenMetadata and these are also written 
to.
-    * If nodes that should accept the write are down, but the remaining nodes 
can fulfill the requested !ConsistencyLevel, the writes for the down nodes will 
be sent to another node instead, with a header (a "hint") saying that data 
associated with that key should be sent to the replica node when it comes back 
up.  This is called HintedHandoff and reduces the "eventual" in "eventual 
consistency."  Note that HintedHandoff is only an '''optimization'''; 
ArchitectureAntiEntropy is responsible for restoring consistency more 
completely.
-  * on the destination node, !RowMutationVerbHandler uses Table.apply to hand the write first to the !CommitLog, then to the Memtable for the appropriate !ColumnFamily.
-  * When a Memtable is full, it gets sorted and written out as an SSTable 
asynchronously by !ColumnFamilyStore.switchMemtable
-    * When enough SSTables exist, they are merged by 
!ColumnFamilyStore.doFileCompaction
-      * Making this concurrency-safe without blocking writes or reads while we remove the old SSTables from the list and add the new one is tricky, because naive approaches require waiting for all readers of the old SSTables to finish before deleting them (since we can't know if they have actually started opening the file yet; if they have not and we delete the file first, they will error out).  The approach we have settled on is to not actually delete old SSTables synchronously; instead we register a phantom reference with the garbage collector, so when no references to the SSTable exist it will be deleted.  (We also write a compaction marker to the file system so if the server is restarted before that happens, we clean out the old SSTables at startup time.)  (A sketch of the phantom-reference pattern follows below.)
-  * See [[ArchitectureSSTable]] and ArchitectureCommitLog for more details
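
The phantom-reference trick from the compaction bullet can be sketched as below. `SSTableCleanup` and its methods are hypothetical simplifications; the real bookkeeping lives in !SSTableReader and related classes:

{{{#!java
import java.io.File;
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Deferred deletion: when the last strong reference to a reader handle is
// gone, the GC enqueues its phantom reference and a cleanup thread deletes
// the underlying data file, so no reader can still be opening it.
class SSTableCleanup {
    static final ReferenceQueue<Object> QUEUE = new ReferenceQueue<>();
    // Phantom references must stay strongly reachable until they are enqueued.
    static final Set<SSTableReference> PENDING = ConcurrentHashMap.newKeySet();

    static class SSTableReference extends PhantomReference<Object> {
        final File dataFile;  // stored here; the referent is unreachable by cleanup time
        SSTableReference(Object readerHandle, File dataFile) {
            super(readerHandle, QUEUE);
            this.dataFile = dataFile;
        }
    }

    // Called when compaction obsoletes an SSTable; deletion happens later.
    static void markCompacted(Object readerHandle, File dataFile) {
        PENDING.add(new SSTableReference(readerHandle, dataFile));
    }

    static void startCleanupThread() {
        Thread t = new Thread(() -> {
            while (true) {
                try {
                    SSTableReference r = (SSTableReference) QUEUE.remove();
                    r.dataFile.delete();  // no live reader can hold the handle now
                    PENDING.remove(r);
                } catch (InterruptedException e) {
                    return;
                }
            }
        });
        t.setDaemon(true);
        t.start();
    }
}
}}}

The compaction marker mentioned above covers the crash case: if the process dies before the GC gets around to enqueuing the reference, startup can still tell which files were already obsolete.
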
- 
- = Read path =
-  * !StorageProxy gets the nodes responsible for replicas of the keys from the 
!ReplicationStrategy, then sends read messages to them
-    * This may be a !SliceFromReadCommand, a !SliceByNamesReadCommand, or a !RangeSliceReadCommand, depending on the type of query
-  * On the data node, !ReadVerbHandler gets the data from CFS.getColumnFamily 
or CFS.getRangeSlice and sends it back as a !ReadResponse
-    * The row is located by doing a binary search on the index in 
SSTableReader.getPosition
-    * For single-row requests, we use a !QueryFilter subclass to pick the data 
from the Memtable and SSTables that we are looking for.  The Memtable read is 
straightforward.  The SSTable read is a little different depending on which 
kind of request it is:
-      * If we are reading a slice of columns, we use the row-level column 
index to find where to start reading, and deserialize block-at-a-time (where 
"block" is the group of columns covered by a single index entry) so we can 
handle the "reversed" case without reading vast amounts into memory
-      * If we are reading a group of columns by name, we still use the column 
index to locate each column, but first we check the row-level bloom filter to 
see if we need to do anything at all
-    * The column readers provide an Iterator interface, so the filter can 
easily stop when it's done, without reading more columns than necessary
-      * Since we need to potentially merge columns from multiple SSTable versions, the reader iterators are combined through a !ReducingIterator, which takes an iterator of uncombined columns as input and yields combined versions as output (see the sketch after this list)
-  * If a quorum read was requested, !StorageProxy waits for a majority of 
nodes to reply and makes sure the answers match before returning.  Otherwise, 
it returns the data reply as soon as it gets it, and checks the other replies 
for discrepancies in the background in !StorageService.doConsistencyCheck.  
This is called "read repair," and also helps achieve consistency sooner.
-    * As an optimization, !StorageProxy only asks the closest replica for the 
actual data; the other replicas are asked only to compute a hash of the data.
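
The merge in the !ReducingIterator item can be sketched generically. `Column` and the last-write-wins `resolve` rule are placeholders for Cassandra's actual reconciliation logic:

{{{#!java
import java.util.*;

// Input: columns sorted by name, possibly with duplicates from different
// SSTable versions. Output: one combined column per name.
class ColumnMerge {
    record Column(String name, long timestamp, String value) {}

    // Placeholder reconciliation: the newer timestamp wins.
    static Column resolve(Column a, Column b) {
        return a.timestamp() >= b.timestamp() ? a : b;
    }

    static Iterator<Column> reduce(Iterator<Column> sorted) {
        return new Iterator<Column>() {
            Column pending = sorted.hasNext() ? sorted.next() : null;

            public boolean hasNext() { return pending != null; }

            public Column next() {
                Column current = pending;
                pending = null;
                // Fold every following column with the same name into one.
                while (sorted.hasNext()) {
                    Column c = sorted.next();
                    if (c.name().equals(current.name())) {
                        current = resolve(current, c);
                    } else {
                        pending = c;  // first column of the next group
                        break;
                    }
                }
                return current;
            }
        };
    }
}
}}}

Because the output is itself an iterator, the enclosing !QueryFilter can stop pulling as soon as it has enough columns, which is what keeps reversed and sliced reads from materializing whole rows.
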
- 
- = Gossip =
-  * based on "Efficient reconciliation and flow control for anti-entropy 
protocols:" http://www.cs.cornell.edu/home/rvr/papers/flowgossip.pdf
-  * See ArchitectureGossip for more details
- 
- = Failure detection =
-  * based on "The Phi accrual failure detector:" 
http://vsedach.googlepages.com/HDY04.pdf
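
As a rough illustration of the accrual idea, here is a sketch that models heartbeat inter-arrival times with an exponential distribution (the paper uses a normal distribution, and Cassandra's implementation differs in details; the window size and threshold here are assumptions):

{{{#!java
import java.util.ArrayDeque;

// phi = -log10(probability that a heartbeat is merely late), so suspicion
// grows continuously with silence instead of flipping a binary up/down switch.
class PhiAccrualSketch {
    private final ArrayDeque<Long> intervals = new ArrayDeque<>();
    private long lastHeartbeatMillis = -1;

    void heartbeat(long nowMillis) {
        if (lastHeartbeatMillis >= 0) {
            intervals.addLast(nowMillis - lastHeartbeatMillis);
            if (intervals.size() > 1000) intervals.removeFirst();  // sliding window
        }
        lastHeartbeatMillis = nowMillis;
    }

    double phi(long nowMillis) {
        if (intervals.isEmpty()) return 0.0;
        double mean = intervals.stream().mapToLong(Long::longValue).average().orElse(1.0);
        double elapsed = nowMillis - lastHeartbeatMillis;
        // Exponential model: P(interval > elapsed) = e^(-elapsed/mean),
        // hence phi = (elapsed / mean) * log10(e).
        return (elapsed / mean) * Math.log10(Math.E);
    }
}
}}}

A node is then considered down once phi crosses a configured threshold (for example 8), which trades detection speed against false positives.
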
- 
- = Further reading =
-  * The idea of dividing work into "stages" with separate thread pools comes 
from the famous SEDA paper: 
http://www.eecs.harvard.edu/~mdw/papers/seda-sosp01.pdf
-  * Crash-only design is another broadly applied principle.  
[[http://lwn.net/Articles/191059/|Valerie Henson's LWN article]] is a good 
introduction
-  * Cassandra's distribution is closely related to the one presented in Amazon's Dynamo paper.  Read repair, adjustable consistency levels, hinted handoff, and other concepts are discussed there.  This is required background material: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html.  The related [[http://www.allthingsdistributed.com/2008/12/eventually_consistent.html|article on eventual consistency]] is also relevant.
-  * Cassandra's on-disk storage model is loosely based on sections 5.3 and 5.4 
of [[http://labs.google.com/papers/bigtable.html|the Bigtable paper]].
-  * Facebook's Cassandra team authored a paper on Cassandra for LADIS 09: 
http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf. 
Most of the information there is applicable to Apache Cassandra (the main 
exception is the integration of !ZooKeeper).
- 
