[Cassandra Wiki] Update of "ArchitectureInternals_JP" by yukim

Apache Wiki Thu, 08 Apr 2010 08:45:28 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.


The "ArchitectureInternals_JP" page has been changed by yukim.
http://wiki.apache.org/cassandra/ArchitectureInternals_JP?action=diff&rev1=14&rev2=15

--------------------------------------------------

  ## page was copied from ArchitectureInternals
- = General =
-  * Configuration file is parsed by !DatabaseDescriptor (which also has all 
the default values, if any)
-  * Thrift generates an API interface in Cassandra.java; the implementation is 
!CassandraServer, and !CassandraDaemon ties it together.
-  * !CassandraServer turns thrift requests into the internal equivalents, then 
!StorageProxy does the actual work, then !CassandraServer turns it back into 
thrift again
-  * !StorageService is kind of the internal counterpart to !CassandraDaemon.  
It handles turning raw gossip into the right internal state.
-  * !AbstractReplicationStrategy controls what nodes get secondary, tertiary, 
etc. replicas of each key range.  Primary replica is always determined by the 
token ring (in !TokenMetadata) but you can do a lot of variation with the 
others.  !RackUnaware just puts replicas on the next N-1 nodes in the ring.  
!RackAware puts the first non-primary replica in the next node in the ring in 
ANOTHER data center than the primary; then the remaining replicas in the same 
as the primary.
-  * !MessagingService handles connection pooling and running internal commands 
on the appropriate stage (basically, a threaded executorservice).  Stages are 
set up in !StageManager; currently there are read, write, and stream stages.  
(Streaming is for when one node copies large sections of its SSTables to 
another, for bootstrap or relocation on the ring.)  The internal commands are 
defined in !StorageService; look for `registerVerbHandlers`.
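+ = Overview =
+  * The configuration file is parsed by !DatabaseDescriptor (which also defines 
the default value for every setting that has one).
+  * Thrift generates the API interface in Cassandra.java; !CassandraServer 
implements it, and !CassandraDaemon ties them together.
+  * !CassandraServer converts Thrift requests into their internal equivalents, 
!StorageProxy does the actual work, and !CassandraServer then converts the 
result back into Thrift form and returns it.
+  * !StorageService is roughly the internal counterpart to !CassandraDaemon.  
It is responsible for turning raw gossip into the correct internal state.
+  * !AbstractReplicationStrategy controls which nodes receive the second, 
third, and later replicas.  The first replica is determined by the token ring 
(in !TokenMetadata), but the placement of the others can vary widely.  
!RackUnaware simply places replicas on the next N-1 nodes on the ring.  
!RackAware places the second replica on the next node on the ring that is in a 
different data center from the first replica, then places the remaining 
replicas in the same data center as the first.
+  * !MessagingService handles connection pooling and runs internal commands on 
the appropriate stage (essentially a threaded !ExecutorService).  Stages are 
defined in !StageManager; currently there are read, write, and stream stages.  
(Streaming is used when one node copies large sections of its SSTables to 
another, during bootstrap or relocation on the ring.)  The internal commands 
are defined in !StorageService; see `registerVerbHandlers`.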
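The !RackUnaware placement described above can be sketched as a walk around a sorted token ring. This is a minimal illustration, not Cassandra's code: the class `RingSketch`, its methods, and the rule that the first replica is the first node whose token is greater than or equal to the key's token are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical sketch of RackUnaware-style placement; not Cassandra's actual classes. */
class RingSketch {
    // token -> node name, kept sorted so the map can be walked like a ring
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(long token, String node) {
        ring.put(token, node);
    }

    /**
     * Returns up to n distinct nodes: the first replica (assumed here to be the
     * first node whose token is >= the key token, wrapping around), followed by
     * the next n-1 nodes on the ring.
     */
    List<String> replicasFor(long keyToken, int n) {
        List<String> result = new ArrayList<>();
        if (ring.isEmpty()) {
            return result;
        }
        int wanted = Math.min(n, ring.size());
        Long token = ring.ceilingKey(keyToken);
        if (token == null) {
            token = ring.firstKey(); // wrap around past the highest token
        }
        while (result.size() < wanted) {
            result.add(ring.get(token));
            token = ring.higherKey(token);
            if (token == null) {
                token = ring.firstKey();
            }
        }
        return result;
    }

    public static void main(String[] args) {
        RingSketch ring = new RingSketch();
        ring.addNode(10L, "A");
        ring.addNode(20L, "B");
        ring.addNode(30L, "C");
        ring.addNode(40L, "D");
        System.out.println(ring.replicasFor(25L, 3));
    }
}
```

A rack-aware variant would differ only in how the second replica is chosen, skipping ahead to the next node in another data center before continuing the walk.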
  
- = Write path =
-  * !StorageProxy gets the nodes responsible for replicas of the keys from the 
!ReplicationStrategy, then sends !RowMutation messages to them.
-    * If nodes are changing position on the ring, "pending ranges" are 
associated with their destinations in !TokenMetadata and these are also written 
to.
-    * If nodes that should accept the write are down, but the remaining nodes 
can fulfill the requested !ConsistencyLevel, the writes for the down nodes will 
be sent to another node instead, with a header (a "hint") saying that data 
associated with that key should be sent to the replica node when it comes back 
up.  This is called HintedHandoff and reduces the "eventual" in "eventual 
consistency."  Note that HintedHandoff is only an '''optimization'''; 
ArchitectureAntiEntropy is responsible for restoring consistency more 
completely.
-  * on the destination node, !RowMutationVerbHandler hands the write first to 
!CommitLog.java, then to the Memtable for the appropriate !ColumnFamily 
(through Table.apply).
-  * When a Memtable is full, it gets sorted and written out as an SSTable 
asynchronously by !ColumnFamilyStore.switchMemtable
-    * When enough SSTables exist, they are merged by 
!ColumnFamilyStore.doFileCompaction
-      * Making this concurrency-safe without blocking writes or reads while we 
remove the old SSTables from the list and add the new one is tricky, because 
naive approaches require waiting for all readers of the old sstables to finish 
before deleting them (since we can't know if they have actually started opening 
the file yet; if they have not and we delete the file first, they will error 
out).  The approach we have settled on is to not actually delete old SSTables 
synchronously; instead we register a phantom reference with the garbage 
collector, so when no references to the SSTable exist it will be deleted.  (We 
also write a compaction marker to the file system so if the server is restarted 
before that happens, we clean out the old SSTables at startup time.)
-  * See [[ArchitectureSSTable]] and ArchitectureCommitLog for more details
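+ = Write path =
+  * !StorageProxy gets the nodes holding replicas of the key from the 
!ReplicationStrategy, then sends !RowMutation messages to them.
+    * If nodes are changing position on the ring, "pending ranges" are 
associated with their destinations in !TokenMetadata, and these are written to 
as well.
+    * If a node that should accept the write is down but the remaining nodes 
can satisfy the requested !ConsistencyLevel, the write for the down node is 
sent to another node instead, with a header (a "hint") saying that the data 
associated with that key should be sent to the replica node once it comes back 
up.  This is called [[HintedHandoff_JP|HintedHandoff]] and shortens the 
"eventual" in "eventual consistency."  [[HintedHandoff_JP|HintedHandoff]] is 
only an '''optimization'''; [[ArchitectureAntiEntropy|AntiEntropy]] is 
responsible for restoring consistency more completely.
+  * On the receiving node, !RowMutationVerbHandler hands the write first to 
!CommitLog.java, then writes it to the Memtable for the appropriate 
!ColumnFamily (through Table.apply).
+  * When a Memtable is full, it is sorted and written out asynchronously as an 
SSTable by !ColumnFamilyStore.switchMemtable.
+    * When enough SSTables exist, they are merged by 
!ColumnFamilyStore.doFileCompaction.
+      * Making this concurrency-safe without blocking reads or writes while 
the old SSTables are removed and the new one is added takes care, because 
naive approaches must wait for every reader of the old SSTables to finish 
before deleting them.  (There is no way to know whether a reader has actually 
started opening the file; if it has not and the file is deleted first, the 
read fails.)  Cassandra therefore does not delete old SSTables synchronously.  
Instead it registers a phantom reference (!PhantomReference) with the garbage 
collector, so the SSTable is deleted once no references to it remain.  (It 
also writes a compaction marker to the file system, so that if the server 
restarts before the deletion happens, the old SSTables are cleaned out at 
startup.)
+  * See [[ArchitectureSSTable]] and 
[[ArchitectureCommitLog|ArchitectureCommitLog]] for more details.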
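The ordering on the receiving node (commit log first, then Memtable, then a sorted flush once the Memtable is full) can be sketched as follows. This is a simplified illustration under stated assumptions: `WritePathSketch` and its fields are hypothetical, "full" is modeled as an entry count, and the flush runs synchronously, whereas the page says the real flush is asynchronous.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical sketch of the write ordering; not Cassandra's actual classes. */
class WritePathSketch {
    final List<String> commitLog = new ArrayList<>();                    // append-only log
    final List<SortedMap<String, String>> sstables = new ArrayList<>();  // flushed, sorted tables
    private Map<String, String> memtable = new HashMap<>();              // unsorted in-memory buffer
    private final int memtableLimit;                                     // assumed "full" threshold

    WritePathSketch(int memtableLimit) {
        this.memtableLimit = memtableLimit;
    }

    /** Logs the write first, then applies it to the memtable. */
    void apply(String key, String value) {
        commitLog.add(key + "=" + value);
        memtable.put(key, value);
        if (memtable.size() >= memtableLimit) {
            switchMemtable();
        }
    }

    /** Sorts the full memtable into an immutable-by-convention table and starts a fresh one. */
    void switchMemtable() {
        sstables.add(new TreeMap<>(memtable)); // TreeMap copy sorts by key
        memtable = new HashMap<>();
    }

    int memtableSize() {
        return memtable.size();
    }

    public static void main(String[] args) {
        WritePathSketch node = new WritePathSketch(2);
        node.apply("b", "1");
        node.apply("a", "2");
        node.apply("c", "3");
        System.out.println(node.sstables + " pending=" + node.memtableSize());
    }
}
```

Writing to the log before the Memtable is what lets a restarted node replay writes that were never flushed.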
  
- = Read path =
-  * !StorageProxy gets the nodes responsible for replicas of the keys from the 
!ReplicationStrategy, then sends read messages to them
-    * This may be a !SliceFromReadCommand, a !SliceByNamesReadCommand, or a 
!RangeSliceReadCommand, depending
-  * On the data node, !ReadVerbHandler gets the data from CFS.getColumnFamily 
or CFS.getRangeSlice and sends it back as a !ReadResponse
-    * For single-row requests, we use a !QueryFilter subclass to pick the data 
from the Memtable and SSTables that we are looking for.  The Memtable read is 
straightforward.  The SSTable read is a little different depending on which 
kind of request it is:
-      * If we are reading a slice of columns, we use the row-level column 
index to find where to start reading, and deserialize block-at-a-time (where 
"block" is the group of columns covered by a single index entry) so we can 
handle the "reversed" case without reading vast amounts into memory
-      * If we are reading a group of columns by name, we still use the column 
index to locate each column, but first we check the row-level bloom filter to 
see if we need to do anything at all
-    * The column readers provide an Iterator interface, so the filter can 
easily stop when it's done, without reading more columns than necessary
-      * Since we need to potentially merge columns from multiple SSTable 
versions, the reader iterators are combined through a !ReducingIterator, which 
takes an iterator of uncombined columns as input, and yields combined versions 
as output
-  * If a quorum read was requested, !StorageProxy waits for a majority of 
nodes to reply and makes sure the answers match before returning.  Otherwise, 
it returns the data reply as soon as it gets it, and checks the other replies 
for discrepancies in the background in !StorageService.doConsistencyCheck.  
This is called "read repair," and also helps achieve consistency sooner.
-    * As an optimization, !StorageProxy only asks the closest replica for the 
actual data; the other replicas are asked only to compute a hash of the data.
  
- = Gossip =
-  * based on "Efficient reconciliation and flow control for anti-entropy 
protocols:" http://www.cs.cornell.edu/home/rvr/papers/flowgossip.pdf
-  * See ArchitectureGossip for more details
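+ = Read path =
+  * !StorageProxy gets the nodes holding replicas of the key from the 
!ReplicationStrategy, then sends read messages to them.
+    * The message is a !SliceFromReadCommand, a !SliceByNamesReadCommand, or a 
!RangeSliceReadCommand.
+  * On the node holding the data, !ReadVerbHandler gets the data using 
!ColumnFamilyStore.getColumnFamily and sends it back as a !ReadResponse.
+    * For single-row requests, a !QueryFilter subclass picks the requested 
data from the Memtable and the SSTables.  The Memtable read is 
straightforward.  The SSTable read differs slightly depending on the kind of 
request:
+      * When reading a slice of columns, the row-level column index 
determines where to start reading, and columns are deserialized one block at a 
time (a "block" is the group of columns covered by a single index entry).  
This handles the "reversed" case without reading large amounts into memory.
+      * When reading a group of columns by name, the column index is still 
used to locate each column, but the row-level bloom filter is checked first to 
see whether the file needs to be read at all.
+    * The column readers provide an Iterator interface, so the filter can 
stop as soon as it is done, without reading unneeded columns.
+      * Because columns may need to be merged from multiple SSTable versions, 
the reader iterators are combined through a !ReducingIterator, which takes an 
iterator of uncombined columns as input and yields combined versions as 
output.
+  * If a quorum read was requested, !StorageProxy waits for a majority of 
nodes to reply and checks that the replies match before returning the result.  
Otherwise, it returns the result immediately and checks the replicas for 
inconsistencies in the background with !StorageService.doConsistencyCheck.  
This is called "read repair" and helps achieve consistency sooner.
+    * As an optimization, !StorageProxy asks only the closest replica for the 
actual data; the other replicas are asked only to compute a hash of the data.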
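The data-plus-hash optimization above can be sketched like this: one replica returns the value, the others return only a digest, and the coordinator hashes the value it received and compares. The class `DigestReadSketch`, its method names, and the choice of MD5 are assumptions for illustration; the page only says "a hash of the data."

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of comparing one data reply with digest-only replies. */
class DigestReadSketch {
    /** What a digest-only replica would compute over its local copy (MD5 is an assumption). */
    static byte[] digest(String data) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(data.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /** True when every digest reply agrees with the hash of the full data reply. */
    static boolean repliesMatch(String dataReply, List<byte[]> digestReplies) {
        byte[] expected = digest(dataReply);
        for (byte[] reply : digestReplies) {
            if (!Arrays.equals(expected, reply)) {
                return false; // a mismatch is what would trigger repair
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<byte[]> replies = Arrays.asList(digest("v1"), digest("v2"));
        System.out.println(repliesMatch("v1", replies));
    }
}
```

Sending digests keeps the cross-node traffic small while still detecting replicas that have diverged.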
  
- = Failure detection =
-  * based on "The Phi accrual failure detector:" 
http://vsedach.googlepages.com/HDY04.pdf
  
+ = Gossip protocol =
+  * Based on "Efficient reconciliation and flow control for anti-entropy 
protocols": http://www.cs.cornell.edu/home/rvr/papers/flowgossip.pdf
+  * See ArchitectureGossip for more details.
- = Further reading =
-  * The idea of dividing work into "stages" with separate thread pools comes 
from the famous SEDA paper: 
http://www.eecs.harvard.edu/~mdw/papers/seda-sosp01.pdf
-  * Crash-only design is another broadly applied principle.  
[[http://lwn.net/Articles/191059/|Valerie Henson's LWN article]] is a good 
introduction
-  * Cassandra's distribution is closely related to the one presented in 
Amazon's Dynamo paper.  Read repair, adjustable consistency levels, hinted 
handoff, and other concepts are discussed there.  This is required background 
material: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html.  The 
related article on 
[[http://www.allthingsdistributed.com/2008/12/eventually_consistent.html|article
 on eventual consistency]] is also relevant.
-  * Cassandra's on-disk storage model is loosely based on sections 5.3 and 5.4 
of [[http://labs.google.com/papers/bigtable.html|the Bigtable paper]].
-  * Facebook's Cassandra team authored a paper on Cassandra for LADIS 09: 
http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf. 
Most of the information there is applicable to Apache Cassandra (the main 
exception is the integration of !ZooKeeper).
  
+ = Failure detection =
+  * Based on "The Phi accrual failure detector": 
http://vsedach.googlepages.com/HDY04.pdf
+ 
+ = Further reading =
+  * The idea of dividing work into "stages" with separate thread pools comes 
from the SEDA paper: 
http://www.eecs.harvard.edu/~mdw/papers/seda-sosp01.pdf
+  * Crash-only design is a broadly applied principle.  
[[http://lwn.net/Articles/191059/|Valerie Henson's LWN article]] is a good 
introduction.
+  * Cassandra's distribution is closely related to the one presented in 
Amazon's Dynamo paper, which discusses read repair, adjustable consistency 
levels, hinted handoff, and other concepts.  It is required background 
reading: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html  
The 
[[http://www.allthingsdistributed.com/2008/12/eventually_consistent.html|article on eventual consistency]] 
is also relevant.
+  * Cassandra's on-disk storage model is loosely based on sections 5.3 and 5.4 
of [[http://labs.google.com/papers/bigtable.html|the Bigtable paper]].
+  * Facebook's Cassandra team presented a paper on Cassandra at LADIS 09: 
http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf  
Most of the information there also applies to Apache Cassandra (the main 
difference is the !ZooKeeper integration).
+
