Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.
The "ArchitectureOverview" page has been changed by tuxracer69.
http://wiki.apache.org/cassandra/ArchitectureOverview?action=diff&rev1=1&rev2=2

--------------------------------------------------

  Architecture details
- O(1) node lookup Explicit replication Eventually consistent
+  * O(1) node lookup
+  * Explicit replication
+  * Eventually consistent
  Architecture layers
- Messaging service Gossip Failure detection Cluster state Partitioner Replication Commit log Memtable SSTable Indexes Compaction Tombstones Hinted handoff Read repair Bootstrap Monitoring Admin tools
- Writes
+
+  * Messaging service
+  * Gossip
+  * Failure detection
+  * Cluster state
+  * Partitioner
+  * Replication
+
+  * Commit log
+  * Memtable
+  * SSTable
+  * Indexes
+  * Compaction
+
+  * Tombstones
+  * Hinted handoff
+  * Read repair
+  * Bootstrap
+  * Monitoring
+  * Admin tools
+
+ == Writes ==
  Any node Partitioner Commitlog, memtable SSTable Compaction Wait for W responses
+ Write model:
+ There are two write modes:
+  * ''Quorum write'': blocks until a quorum is reached
+  * ''Async write'': sends the request to any node. That node pushes the data to the appropriate nodes but returns to the client immediately
+ If a node is down, the write goes to another node with a hint saying where it should have been written. A harvester runs every 15 minutes, finds hints, and moves the data to the appropriate node
+ === Write path ===
+ At write time,
+  * you first write to a '''disk commit log''' (sequential)
+  * After the write to the log, it is sent to the appropriate nodes
+  * Each node receiving the write first records it in a local log, then updates the appropriate '''memtables''' (one for each column family). A Memtable is Cassandra's in-memory representation of key/value pairs
+  before the data gets flushed to disk as an SSTable.
+  * '''Memtables''' are flushed to disk when:
+   * Out of space
+   * Too many keys (128 is default)
+   * Time duration (client provided – no cluster clock)
+  * When a memtable is written out, two files are produced:
+   * Data File ('''SSTable'''). An SSTable (terminology borrowed from Google) stands for Sorted Strings Table and is a file of key/value string pairs, sorted by keys.
+   * Index File ('''SSTable Index'''). (Similar to Hadoop !MapFile / Tfile)
+    * (Key, offset) pairs (points into data file)
+    * Bloom filter (all keys in data file)
+  * When a commit log has had all its column families pushed to disk, it is deleted
+  * '''Compaction''': Data files accumulate over time. Periodically, data files are merge-sorted into a new file (and a new index is created)
+   * Merge keys
+   * Combine columns
+   * Discard tombstones
+ == Remove ==
- Memtable / SSTable
-
- Disk
- Commit log
-
- SSTable format
-
-
- Key / data
-
- SSTable Indexes
-
-
- Bloom filter Key Column
-
-
-
-
-
- (Similar to Hadoop MapFile / Tfile)
-
- Compaction
-
-
- Merge keys Combine columns Discard tombstones
-
-
-
-
-
- Remove
  Deletion marker (tombstone) necessary to suppress data in older SSTables, until compaction Read repair complicates things a little Eventually consistent complicates things more Solution: configurable delay before tombstone GC, after which tombstones are not repaired
@@ -154, +171 @@
- Read path
+ == Read path ==
  Any node Partitioner Wait for R responses Wait for N - R responses in the background and perform read repair
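
The single-node part of the write path, removal and compaction described in this change can be illustrated with a small sketch. It is not Cassandra's implementation: the names `Store`, `SSTable` and `TOMBSTONE` are hypothetical, everything lives in memory, a plain dict stands in for the index file and its Bloom filter, only the "too many keys" flush trigger is modelled, and tombstones are dropped at the first compaction rather than after a configurable GC delay. Timestamps are supplied by the caller, matching the "client provided" note above.

```python
from dataclasses import dataclass

TOMBSTONE = object()  # hypothetical deletion marker, not a Cassandra constant


@dataclass
class SSTable:
    """Stand-in for a data file plus its index file."""
    rows: list   # [(key, {column: (timestamp, value)})], sorted by key
    index: dict  # key -> offset into rows (replaces index file + Bloom filter)


class Store:
    def __init__(self, max_keys=128):
        self.max_keys = max_keys  # flush threshold ("too many keys")
        self.commit_log = []      # sequential log, appended before the memtable
        self.memtable = {}        # key -> {column: (timestamp, value)}
        self.sstables = []        # flushed tables, oldest first

    def write(self, key, column, value, timestamp):
        # 1. record the write in the commit log, 2. update the memtable
        self.commit_log.append((key, column, value, timestamp))
        columns = self.memtable.setdefault(key, {})
        current = columns.get(column)
        if current is None or timestamp >= current[0]:
            columns[column] = (timestamp, value)
        if len(self.memtable) >= self.max_keys:
            self.flush()

    def remove(self, key, column, timestamp):
        # A delete is a write of a tombstone, so it can suppress older data
        self.write(key, column, TOMBSTONE, timestamp)

    def flush(self):
        # Write the memtable out as a key-sorted table with a (key, offset) index
        rows = sorted(self.memtable.items(), key=lambda item: item[0])
        index = {key: offset for offset, (key, _) in enumerate(rows)}
        self.sstables.append(SSTable(rows, index))
        self.memtable = {}
        self.commit_log.clear()  # everything in the log is now on "disk"

    def compact(self):
        # Merge keys and combine columns, keeping the newest cell per column
        merged = {}
        for table in self.sstables:
            for key, columns in table.rows:
                target = merged.setdefault(key, {})
                for name, cell in columns.items():
                    if name not in target or cell[0] >= target[name][0]:
                        target[name] = cell
        # Discard tombstones and rows left with no live columns
        rows = []
        for key in sorted(merged):
            live = {n: c for n, c in merged[key].items() if c[1] is not TOMBSTONE}
            if live:
                rows.append((key, live))
        index = {key: offset for offset, (key, _) in enumerate(rows)}
        self.sstables = [SSTable(rows, index)]

    def read(self, key, column):
        # Collect every version of the cell, then let the newest timestamp win
        candidates = []
        if column in self.memtable.get(key, {}):
            candidates.append(self.memtable[key][column])
        for table in self.sstables:
            offset = table.index.get(key)
            if offset is not None:
                cell = table.rows[offset][1].get(column)
                if cell is not None:
                    candidates.append(cell)
        if not candidates:
            return None
        _, value = max(candidates, key=lambda cell: cell[0])
        return None if value is TOMBSTONE else value
```

Until `compact()` runs, a removed column still exists in the older table; the read returns nothing only because the tombstone carries a newer timestamp. That is the reason the deletion marker has to survive until compaction.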
