[Cassandra Wiki] Update of "ArchitectureOverview" by tuxracer69

Apache Wiki Tue, 17 Nov 2009 08:52:09 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.


The "ArchitectureOverview" page has been changed by tuxracer69.
http://wiki.apache.org/cassandra/ArchitectureOverview?action=diff&rev1=6&rev2=7

--------------------------------------------------

     * Data File ('''SSTable'''). An SSTable (terminology borrowed from Google) 
stands for Sorted Strings Table and is a file of key/value string pairs, sorted 
by keys.
     * Index File ('''SSTable Index'''). (Similar to Hadoop !MapFile / Tfile)
       * (Key, offset) pairs (points into data file)
-      * Bloom filter (all keys in data file)
+      * '''Bloom filter''' (all keys in data file). A 
[[http://en.wikipedia.org/wiki/Bloom_filter|Bloom filter]] is a 
space-efficient probabilistic data structure that is used to test whether an 
element is a member of a set. False positives are possible, but false negatives 
are not. Cassandra uses bloom filters to save IO when performing a key lookup: 
each SSTable has a bloom filter associated with it that Cassandra checks before 
doing any disk seeks, making queries for keys that don't exist almost free. 
Bloom filters are surprisingly simple: divide a memory area into buckets (one 
bit per bucket for a standard bloom filter; more, typically four, for a 
counting bloom filter). To insert a key, generate several hashes of the key and 
mark the bucket for each hash. To check whether a key is present, check each of 
those buckets; if any bucket is empty, the key was never inserted in the filter. 
If all buckets are non-empty, though, the key has only probably been inserted: 
other keys' hashes could have covered the same buckets. See 
[[http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html|All 
you ever wanted to know about writing bloom filters]] for details and in 
particular why getting a really good output distribution is important.
+ 
+ 
+ 
   * When a commit log has had all its column families pushed to disk, it is 
deleted
   * '''Compaction''': Data files accumulate over time.  Periodically, data 
files are merge-sorted into a new file (and a new index is created)
     * Merge keys
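
The Bloom filter description above (divide memory into buckets, mark several hashed buckets on insert, check them all on lookup) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Cassandra's implementation: the class name, the method names, the default sizes, and the use of salted SHA-256 to derive several hashes are all hypothetical choices made for readability.

```python
import hashlib


class BloomFilter:
    """Illustrative standard Bloom filter (hypothetical names, not Cassandra's code)."""

    def __init__(self, num_buckets: int = 1024, num_hashes: int = 3):
        self.num_buckets = num_buckets
        self.num_hashes = num_hashes
        # One byte per bucket for clarity; a real filter would pack one bit per bucket.
        self.bits = bytearray(num_buckets)

    def _buckets(self, key: str):
        # Assumption: derive several hashes by salting SHA-256 with the hash number.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_buckets

    def add(self, key: str) -> None:
        # Insert: mark the bucket for each hash of the key.
        for bucket in self._buckets(key):
            self.bits[bucket] = 1

    def might_contain(self, key: str) -> bool:
        # Lookup: any empty bucket means the key was never inserted.
        # All buckets marked means "probably present" (false positives are possible).
        return all(self.bits[bucket] for bucket in self._buckets(key))


if __name__ == "__main__":
    bf = BloomFilter()
    bf.add("row-key-1")
    print(bf.might_contain("row-key-1"))  # an inserted key is always reported as possibly present
    print(bf.might_contain("row-key-2"))  # usually False; True would be a false positive
```

A read path would consult `might_contain` before touching the data file and skip the disk seek whenever it returns `False`, which is why lookups for keys that do not exist are almost free.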
