[Cassandra Wiki] Update of "LargeDataSetConsiderations" by PeterSchuller

Apache Wiki Tue, 27 Dec 2011 10:34:03 -0800

The "LargeDataSetConsiderations" page has been changed by PeterSchuller:
http://wiki.apache.org/cassandra/LargeDataSetConsiderations?action=diff&rev1=20&rev2=21

   * Cassandra will read through sstable index files on start-up, doing what is 
known as "index sampling". This keeps a subset of the keys (currently, and by 
default, 1 out of 100), along with their on-disk locations in the index, in 
memory. See [[ArchitectureInternals]]. The larger the index files are, the 
longer this sampling takes. Thus, for very large indexes (typically when you 
have a very large number of keys), index sampling on start-up may take a 
significant amount of time.
   * A negative side-effect of a large row cache is start-up time. The periodic 
saving of the row cache only saves the keys that are cached; the 
data has to be pre-fetched on start-up. On a large data set, this is probably 
going to be seek-bound, and the time it takes to warm up the row cache will be 
linear with respect to the row cache size (assuming the data set is large 
enough that the seek-bound I/O is not subject to optimization by disks).
    * Potential future improvement: 
[[https://issues.apache.org/jira/browse/CASSANDRA-1625|CASSANDRA-1625]].
-  * The total number of rows per node correlates directly with the size of 
bloom filters and sampled index entries. Expect the base memory requirement of 
a node to increase linearly with the number of keys (assuming the average row 
key size remains constant).
+  * The total number of rows per node correlates directly with the size of 
bloom filters and sampled index entries. Expect the base memory requirement of 
a node to increase linearly with the number of keys (assuming the average row 
key size remains constant). If you are not using caching at all (e.g. you are 
doing analysis-type workloads), expect these to be the two biggest 
consumers of memory.
    * You can decrease the memory use due to index sampling by changing the 
index sampling interval in cassandra.yaml.
    * You should soon be able to tweak the bloom filter sizes too, once 
[[https://issues.apache.org/jira/browse/CASSANDRA-3497|CASSANDRA-3497]] is done.
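
The linear relationships described in the list above can be sketched as a rough back-of-envelope calculation. This is only an illustration, not how Cassandra itself accounts for memory: the function names, the per-entry overhead, the bloom filter bits per key and the seek time are all hypothetical placeholder values chosen for the example. Only the 1-in-100 sampling default is taken from the text above.

```python
def sampled_index_entries(num_keys, sampling_interval=100):
    """Entries kept in memory when 1 out of every `sampling_interval` keys is sampled."""
    # Integer ceiling division: a partial final interval still yields one sample.
    return (num_keys + sampling_interval - 1) // sampling_interval


def index_sample_bytes(num_keys, avg_key_bytes, sampling_interval=100,
                       per_entry_overhead_bytes=32):
    """Approximate memory for sampled index entries.

    per_entry_overhead_bytes is an assumed placeholder covering the on-disk
    position and object overhead; it is not a measured Cassandra figure.
    """
    entries = sampled_index_entries(num_keys, sampling_interval)
    return entries * (avg_key_bytes + per_entry_overhead_bytes)


def bloom_filter_bytes(num_keys, bits_per_key=10):
    """Approximate bloom filter memory; bits_per_key is an assumed value."""
    return (num_keys * bits_per_key + 7) // 8


def row_cache_warmup_seconds(cached_rows, seeks_per_row=1, seek_ms=10.0):
    """Seek-bound warm-up time: linear in the number of cached rows.

    seeks_per_row and seek_ms are assumptions for a spinning disk.
    """
    return cached_rows * seeks_per_row * seek_ms / 1000.0


if __name__ == "__main__":
    keys = 1_000_000_000  # hypothetical number of rows on one node
    print("sampled index bytes:", index_sample_bytes(keys, avg_key_bytes=32))
    print("bloom filter bytes: ", bloom_filter_bytes(keys))
    print("row cache warm-up s:", row_cache_warmup_seconds(1_000_000))
```

Doubling the number of keys doubles both memory estimates, and doubling the sampling interval roughly halves the index sample estimate, which is the trade-off the sampling interval setting exposes.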
