Dear Wiki user, You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.
The "LargeDataSetConsiderations" page has been changed by PeterSchuller. The comment on this change is: Talk about index sampling. http://wiki.apache.org/cassandra/LargeDataSetConsiderations?action=diff&rev1=10&rev2=11 -------------------------------------------------- * Potential future improvements: [[https://issues.apache.org/jira/browse/CASSANDRA-1876|CASSANDRA-1876]] and [[https://issues.apache.org/jira/browse/CASSANDRA-1881|CASSANDRA-1881]] * Consider the choice of file system. Removal of large files is notoriously slow and seek bound on e.g. ext2/ext3. Consider xfs or ext4fs. * Adding nodes is a slow process if each node is responsible for a large amount of data. Plan for this; do not try to throw additional hardware at a cluster at the last minute. + * Cassandra will read through sstable index files on start-up, doing what is known as "index sampling". This is used to keep a subset (currently and by default, 1 out of 100) of keys and and their on-disk location in the index, in memory. See [[ArchitectureInternals]]. This means that the larger the index files are, the longer it takes to perform this sampling. Thus, for very large indexes (typically when you have a very large number of keys) the index sampling on start-up may be a significant issue.
