Dear Wiki user, You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.
The "LargeDataSetConsiderations" page has been changed by PeterSchuller. http://wiki.apache.org/cassandra/LargeDataSetConsiderations
--------------------------------------------------
New page:
This page aims to give some advice on issues one may need to consider when using Cassandra for large data sets, in particular when the amount of data per node is large. The intent is not to make original claims, but to collect in one place issues that are operationally relevant. Other parts of the wiki are highly recommended reading in order to fully understand the issues involved.

This is a work in progress. If you find information that is out of date (e.g., a JIRA ticket referenced here has been resolved but this document has not been updated), please help by editing the page or e-mailing cassandra-user. Unless otherwise noted, the points below refer to Cassandra 0.7 and above.

 * If you have column families with more than 143 million row keys in them, bloom filter false positive rates are likely to go up, because implementation concerns limit the maximum size of a bloom filter. See [[ArchitectureInternals]] for information on how bloom filters are used. The negative effect of hitting this limit is that reads start taking additional seeks to disk as the row count increases. Note that the effect you see at any given moment depends on when compaction was last run, because the bloom filter limit is per-sstable. It is an issue for large column families in particular because after a major compaction, the entire column family will be in a single sstable.
  * This will likely be addressed in the future. TODO: add JIRA links to the bigger-bf and the limit-sstable-size issues.
 * Compaction is currently not concurrent, so only a single compaction runs at a time. This means that sstable counts may spike while a large compaction is in progress, as several smaller sstables are written in the meantime. This can cause additional seeks on reads.
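To get a feel for why a size-capped bloom filter hurts as the row count grows, here is a rough illustration using the standard bloom filter false-positive approximation p ≈ (1 - e^(-kn/m))^k. This is not Cassandra's actual sizing code; the bits-per-key and hash-count figures are made up for the example.

```python
import math

def false_positive_rate(n_keys, m_bits, k_hashes):
    """Approximate bloom filter false-positive probability
    for n_keys inserted into a filter of m_bits using k_hashes."""
    return (1 - math.exp(-k_hashes * n_keys / m_bits)) ** k_hashes

# Illustrative numbers only: suppose the filter was sized for
# 143 million keys at 15 bits per key and cannot grow further.
m = 143_000_000 * 15
k = 10
for n in (143_000_000, 286_000_000, 572_000_000):
    # The rate climbs steeply once n exceeds what m was sized for.
    print(n, false_positive_rate(n, m, k))
```

The takeaway is that once the filter can no longer grow with the key count, every additional doubling of rows makes the false-positive rate climb sharply, and each false positive costs a wasted disk seek.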
  * TODO: link to the parallel compaction JIRA ticket; file another one specifically for ensuring this issue is addressed (the pre-existing ticket only deals with using multiple cores for throughput reasons).
 * Consider the choice of file system. Removal of large files is notoriously slow and seek-bound on e.g. ext2/ext3; consider xfs or ext4 instead.
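One way to watch for the sstable count spikes mentioned above is to count the -Data.db files on disk per column family. A minimal sketch, assuming the default 0.7 data directory layout where sstable files are named <ColumnFamily>-<version>-<generation>-<component> (the keyspace path is a placeholder, and column family names are assumed not to contain dashes):

```python
import os
from collections import Counter

def sstable_counts(keyspace_dir):
    """Count live sstables per column family by listing -Data.db files
    in a keyspace data directory."""
    counts = Counter()
    for name in os.listdir(keyspace_dir):
        if name.endswith("-Data.db"):
            # e.g. "Standard1-e-12-Data.db" -> "Standard1"
            counts[name.split("-")[0]] += 1
    return counts

# Example (hypothetical path):
# sstable_counts("/var/lib/cassandra/data/MyKeyspace")
```

Sampling this periodically and graphing it makes it easy to see counts ballooning while a long-running major compaction holds up the smaller ones.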
