[Cassandra Wiki] Update of "LargeDataSetConsiderations" by PeterSchuller

Apache Wiki Sat, 18 Dec 2010 09:02:39 -0800

The "LargeDataSetConsiderations" page has been changed by PeterSchuller.
http://wiki.apache.org/cassandra/LargeDataSetConsiderations?action=diff&rev1=7&rev2=8

--------------------------------------------------

    * The operating system's page cache is affected by compaction and repair 
operations. If you are relying on the page cache to keep the active set in 
memory, you may see significant degradation in performance as a result of 
compaction and repair operations.
     * Potential future improvements: 
[[https://issues.apache.org/jira/browse/CASSANDRA-1470|CASSANDRA-1470]], 
[[https://issues.apache.org/jira/browse/CASSANDRA-1882|CASSANDRA-1882]].
   * If you have column families with more than 143 million row keys in them, 
bloom filter false positive rates are likely to go up because of implementation 
concerns that limit the maximum size of a bloom filter. See 
[[ArchitectureInternals]] for information on how bloom filters are used. The 
negative effect of hitting this limit is that reads will start taking 
additional seeks to disk as the row count increases. Note that the effect you 
are seeing at any given moment will depend on when compaction was last run, 
because the bloom filter limit is per-sstable. It is an issue for column 
families as a whole because, after a major compaction, the entire column family 
will be in a single sstable.
-   * This will likely be addressed in the future: See 
[[https://issues.apache.org/jira/browse/CASSANDRA-1608|CASSANDRA-1608]] and 
TODO: bigger-bf jira
+   * This will likely be addressed in the future: See 
[[https://issues.apache.org/jira/browse/CASSANDRA-1608|CASSANDRA-1608]] and 
[[https://issues.apache.org/jira/browse/CASSANDRA-1555|CASSANDRA-1555]]
   * Compaction is currently not concurrent, so only a single compaction runs 
at a time. This means that sstable counts may spike during larger compactions 
as several smaller sstables are written while a large compaction is happening. 
This can cause additional seeks on reads.
-   * TODO: link to parallel compaction JIRA ticket, file another one 
specifically for ensuring this issue is addressed (the pre-existing only deals 
with using multiple cores for throughput reasons)
+   * Potential future improvements: 
[[https://issues.apache.org/jira/browse/CASSANDRA-1876|CASSANDRA-1876]] and 
[[https://issues.apache.org/jira/browse/CASSANDRA-1881|CASSANDRA-1881]]
   * Consider the choice of file system. Removal of large files is notoriously 
slow and seek-bound on, e.g., ext2/ext3. Consider XFS or ext4.
   * Adding nodes is a slow process if each node is responsible for a large 
amount of data. Plan for this; do not try to throw additional hardware at a 
cluster at the last minute.
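
The bloom filter point above can be illustrated with a rough calculation. The 
sketch below uses the standard approximation for bloom filter false positive 
rates and shows how the rate grows once a filter can no longer grow with the 
number of row keys. The cap of 2**31 bits and the target of 15 bits per key are 
assumptions, chosen because together they give a limit close to the 143 million 
figure; the page itself does not state them. The way the hash count is chosen is 
likewise an assumption, and the function names are hypothetical.

```python
import math

# Assumption: the filter is capped at 2**31 bits (not stated in the page).
MAX_FILTER_BITS = 2**31
# Assumption: the filter targets 15 bits per key when unconstrained.
TARGET_BITS_PER_KEY = 15


def false_positive_rate(num_keys, num_bits, num_hashes):
    """Standard approximation (1 - e^(-k*n/m))^k for a bloom filter."""
    return (1.0 - math.exp(-num_hashes * num_keys / num_bits)) ** num_hashes


def capped_filter_rate(num_keys):
    """Approximate false positive rate for one sstable with a capped filter."""
    num_bits = min(num_keys * TARGET_BITS_PER_KEY, MAX_FILTER_BITS)
    # Assumption: use the hash count that is optimal for the available bits.
    num_hashes = max(1, round(num_bits / num_keys * math.log(2)))
    return false_positive_rate(num_keys, num_bits, num_hashes)


# Largest key count that still gets the full target of 15 bits per key.
max_keys_at_target = MAX_FILTER_BITS // TARGET_BITS_PER_KEY

if __name__ == "__main__":
    print(f"keys at full size: {max_keys_at_target:,}")
    for keys in (100_000_000, 143_000_000, 300_000_000, 1_000_000_000):
        print(f"{keys:>13,} keys: p = {capped_filter_rate(keys):.6f}")
```

Because the limit is per-sstable, the key count to plug in is the number of row 
keys in the largest sstable, which after a major compaction is the whole column 
family.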
