Dear Wiki user, You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.
The "LargeDataSetConsiderations" page has been changed by PeterSchuller.
The comment on this change is: Reflect that CASSANDRA-1555 is fixed as of 0.7.1.
http://wiki.apache.org/cassandra/LargeDataSetConsiderations?action=diff&rev1=16&rev2=17

--------------------------------------------------

    * Was fixed/improved as of [[https://issues.apache.org/jira/browse/CASSANDRA-1878|CASSANDRA-1878]], for 0.6.9 and 0.7.0.
   * The operating system's page cache is affected by compaction and repair operations. If you are relying on the page cache to keep the active set in memory, you may see significant degradation in performance as a result of compaction and repair operations.
    * Potential future improvements: [[https://issues.apache.org/jira/browse/CASSANDRA-1470|CASSANDRA-1470]], [[https://issues.apache.org/jira/browse/CASSANDRA-1882|CASSANDRA-1882]].
-  * If you have column families with more than 143 million row keys in them, bloom filter false positive rates are likely to go up because of implementation concerns that limit the maximum size of a bloom filter. See [[ArchitectureInternals]] for information on how bloom filters are used. The negative effect of hitting this limit is that reads will start taking additional seeks to disk as the row count increases. Note that the effect you are seeing at any given moment will depend on when compaction was last run, because the bloom filter limit is per-sstable. It is an issue for column families because after a major compaction, the entire column family will be in a single sstable.
+  * Prior to 0.7.1 (fixed in [[https://issues.apache.org/jira/browse/CASSANDRA-1555|CASSANDRA-1555]]), if you had column families with more than 143 million row keys in them, bloom filter false positive rates were likely to go up because of implementation concerns that limited the maximum size of a bloom filter. See [[ArchitectureInternals]] for information on how bloom filters are used. The negative effect of hitting this limit is that reads will start taking additional seeks to disk as the row count increases. Note that the effect you are seeing at any given moment will depend on when compaction was last run, because the bloom filter limit is per-sstable. It is an issue for column families because after a major compaction, the entire column family will be in a single sstable.
-   * This will likely be addressed in the future: See [[https://issues.apache.org/jira/browse/CASSANDRA-1608|CASSANDRA-1608]] and [[https://issues.apache.org/jira/browse/CASSANDRA-1555|CASSANDRA-1555]]
   * Compaction is currently not concurrent, so only a single compaction runs at a time. This means that sstable counts may spike during larger compactions as several smaller sstables are written while a large compaction is happening. This can cause additional seeks on reads.
    * Potential future improvements: [[https://issues.apache.org/jira/browse/CASSANDRA-1876|CASSANDRA-1876]] and [[https://issues.apache.org/jira/browse/CASSANDRA-1881|CASSANDRA-1881]]
   * Consider the choice of file system. Removal of large files is notoriously slow and seek bound on e.g. ext2/ext3. Consider xfs or ext4. This affects the background unlink() of sstables that happens every now and then, and also affects start-up time (if there are sstables pending removal when a node is starting up, they are removed as part of the start-up process; it may thus be detrimental if removing a terabyte of sstables takes an hour (numbers are ballparks, not accurately measured, and depend on circumstances)).
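
The bloom filter item above can be illustrated with a rough calculation. The sketch below uses the standard approximation for a bloom filter's false positive rate, p = (1 - e^(-k*n/m))^k, and shows how the rate stays flat while the filter can grow with the row count, then climbs once the filter size is capped. The constants are assumptions chosen for illustration and are not stated on the page: a cap of 2^31 - 1 bits per filter, a target of 15 bits per row key, and 8 hash functions. Under those assumptions the cap is reached at roughly 143 million keys, which matches the figure quoted above.

```python
import math

# All three constants are assumptions made for this illustration only.
MAX_BITS = 2**31 - 1   # assumed upper bound on filter size, in bits
BITS_PER_KEY = 15      # assumed target number of bits per row key
NUM_HASHES = 8         # assumed number of hash functions


def key_limit(max_bits=MAX_BITS, bits_per_key=BITS_PER_KEY):
    """Largest key count that still gets the full bits-per-key budget."""
    return max_bits // bits_per_key


def false_positive_rate(num_keys, max_bits=MAX_BITS,
                        bits_per_key=BITS_PER_KEY, num_hashes=NUM_HASHES):
    """Approximate false positive rate for one sstable's bloom filter.

    Uses p = (1 - exp(-k * n / m)) ** k, where the filter size m is the
    desired size (n * bits_per_key) clamped to max_bits.
    """
    if num_keys <= 0:
        return 0.0
    bits = min(num_keys * bits_per_key, max_bits)
    return (1.0 - math.exp(-num_hashes * num_keys / bits)) ** num_hashes


if __name__ == "__main__":
    print("key limit under these assumptions:", key_limit())
    for n in (10**6, 10**8, 2 * 10**8, 5 * 10**8, 10**9):
        print(f"{n:>13,d} keys -> p ~ {false_positive_rate(n):.6f}")
```

Because the limit applies per sstable, the relevant key count is that of the largest sstable, which after a major compaction is the whole column family.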
