Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.

The "LargeDataSetConsiderations" page has been changed by PeterSchuller.
The comment on this change is: Reflect that 1878 was fixed for 0.6.9 and 0.7.0.
http://wiki.apache.org/cassandra/LargeDataSetConsiderations?action=diff&rev1=15&rev2=16

--------------------------------------------------

    * Repair operations can increase disk space demands (particularly in 0.6, 
less so in 0.7; TODO: provide actual maximum growth and what it depends on).
   * As your data set grows (assuming it is significantly larger than memory), you become increasingly dependent on caching to elide I/O operations. As you plan and test your capacity, keep in mind that:
    * The Cassandra row cache is in the JVM heap and is unaffected by compactions and repair operations (it remains warm). This is a plus, but the downside is that the row cache is not very memory efficient compared to the operating system page cache.
-   * The key cache is affected by compaction because it is per-sstable, and 
compaction moves data to new sstables.
+   * For 0.6.8 and below, the key cache is affected by compaction because it 
is per-sstable, and compaction moves data to new sstables.
-    * Soon no longer true as of: 
[[https://issues.apache.org/jira/browse/CASSANDRA-1878|CASSANDRA-1878]]
+    * Was fixed/improved as of 
[[https://issues.apache.org/jira/browse/CASSANDRA-1878|CASSANDRA-1878]], for 
0.6.9 and 0.7.0.
    * The operating system's page cache is affected by compaction and repair operations. If you are relying on the page cache to keep the active set in memory, you may see significant performance degradation as a result of compaction and repair operations.
     * Potential future improvements: 
[[https://issues.apache.org/jira/browse/CASSANDRA-1470|CASSANDRA-1470]], 
[[https://issues.apache.org/jira/browse/CASSANDRA-1882|CASSANDRA-1882]].
   * If you have column families with more than 143 million row keys in them, bloom filter false positive rates are likely to go up because of implementation concerns that limit the maximum size of a bloom filter. See [[ArchitectureInternals]] for information on how bloom filters are used. The negative effect of hitting this limit is that reads will start taking additional seeks to disk as the row count increases. Note that the effect you see at any given moment will depend on when compaction was last run, because the bloom filter limit is per-sstable. This is particularly an issue after a major compaction, because the entire column family will then be in a single sstable.
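To see why the false positive rate climbs once a filter can no longer grow with its key count, the standard Bloom filter approximation can be evaluated directly. This is an illustrative sketch, not Cassandra's actual implementation: the bit-array cap of 2**31 and the hash count of 8 are assumptions chosen for the example, and 143 million corresponds to roughly 15 bits per key at that cap.

```python
import math

def bloom_fp_rate(n_keys, m_bits, k_hashes):
    """Approximate false-positive probability of a Bloom filter
    holding n_keys in m_bits of storage with k_hashes hash functions:
    (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-k_hashes * n_keys / m_bits)) ** k_hashes

# Assumed cap: a bitset indexed by a Java int tops out at 2**31 bits.
M = 2 ** 31
K = 8  # assumed hash count, for illustration only

# Around the cited limit the rate is still small; doubling the key count
# while the filter stays capped makes false positives far more likely.
at_limit = bloom_fp_rate(143_000_000, M, K)
over_limit = bloom_fp_rate(286_000_000, M, K)
assert over_limit > at_limit
```

Each false positive costs an extra disk seek into an sstable that does not contain the row, which is why the effect shows up as additional read latency rather than incorrect results.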
