[Cassandra Wiki] Update of "LargeDataSetConsiderations" by PeterSchuller

Apache Wiki Sat, 18 Dec 2010 08:28:50 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.


The "LargeDataSetConsiderations" page has been changed by PeterSchuller.
http://wiki.apache.org/cassandra/LargeDataSetConsiderations?action=diff&rev1=3&rev2=4

--------------------------------------------------

  This page aims to give some advice on the issues one may need to 
consider when using Cassandra for large data sets, in particular when the 
amount of data per node is large. The intent is not to make original claims, 
but to collect in one place some issues that are operationally relevant. Other 
parts of the wiki are highly recommended in order to fully understand the 
issues involved.
  
- This is a work in progress. IF you find information out of date (e.g., a JIRA 
ticket referenced has been resolved but this document has not been updated), 
please help by editing or e-mail:ing cassandra-user.
+ This is a work in progress. If you find information out of date (e.g., a JIRA 
ticket referenced has been resolved but this document has not been updated), 
please help by editing or e-mail:ing cassandra-user.
  
  Unless otherwise noted, the points refer to Cassandra 0.7 and above.
  
   * Disk space usage in Cassandra can vary fairly suddenly over time. If you 
have significant amounts of data such that available disk space is not 
significantly higher than usage, consider:
-   * Compaction of a column family can up to double the disk space used by 
said column family (in the case of a major compaction and no deletions).
+   * Compaction of a column family can up to double the disk space used by 
said column family (in the case of a major compaction and no deletions). If 
your data is predominantly made up of a single, or a select few, column 
families then doubling the disk space for a CF may be a significant amount 
compared to your total disk usage.
    * Repair operations can increase disk space demands (particularly in 0.6, 
less so in 0.7; TODO: provide actual maximum growth and what it depends on).
-  * As your data set becomes larger and larger (assuming significantly larger 
than memory), you become more and more dependent on caching to elide I/O 
operations. As you plan and test your capacity, keep min mind that:
+  * As your data set becomes larger and larger (assuming significantly larger 
than memory), you become more and more dependent on caching to elide I/O 
operations. As you plan and test your capacity, keep in mind that:
-   * The cassandra row cache is in the JVM heap and un-affected (remains warm) 
by compactions and repair operations.
+   * The cassandra row cache is in the JVM heap and unaffected (remains warm) 
by compactions and repair operations. This is a plus, but the down-side is that 
the row cache is not very memory efficient compared to the operating system 
page cache.
    * The key cache is affected by compaction and repair.
     * Soon no longer true as of: TODO: insert jira ticket link
    * The operating system's page cache is affected by compaction and repair 
operations. If you are relying on the page cache to keep the active set in 
memory, you may see significant degradation in performance as a result of 
compaction and repair operations.
@@ -20, +20 @@

   * Compaction is currently not concurrent, so only a single compaction runs 
at a time. This means that sstable counts may spike during larger compactions 
as several smaller sstables are written while a large compaction is happening. 
This can cause additional seeks on reads.
    * TODO: link to parallel compaction JIRA ticket, file another one 
specifically for ensuring this issue is addressed (the pre-existing only deals 
with using multiple cores for throughput reasons)
   * Consider the choice of file system. Removal of large files is notoriously 
slow and seek-bound on, e.g., ext2/ext3. Consider XFS or ext4.
+  * Adding nodes is a slow process if each node is responsible for a large 
amount of data. Plan for this; do not try to throw additional hardware at a 
cluster at the last minute.
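
As a rough illustration of the disk space point above, the following Python 
sketch estimates the worst-case peak disk usage on one node when the largest 
column family goes through a major compaction with no deletions, so that its 
on-disk size temporarily doubles. The function name, the column family names 
and the sizes are hypothetical. It assumes only one compaction runs at a time, 
as described above, and it ignores repair overhead, for which this page gives 
no figure.

```python
def compaction_headroom(cf_sizes_gb, disk_capacity_gb):
    """Estimate worst-case disk usage during a single major compaction.

    cf_sizes_gb: dict mapping column family name -> current on-disk size (GB).
    disk_capacity_gb: usable capacity of the data volume (GB).

    Returns (peak_usage_gb, free_at_peak_gb). A negative second value means
    the node could run out of disk space during the compaction.
    """
    used = sum(cf_sizes_gb.values())
    # Assumption: a major compaction with no deletions rewrites the whole
    # column family, so it needs up to its current size in extra space.
    largest = max(cf_sizes_gb.values(), default=0.0)
    peak = used + largest
    return peak, disk_capacity_gb - peak


if __name__ == "__main__":
    # Hypothetical node dominated by a single column family.
    sizes = {"events": 400.0, "users": 50.0}
    peak, free = compaction_headroom(sizes, disk_capacity_gb=1000.0)
    print("peak usage: %.0f GB, free at peak: %.0f GB" % (peak, free))
```

When one column family dominates, as in this example, the transient extra 
space is close to the node's total data size, which is why such nodes need 
the most headroom.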
