[Cassandra Wiki] Update of "CassandraHardware" by JonathanEllis

Apache Wiki Wed, 13 Jan 2010 07:42:48 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.


The "CassandraHardware" page has been changed by JonathanEllis.
http://wiki.apache.org/cassandra/CassandraHardware?action=diff&rev1=4&rev2=5

--------------------------------------------------

  Cassandra assumes that all nodes have equal capacity.  Violating this 
assumption will lead to poor performance.  Rather than keeping your hardware 
price fixed and adding increasingly powerful machines as Moore's law kicks in, 
keep capacity (relatively) fixed and add increasingly inexpensive ones.  (This 
is easier if you start with relatively powerful machines.)
  
  === Memory ===
- The most recently written data resides in memory tables (aka 
[[MemtableThresholds|memtables]]), but older data that has been flushed to disk 
can be kept in the OS's file-system cache. In other words, ''the more memory, 
the better'', with 1GB being the minimum recommended in a virtualized 
environment.  With dedicated hardware there is no reason to use less than 4GB, 
and at the high end, you see clusters with 16 or 32 GB.
+ The most recently written data resides in memory tables (aka 
[[MemtableThresholds|memtables]]), but older data that has been flushed to disk 
can be kept in the OS's file-system cache. In other words, ''the more memory, 
the better'', with 1GB being the minimum we typically recommend in a 
virtualized environment.  With dedicated hardware there is no reason to use 
less than 4GB, and at the high end, you see clusters with 16 or 32 GB (to 
handle data sets of multiple TB per machine).
  
  === CPU ===
  Many workloads will actually be CPU-bound in Cassandra before being 
memory-bound.  Cassandra is highly concurrent and will make good use of however 
many cores you can give it.  For high-end clusters, quad- or 8-core boxes are 
good.  If you're running on virtualized machines, consider using a provider 
such as Rackspace Cloud Servers that allows CPU bursting.
  
  === Disk ===
- The short answer here is, ''at least 2 disks'', one to keep your 
`CommitLogDirectory` on, the other to use in `DataFileDirectories`. The exact 
answer though depends a lot on your usage so it's important to understand what 
is going on here.
+ The short answer here is that ideally you will have at least 2 disks, one to 
keep your `CommitLogDirectory` on, the other to use in `DataFileDirectories`. 
The exact answer though depends a lot on your usage so it's important to 
understand what is going on here.
  
  Cassandra persists data to disk for two very different purposes. The first is 
when a new write is made, so that it can be replayed after a crash or system 
shutdown. The second is when thresholds are exceeded and memtables are flushed 
to disk as SSTables.
  
- Commit logs receive every write made to a Cassandra node and have the 
potential to block client operations, but they are only ever read on node 
start-up. SSTables writes on the other hand occur asynchronously, but are read 
to satisfy client look-ups. SSTables are also periodically merged and rewritten 
in a process called ''compaction''. Another important distinction is that 
commit logs are purged after the corresponding data has been flushed to disk as 
an SSTable, so `CommitLogDirectory` only holds uncommitted data while the 
directories in `DataFileDirectories` store all of the data written to a node.
+ Commit logs receive every write made to a Cassandra node and have the 
potential to block client operations, but they are only ever read on node 
start-up. SSTable (data file) writes on the other hand occur asynchronously, 
but are read to satisfy client look-ups. SSTables are also periodically merged 
and rewritten in a process called ''compaction''. Another important difference 
between commit logs and SSTables is that commit logs are purged after the 
corresponding data has been flushed to disk as an SSTable, so 
`CommitLogDirectory` only holds uncommitted data while the directories in 
`DataFileDirectories` store all of the data written to a node.
  
- So to summarize, use a different device for your `CommitLogDirectory`; it 
needn't be large, but it should be fast enough to receive all of your writes. 
Then, use one or more devices for `DataFileDirectories` and make sure they are 
both large enough to house all of your data, and fast enough to satisfy your 
reads and to keep up with flushing and compaction.
+ So to summarize, if you use a different device for your `CommitLogDirectory`, 
it needn't be large, but it should be fast enough to receive all of your 
writes. Then, use one or more devices for `DataFileDirectories` and make sure 
they are both large enough to house all of your data, and fast enough to 
satisfy your reads and to keep up with flushing and compaction.
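
As a rough way to check the layout described above, the following sketch reports which data directories sit on the same device as the commit log. It is only an illustration: the function name `dirs_sharing_commitlog_device` is hypothetical and not part of Cassandra, and comparing `st_dev` is a heuristic, since two partitions of one physical disk report different device IDs and would not be flagged.

```python
import os


def dirs_sharing_commitlog_device(commitlog_dir, data_dirs):
    """Return the data directories whose filesystem device ID matches
    that of the commit log directory.

    Hypothetical helper for illustration only. It compares st_dev values,
    so it catches directories on the same mounted filesystem, not
    separate partitions that happen to share one physical disk.
    """
    commit_dev = os.stat(commitlog_dir).st_dev
    return [d for d in data_dirs if os.stat(d).st_dev == commit_dev]
```

Calling it with the path configured as `CommitLogDirectory` and the list of paths configured under `DataFileDirectories` returns an empty list when none of the data directories share a filesystem with the commit log.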
