Hello. I'm developing a system that will require me to store large (<=4MB) columns in Cassandra. Right now I'm storing 1 column per row, in a single CF. The machines I have at my disposal are 32GB-RAM machines with 10 SATA drives each. I would prefer a larger number of smaller nodes, but this is what I have to work with. The issues I'm weighing are RAID0 vs. separate data dirs, and size-tiered vs. leveled compaction. I will have approximately twice as many writes as reads.
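For context, the separate-data-dirs setup I tested is just multiple entries under data_file_directories in cassandra.yaml, one per disk (the mount points below are placeholders, not my actual paths):

```yaml
# cassandra.yaml -- one data directory per physical disk
data_file_directories:
    - /mnt/disk01/cassandra/data
    - /mnt/disk02/cassandra/data
    - /mnt/disk03/cassandra/data
    # ... and so on for the remaining disks

# commit log kept on its own disk, away from the data dirs
commitlog_directory: /mnt/disk10/cassandra/commitlog
```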
RAID0 would let me use the total disk space at each node more efficiently, but my tests have shown that under write load it behaves much worse than separate data dirs, one per disk. In a 3-node cluster, the node with RAID0 kept falling behind the other two nodes, which had separate data dirs.

The problem with separate data dirs is that it seems difficult for Cassandra to use the space efficiently, due to compactions. I first tried the new leveled compaction strategy, which seemed promising since it creates "small" SSTables that can be spread across the data dirs, but the IO this strategy requires under write load is enormous. It was compacting constantly, and this hurt write throughput because it slowed the flushing of memtables.

I then tried size-tiered compaction, which performed better, but since it tends to create large SSTables, these cannot be split across the multiple data dirs.

What I'm thinking of doing now is using multiple data dirs with size-tiered compaction, and dividing the input data across several (64) different CFs. This way smaller SSTables will be created, and these can be split across the multiple data dirs. That should let me make better use of the available capacity, and I will not need as much free space for compactions as I would if the SSTables were larger.

Am I missing something here? Is this the best way to deal with this (abnormal) use case?

Thanks and best regards,
André Cruz
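To make the 64-CF idea concrete, here is a minimal sketch of how I'd route each row to one of the 64 CFs on the client side; the CF naming scheme ("blobs_NN") is hypothetical, and hashing the row key (rather than, say, round-robin) keeps the mapping stable so reads can recompute which CF a given key landed in:

```python
import hashlib

NUM_CFS = 64  # number of column families the data is divided across


def cf_for_key(row_key: str) -> str:
    """Pick one of NUM_CFS column families by hashing the row key.

    The hash is deterministic, so the same key always maps to the
    same CF, and MD5 spreads keys roughly uniformly over the buckets.
    CF names like "blobs_00".."blobs_63" are placeholders.
    """
    h = int(hashlib.md5(row_key.encode("utf-8")).hexdigest(), 16)
    return "blobs_%02d" % (h % NUM_CFS)
```

Writes and reads would both go through cf_for_key() to decide which CF to address, so no extra lookup table is needed.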