Cassandra as a key/object store for many small (10-60k) files

Jonathan Guberman Fri, 05 May 2017 12:20:06 -0700

Hello,

We’re currently testing Cassandra for use as a pure key-object store for data 
blobs around 10kB - 60kB each. Our use case is storing on the order of 10 
billion objects with about 5-20 million new writes per day. A written object 
will never be updated or deleted. Objects will be read at least once, some time 
within 10 days of being written. This will generally happen as a batch; that 
is, all of the images written on a particular day will be read together at the 
same time. This batch read will only happen one time; future reads will happen 
on individual objects, with no grouping, and they will follow a long-tail 
distribution, with popular objects read thousands of times per year but most 
read never or virtually never.
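
For a rough sense of scale, the figures above work out as follows. This is only a back-of-the-envelope sketch: it assumes 1 kB = 1000 bytes and ignores replication, compression, and any per-row storage overhead, none of which are stated above.

```python
# Back-of-the-envelope sizing from the figures quoted above.
# Assumptions (not measurements): 1 kB = 1000 bytes; no replication,
# compression, or storage overhead is included.
KB = 1000
TB = 1000 ** 4

OBJECT_SIZE_BYTES = (10 * KB, 60 * KB)        # smallest and largest blob
TOTAL_OBJECTS = 10_000_000_000                # on the order of 10 billion
WRITES_PER_DAY = (5_000_000, 20_000_000)      # new objects per day


def size_range(count_low, count_high):
    """Return (low, high) bytes for a range of object counts."""
    return (count_low * OBJECT_SIZE_BYTES[0], count_high * OBJECT_SIZE_BYTES[1])


total_low, total_high = size_range(TOTAL_OBJECTS, TOTAL_OBJECTS)
daily_low, daily_high = size_range(*WRITES_PER_DAY)

print(f"total data: {total_low / TB:.0f} to {total_high / TB:.0f} TB")
print(f"daily writes: {daily_low / TB:.2f} to {daily_high / TB:.2f} TB")
```

So the full data set is in the range of hundreds of terabytes before replication, and a single day's batch is somewhere between tens of gigabytes and about a terabyte.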


I’ve set up a small four node test cluster and have written test scripts to 
benchmark writing and reading our data. The table I’ve set up is very simple: 
an ascii primary key column with the object ID and a blob column for the data. 
All other settings were left at their defaults.
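
For reference, the table is roughly equivalent to the CQL below, kept in a Python string so it can be dropped into a test script. The keyspace, table, and column names are placeholders for illustration, not necessarily the exact ones in use.

```python
# CQL for the test table as described: an ascii key and a blob value,
# with every table option left at its default.
# The keyspace, table, and column names are hypothetical placeholders.
KEYSPACE = "blobstore"
TABLE = "objects"

CREATE_TABLE_CQL = (
    f"CREATE TABLE IF NOT EXISTS {KEYSPACE}.{TABLE} (\n"
    "    object_id ascii PRIMARY KEY,\n"
    "    data blob\n"
    ");"
)

print(CREATE_TABLE_CQL)
```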
 
I’ve found write speeds to be very fast most of the time. However, 
periodically, writes will slow to a crawl for anywhere from half an hour to 
two hours, after which speeds recover to their previous levels. I assume this 
is some sort of data compaction or flushing to disk, but I haven’t been able to 
figure out the exact cause.

Read speeds have been more disappointing. Cached reads are very fast, but 
random read speed averages about 2 MB/sec, which is too slow when we need to 
read out a batch of several million objects. I don’t think it’s reasonable to 
assume that these rows will all still be cached by the time we need to read 
them for that first large batch read.
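
To show why 2 MB/sec is a problem, here is the arithmetic for one day's batch. This is a sketch that assumes 1 MB = 1,000,000 bytes and that the observed average rate holds for the whole batch, which I have not verified.

```python
# Estimated wall-clock time to read one day's batch at the observed
# random-read rate. Assumes 1 MB = 1,000,000 bytes and that the 2 MB/sec
# average holds for the whole batch (an assumption, not a measurement).
MB = 1_000_000
READ_RATE_BYTES_PER_SEC = 2 * MB


def batch_read_hours(object_count, object_size_bytes):
    """Hours needed to read object_count objects of the given size."""
    total_bytes = object_count * object_size_bytes
    return total_bytes / READ_RATE_BYTES_PER_SEC / 3600


best_case = batch_read_hours(5_000_000, 10_000)     # small day, small blobs
worst_case = batch_read_hours(20_000_000, 60_000)   # large day, large blobs

print(f"best case: {best_case:.1f} hours")
print(f"worst case: {worst_case:.1f} hours ({worst_case / 24:.1f} days)")
```

Even the smallest batch would take several hours at that rate, and the largest would take most of a week.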

My general question is whether anyone has any suggestions for how to improve 
performance for our use case. More specifically:

- Is there a way to mitigate or eliminate the huge slowdowns I see when writing 
millions of rows?
- Are there settings I should be using in order to maximize read speeds for 
random reads?
- Is there a way to design our tables to improve the read speeds for the 
initial large batched reads? I was thinking of using a batch ID column that 
could be used to retrieve the data for the initial block. However, future reads 
would need to be done by the object ID, not the batch ID, so it seems to me I’d 
need to duplicate the data: one copy in an “objects by batch” table and the 
other in a simple “objects” table. Is there a better approach than this?
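
The duplication I have in mind would look roughly like the sketch below. All table and column names are hypothetical. The `bucket` column and the value of `NUM_BUCKETS` are my own assumptions, added so that a single batch of millions of objects does not land in one partition. The statements are plain CQL strings with `?` bind markers; no driver calls are shown.

```python
import zlib

# Hypothetical names throughout. The bucket column is an assumption of this
# sketch: it spreads one batch across many partitions. 1024 is arbitrary.
NUM_BUCKETS = 1024

CREATE_OBJECTS_BY_BATCH_CQL = """
CREATE TABLE IF NOT EXISTS objects_by_batch (
    batch_id ascii,
    bucket int,
    object_id ascii,
    data blob,
    PRIMARY KEY ((batch_id, bucket), object_id)
);
"""

INSERT_OBJECT_CQL = "INSERT INTO objects (object_id, data) VALUES (?, ?);"
INSERT_BY_BATCH_CQL = (
    "INSERT INTO objects_by_batch (batch_id, bucket, object_id, data) "
    "VALUES (?, ?, ?, ?);"
)


def bucket_for(object_id):
    """Map an object ID to a stable bucket number in [0, NUM_BUCKETS)."""
    return zlib.crc32(object_id.encode("ascii")) % NUM_BUCKETS


def writes_for(batch_id, object_id, data):
    """Return the (statement, parameters) pairs that store one object twice."""
    return [
        (INSERT_OBJECT_CQL, (object_id, data)),
        (INSERT_BY_BATCH_CQL, (batch_id, bucket_for(object_id), object_id, data)),
    ]


# Example: show the two writes for one object (blob omitted from the output).
for statement, params in writes_for("2017-05-05", "obj-0001", b"\x00\x01"):
    print(statement, params[:-1])
```

With this layout, the initial batch read would query each bucket of a given batch ID in turn, while later single-object reads would go to the “objects” table by object ID.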

Thank you!



