Hey,

we are trying Cassandra as alternative storage for a huge stream of data
coming from our customers.

Storing works quite well, and I have started to validate how retrieval
performs. We have two types of retrieval: fetching specific records and bulk
retrieval for general analysis.
Fetching a single record works like a charm. But it is not so with bulk fetch.

With a moderately small table of ~2 million records (~10 GB of raw data) I
observed very slow operation (using token(partition key) ranges). It takes
minutes to perform a full retrieval. We tried a couple of configurations,
on both virtual machines and real hardware, and overall it looks like it is
not possible to read all table data in a reasonable time. (By reasonable I
mean that since we have a 1 Gbit network, 10 GB can be transferred between
two servers in a couple of minutes, and since we have 10+ Cassandra servers
and 10+ Spark executors, the total time should be even smaller.)
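For context, here is a rough sketch (not our actual code) of the token-range approach I mean: split the full Murmur3 token range into contiguous slices so that each Spark executor or thread can scan one slice with `SELECT ... WHERE token(pk) > lo AND token(pk) <= hi`. The class and method names are hypothetical, and a real scanner would also have to handle the very first token, which the strict `>` comparison skips:

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: split the Murmur3Partitioner token space,
// which spans [-2^63, 2^63 - 1], into n contiguous (lo, hi] slices.
public final class TokenRanges {
    static final BigInteger MIN = BigInteger.valueOf(Long.MIN_VALUE);
    static final BigInteger MAX = BigInteger.valueOf(Long.MAX_VALUE);

    public static List<long[]> split(int n) {
        BigInteger total = MAX.subtract(MIN);              // 2^64 - 1 tokens
        BigInteger step = total.divide(BigInteger.valueOf(n));
        List<long[]> ranges = new ArrayList<>();
        BigInteger lo = MIN;
        for (int i = 0; i < n; i++) {
            // Last slice is clamped to MAX so no tokens are dropped
            // by integer division rounding.
            BigInteger hi = (i == n - 1) ? MAX : lo.add(step);
            ranges.add(new long[] { lo.longValueExact(), hi.longValueExact() });
            lo = hi;
        }
        return ranges;
    }
}
```

Each `[lo, hi]` pair then parameterizes one range query; the Spark connector does essentially this internally, just aligned with the cluster's actual token ownership.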

I tried the DataStax Spark connector. I also wrote a simple test case using
the DataStax Java driver and saw that fetching 10k records takes ~10s, so I
assume a "sequential" scan would take 200x more time, i.e. ~30 minutes.
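The extrapolation above, spelled out as a small worked calculation (assuming, as a simplification, that the ~1k records/s rate measured on the 10k-record sample stays constant across the whole table):

```java
// Back-of-envelope estimate: 10k records in ~10s -> ~1000 records/s;
// 2M records / 1000 records/s = 2000 s, i.e. roughly 33 minutes.
public class ScanEstimate {
    public static double fullScanSeconds(long totalRecords,
                                         long sampleRecords,
                                         double sampleSeconds) {
        double recordsPerSecond = sampleRecords / sampleSeconds;
        return totalRecords / recordsPerSecond;
    }

    public static void main(String[] args) {
        double s = fullScanSeconds(2_000_000L, 10_000L, 10.0);
        System.out.printf("~%.0f s (~%.0f min)%n", s, s / 60.0);
    }
}
```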

Maybe we are totally wrong in trying to use Cassandra this way?

-- 

Best Regards,


*Alexander Kotelnikov*

*Team Lead*

DIGINETICA
Retail Technology Company

m: +7.921.915.06.28

*www.diginetica.com <http://www.diginetica.com/>*
