I received a grant to do some analysis on netflow data (local IP address, local 
port, remote IP address, remote port, time, number of packets, etc.) using 
Cassandra and Spark. The de-normalized data set is about 13 TB out the door. I 
plan on using 9 Cassandra nodes (replication factor = 3) to store the data, 
with Spark doing the aggregation. 

The data set will be immutable once loaded, and I'm using replication factor 3 
to somewhat simulate the real world. Most of the analysis will be of the sort: 
"Give me all the remote IP addresses for source IP 'X' between times t1 and t2." 
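For concreteness, this is roughly the table shape I have in mind for that access pattern (table and column names are just placeholders, not my final schema): partition on the local IP with time as the first clustering column, so the query above becomes a single-partition slice.

```sql
-- Hypothetical schema sketch; keyspace/column names are made up.
CREATE TABLE netflow.flows_by_local_ip (
    local_ip    inet,
    ts          timestamp,
    remote_ip   inet,
    remote_port int,
    local_port  int,
    packets     bigint,
    PRIMARY KEY ((local_ip), ts, remote_ip, remote_port)
);

-- "All remote IPs for source IP 'X' between t1 and t2":
SELECT remote_ip FROM netflow.flows_by_local_ip
WHERE local_ip = '10.0.0.1'
  AND ts >= '2016-01-01' AND ts < '2016-02-01';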

I built and tested a bulk loader following this example on GitHub: 
https://github.com/yukim/cassandra-bulkload-example to generate the SSTables, 
but I have not run it against the entire data set yet. 

Any advice on how to execute the bulk load under this configuration? Can I 
generate the SSTables in parallel? Once generated, can I write the SSTables to 
all nodes simultaneously? Should I be doing any kind of sorting by the 
partition key? 
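To make the parallel-generation question concrete, this is the kind of splitting I was picturing (a minimal Python sketch of the idea only, not my actual loader; the record format and worker count are made up): route records by a hash of the partition key so each worker sees a disjoint set of partitions and can write its own SSTable output directory independently.

```python
import hashlib
from collections import defaultdict

N_WORKERS = 4  # hypothetical; one SSTable output directory per worker


def worker_for(local_ip: str) -> int:
    """Route every record for a given partition key (local_ip) to the
    same worker, so no two workers ever write the same partition."""
    digest = hashlib.md5(local_ip.encode()).digest()
    return int.from_bytes(digest[:8], "big") % N_WORKERS


def split_records(records):
    """Bucket (local_ip, ...) records into per-worker lists; each bucket
    could then feed its own SSTable writer in a separate process."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[worker_for(rec[0])].append(rec)
    return buckets


records = [("10.0.0.1", 1), ("10.0.0.2", 2), ("10.0.0.1", 3)]
buckets = split_records(records)
# By construction, both "10.0.0.1" rows land in the same bucket.
```

The point of hashing on the partition key (rather than just chunking the input file) is that each worker's SSTables then cover disjoint partitions, which seems like it should avoid cross-worker conflicts when streaming them in.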

This is a lot of data, so I figured I'd ask before I pulled the trigger. Thanks 
in advance! 

