I received a grant to do some analysis on netflow data (local IP address, local port, remote IP address, remote port, time, # of packets, etc.) using Cassandra and Spark. The de-normalized data set is about 13 TB out the door. I plan on using 9 Cassandra nodes (replication factor = 3) to store the data, with Spark doing the aggregation.
The data set will be immutable once loaded, and I am using replication factor = 3 to somewhat simulate the real world. Most of the analysis will be of the sort "give me all the remote IP addresses for source IP X between time t1 and t2."

I built and tested a bulk loader following this example on GitHub: https://github.com/yukim/cassandra-bulkload-example to generate the SSTables, but I have not executed it on the entire data set yet. Any advice on how to execute the bulk load under this configuration? Specifically:

- Can I generate the SSTables in parallel?
- Once generated, can I write the SSTables to all nodes simultaneously?
- Should I be doing any kind of sorting by the partition key?

This is a lot of data, so I figured I'd ask before I pulled the trigger. Thanks in advance!
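For reference, the query pattern above maps naturally onto a table partitioned by the local IP with time as a clustering column, so each query is a single-partition slice. This is only a sketch of what I have in mind; the keyspace, table, and column names are placeholders, not my actual schema:

```sql
CREATE TABLE netflow.flows (
    local_ip    text,       -- partition key: one partition per local IP
    flow_time   timestamp,  -- clustering column: rows sorted by time within the partition
    remote_ip   text,       -- second clustering column, keeps rows unique per (time, remote IP)
    local_port  int,
    remote_port int,
    packets     bigint,
    PRIMARY KEY (local_ip, flow_time, remote_ip)
);

-- The target query then becomes a range scan within one partition:
-- SELECT remote_ip FROM netflow.flows
--  WHERE local_ip = 'X' AND flow_time >= t1 AND flow_time < t2;
```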
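My bulk loader follows the linked example's pattern of writing SSTables offline with `CQLSSTableWriter` and then streaming them in with `sstableloader`. A rough sketch of the writer side, to show what I mean (the schema, paths, and sample values here are made up for illustration, and this depends on the `cassandra-all` jar, so it won't compile standalone):

```java
import java.io.File;
import java.io.IOException;
import org.apache.cassandra.io.sstable.CQLSSTableWriter;

public class FlowBulkWriter {
    // Hypothetical schema: partitioned by local IP, clustered by time.
    static final String SCHEMA =
        "CREATE TABLE netflow.flows ("
        + " local_ip text, flow_time timestamp, remote_ip text,"
        + " local_port int, remote_port int, packets bigint,"
        + " PRIMARY KEY (local_ip, flow_time, remote_ip))";

    static final String INSERT =
        "INSERT INTO netflow.flows"
        + " (local_ip, flow_time, remote_ip, local_port, remote_port, packets)"
        + " VALUES (?, ?, ?, ?, ?, ?)";

    public static void main(String[] args) throws IOException {
        // Output directory should be <keyspace>/<table> so sstableloader
        // can pick it up directly afterwards.
        CQLSSTableWriter writer = CQLSSTableWriter.builder()
            .inDirectory(new File("data/netflow/flows"))
            .forTable(SCHEMA)
            .using(INSERT)
            .build();

        // One addRow per flow record, values in the same order as the INSERT.
        writer.addRow("10.0.0.1", new java.util.Date(), "93.184.216.34", 443, 51000, 12L);

        writer.close(); // flushes the final SSTable to disk
    }
}
```

After generation, my understanding is the resulting directory gets streamed into the cluster with something like `sstableloader -d node1,node2 data/netflow/flows`, where any live node works as the initial contact point.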