+1 on Parquet and S3.

Combined with Spark running on spot instances, your grant money will go much
further!
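
For the kind of query quoted below, a minimal Spark-on-Parquet sketch might
look like this (bucket path and column names are made up for illustration):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("netflow-analysis").getOrCreate()
    import spark.implicits._

    // Netflow records written as Parquet to a made-up S3 bucket
    val flows = spark.read.parquet("s3a://my-grant-bucket/netflow/")

    // "All remote IP addresses for source IP 'X' between t1 and t2"
    flows.filter($"local_ip" === "10.0.0.1" &&
                 $"time".between("2016-11-01", "2016-11-17"))
         .select("remote_ip")
         .distinct()
         .show()

If the files are sorted by local_ip before writing, Parquet's row-group
stats can let Spark skip most of the 13TB on queries like that one.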

On Thu, 17 Nov 2016 at 07:21 Jonathan Haddad <j...@jonhaddad.com> wrote:

> If you're only doing this for Spark, you'll be much better off using
> Parquet and HDFS or S3. While you *can* do analytics with Cassandra, it's
> not all that great at it.
> On Thu, Nov 17, 2016 at 6:05 AM Joe Olson <technol...@nododos.com> wrote:
>
> I received a grant to do some analysis on netflow data (local IP address,
> local port, remote IP address, remote port, time, # of packets, etc.) using
> Cassandra and Spark. The denormalized data set is about 13TB out the door.
> I plan on using 9 Cassandra nodes (replication factor = 3) to store the
> data, with Spark doing the aggregation.
>
> The data set will be immutable once loaded, and I am using replication
> factor 3 to somewhat simulate the real world. Most of the analysis will be
> of the sort "give me all the remote IP addresses for source IP 'X' between
> time t1 and t2."
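>
> In Spark that query comes out roughly as follows (keyspace, table, and
> column names here are placeholders):
>
>     import com.datastax.spark.connector._
>
>     // Partition on local_ip, cluster by time: the t1..t2 slice is then a
>     // single sequential read within one partition.
>     val t1 = java.sql.Timestamp.valueOf("2016-11-01 00:00:00")
>     val t2 = java.sql.Timestamp.valueOf("2016-11-17 00:00:00")
>     val remoteIps = sc.cassandraTable("netflow", "flows")
>       .select("remote_ip")
>       .where("local_ip = ? AND time >= ? AND time < ?", "10.0.0.1", t1, t2)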
>
> I built and tested a bulk loader following this example on GitHub:
> https://github.com/yukim/cassandra-bulkload-example to generate the
> SSTables, but I have not executed it on the entire data set yet.
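>
> The writer setup from that example boils down to roughly this (the schema
> and output path here are illustrative, not my actual ones):
>
>     import org.apache.cassandra.io.sstable.CQLSSTableWriter
>
>     // Illustrative schema: partition on local_ip, cluster by time so the
>     // t1..t2 slices above are contiguous on disk.
>     val schema =
>       """CREATE TABLE netflow.flows (
>         |  local_ip text, time timestamp, remote_ip text,
>         |  local_port int, remote_port int, packets bigint,
>         |  PRIMARY KEY (local_ip, time, remote_ip)
>         |)""".stripMargin
>
>     val writer = CQLSSTableWriter.builder()
>       .inDirectory("/data/sstables/netflow/flows")   // made-up output dir
>       .forTable(schema)
>       .using("INSERT INTO netflow.flows (local_ip, time, remote_ip, " +
>              "local_port, remote_port, packets) VALUES (?, ?, ?, ?, ?, ?)")
>       .build()
>
>     // writer.addRow(...) per record, then writer.close(); the default
>     // (buffered) builder does not require rows sorted by partition key.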
>
> Any advice on how to execute the bulk load under this configuration?  Can
> I generate the SSTables in parallel? Once generated, can I write the
> SSTables to all nodes simultaneously? Should I be doing any kind of sorting
> by the partition key?
>
> This is a lot of data, so I figured I'd ask before I pulled the trigger.
> Thanks in advance!
>
>
--
Ben Bromhead
CTO | Instaclustr <https://www.instaclustr.com/>
+1 650 284 9692
Managed Cassandra / Spark on AWS, Azure and Softlayer
