Re: Any Bulk Load on Large Data Set Advice?

2016-11-17 Thread Jeff Jirsa
Other people are commenting on the appropriateness of Cassandra – they may have
a point you should consider, but I’m going to answer the question.

1) Yes, you can generate the SSTables in parallel.

2) If you use the sstable bulk loader interface (sstableloader), it’ll stream
to all appropriate nodes. You can run sstableloader from multiple nodes at the
same time as well.

3) Sorting by partition key probably won’t hurt. If you run jobs in parallel,
dividing them up by partition key seems like a good way to parallelize your
task.

 

We do something like this in certain parts of our workflow, and it works well.  
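
To make 1) and 3) concrete, here is a minimal sketch of one generation worker
built on CQLSSTableWriter, the same API the bulkload example you linked uses.
The keyspace, table, and column names are invented to match the query pattern
in your post:

    import org.apache.cassandra.dht.Murmur3Partitioner;
    import org.apache.cassandra.io.sstable.CQLSSTableWriter;

    public class NetflowSSTableWorker {
        // Invented schema: partition by source IP, cluster by time, so
        // "remote IPs for source IP 'X' between t1 and t2" is one slice
        // of one partition.
        static final String SCHEMA =
            "CREATE TABLE netflow.flows (" +
            "  src_ip text, ts timestamp, remote_ip text, remote_port int, packets bigint," +
            "  PRIMARY KEY ((src_ip), ts, remote_ip, remote_port))";
        static final String INSERT =
            "INSERT INTO netflow.flows (src_ip, ts, remote_ip, remote_port, packets) " +
            "VALUES (?, ?, ?, ?, ?)";

        public static void main(String[] args) throws Exception {
            // args[0] is a directory unique to this worker, ending in
            // .../netflow/flows so sstableloader can infer keyspace and table.
            CQLSSTableWriter writer = CQLSSTableWriter.builder()
                    .inDirectory(args[0])
                    .forTable(SCHEMA)
                    .using(INSERT)
                    .withPartitioner(new Murmur3Partitioner())
                    .build();
            // A real worker would iterate its slice of the input here, e.g.
            // the rows falling in its assigned range of partition keys.
            writer.addRow("10.0.0.1", new java.util.Date(), "203.0.113.7", 443, 1200L);
            writer.close();
        }
    }

Each worker gets its own output directory because CQLSSTableWriter is not
thread-safe. Once a directory is complete, sstableloader -d <node address> <dir>
streams it to every replica that owns the data, and several of those loads can
run at the same time, per 2).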

 

 

 



Re: Any Bulk Load on Large Data Set Advice?

2016-11-17 Thread Ben Bromhead
+1 on Parquet and S3.

Combined with Spark running on spot instances, your grant money will go much
further!
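
If you go that route, the load step is just a Spark job writing Parquet to S3.
A rough sketch, where the bucket, paths, and the day-based partitioning choice
are all placeholders:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.to_date;

    public class NetflowToParquet {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("netflow-to-parquet")
                    .getOrCreate();
            // Raw denormalized netflow export; CSV is a stand-in for whatever
            // format the 13TB actually arrives in.
            Dataset<Row> flows = spark.read()
                    .option("header", "true")
                    .option("inferSchema", "true")
                    .csv("s3a://example-bucket/netflow-raw/");
            // Partitioning the files by day lets time-bounded queries skip
            // most of the data set entirely.
            flows.withColumn("day", to_date(col("ts")))
                 .write()
                 .partitionBy("day")
                 .parquet("s3a://example-bucket/netflow-parquet/");
            spark.stop();
        }
    }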

--
Ben Bromhead
CTO | Instaclustr 
+1 650 284 9692
Managed Cassandra / Spark on AWS, Azure and Softlayer


Re: Any Bulk Load on Large Data Set Advice?

2016-11-17 Thread Jonathan Haddad
If you're only doing this for Spark, you'll be much better off using
Parquet and HDFS or S3. While you *can* do analytics with Cassandra, it's
not all that great at it.
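
For a sense of what that looks like, the original query against Parquet is
only a few lines; the paths and column names below are invented for
illustration:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import static org.apache.spark.sql.functions.col;

    public class RemoteIpsForSource {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("remote-ips-for-source")
                    .getOrCreate();
            Dataset<Row> flows = spark.read()
                    .parquet("s3a://example-bucket/netflow-parquet/");
            // "Give me all the remote ip addresses for source IP 'X'
            // between time t1 and t2"
            flows.filter(col("src_ip").equalTo("198.51.100.17")
                    .and(col("ts").between("2016-11-01", "2016-11-17")))
                 .select("remote_ip")
                 .distinct()
                 .show();
        }
    }

The time filter plus Parquet's per-file statistics mean Spark only reads a
small fraction of the files, which is exactly the kind of scan Cassandra is
not great at.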


Any Bulk Load on Large Data Set Advice?

2016-11-17 Thread Joe Olson
I received a grant to do some analysis on netflow data (Local IP address, Local 
Port, Remote IP address, Remote Port, time, # of packets, etc) using Cassandra 
and Spark. The de-normalized data set is about 13TB out the door. I plan on 
using 9 Cassandra nodes (replication factor=3) to store the data, with Spark 
doing the aggregation. 

The data set will be immutable once loaded, and I am using replication factor = 3
to somewhat simulate the real world. Most of the analysis will be of the sort
"Give me all the remote IP addresses for source IP 'X' between time t1 and t2".

I built and tested a bulk loader following this example in GitHub: 
https://github.com/yukim/cassandra-bulkload-example to generate the SSTables, 
but I have not executed it on the entire data set yet. 

Any advice on how to execute the bulk load under this configuration? Can I 
generate the SSTables in parallel? Once generated, can I write the SSTables to 
all nodes simultaneously? Should I be doing any kind of sorting by the 
partition key? 

This is a lot of data, so I figured I'd ask before I pulled the trigger. Thanks 
in advance!