Re: Any Bulk Load on Large Data Set Advice?

2016-11-17 Thread Jeff Jirsa
> I received a grant to do some analysis on netflow data (Local IP address, Local Port, Remote IP address, Remote Port, time, # of packets, etc) using Cassandra and Spark. The de-normalized data set is about 13TB out the door. I plan on

Re: Any Bulk Load on Large Data Set Advice?

2016-11-17 Thread Ben Bromhead
+1 on Parquet and S3. Combined with Spark running on spot instances, your grant money will go much further! On Thu, 17 Nov 2016 at 07:21 Jonathan Haddad wrote: > If you're only doing this for Spark, you'll be much better off using Parquet and HDFS or S3. While you *can* do

Re: Any Bulk Load on Large Data Set Advice?

2016-11-17 Thread Jonathan Haddad
If you're only doing this for Spark, you'll be much better off using Parquet and HDFS or S3. While you *can* do analytics with Cassandra, it's not all that great at it. On Thu, Nov 17, 2016 at 6:05 AM Joe Olson wrote: > I received a grant to do some analysis on netflow
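The reason Parquet suits this workload is its columnar layout: an aggregation over one field reads only that field's bytes, not whole records. A minimal stdlib-only sketch of the idea, using plain Python structures as stand-ins for the on-disk layouts (field names mirror the netflow schema from the original post; no real Parquet files are involved):

```python
# Row-oriented layout: one tuple per flow record, as Cassandra or a CSV
# would store it. (local_ip, local_port, remote_ip, remote_port, time, packets)
rows = [
    ("10.0.0.1", 443, "8.8.8.8", 53211, 1479340800, 12),
    ("10.0.0.2", 80,  "1.1.1.1", 41022, 1479340805, 7),
    ("10.0.0.1", 22,  "8.8.4.4", 50100, 1479340810, 3),
]

# Columnar layout: one list per field, which is what Parquet does on disk.
columns = {
    "local_ip":    [r[0] for r in rows],
    "local_port":  [r[1] for r in rows],
    "remote_ip":   [r[2] for r in rows],
    "remote_port": [r[3] for r in rows],
    "time":        [r[4] for r in rows],
    "packets":     [r[5] for r in rows],
}

# Summing packets from the row layout touches every full record...
total_row_scan = sum(r[5] for r in rows)

# ...while the columnar layout scans only the 'packets' column.
total_col_scan = sum(columns["packets"])

assert total_row_scan == total_col_scan == 22
```

At 13TB, that difference is the gap between scanning the whole data set and scanning one column of it, which is why the replies steer analytics toward Parquet on HDFS or S3.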

Any Bulk Load on Large Data Set Advice?

2016-11-17 Thread Joe Olson
I received a grant to do some analysis on netflow data (Local IP address, Local Port, Remote IP address, Remote Port, time, # of packets, etc) using Cassandra and Spark. The de-normalized data set is about 13TB out the door. I plan on using 9 Cassandra nodes (replication factor=3) to store the
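A rough sizing check for the cluster described above, assuming even token distribution and ignoring compression and compaction headroom (both matter in practice):

```python
# Back-of-envelope per-node load: 13 TB of de-normalized data,
# 9 nodes, replication factor 3. Every byte is stored 3 times,
# so the raw cluster-wide footprint is 3x the logical data set.
data_tb = 13
replication_factor = 3
nodes = 9

raw_tb = data_tb * replication_factor   # total bytes stored across the cluster
per_node_tb = raw_tb / nodes            # average load per node

print(f"raw: {raw_tb} TB, per node: {per_node_tb:.1f} TB")
# raw: 39 TB, per node: 4.3 TB
```

So each node carries roughly 4.3TB before any Cassandra overhead, which is worth checking against the disk budget before the bulk load starts.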