Subject: Any Bulk Load on Large Data Set Advice?
+1 on parquet and S3.
Combined with Spark running on spot instances, your grant money will go much
further!
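To make the "parquet and S3" suggestion concrete: a Spark job would typically write the netflow records out day-partitioned, e.g. via `df.write.partitionBy("date").parquet("s3://<bucket>/netflow")`, so each day's flows land under their own S3 prefix. A minimal stdlib sketch of that partition layout (the bucket name and helper function are hypothetical, for illustration only):

```python
from datetime import datetime, timezone

# Hypothetical helper: map a netflow record's epoch timestamp to the
# day-partitioned S3 prefix that a Spark job writing
#   df.write.partitionBy("date").parquet("s3://analysis-bucket/netflow")
# would produce. The bucket name is made up for illustration.
def partition_path(ts_epoch: int, bucket: str = "analysis-bucket") -> str:
    day = datetime.fromtimestamp(ts_epoch, tz=timezone.utc).strftime("%Y-%m-%d")
    return f"s3://{bucket}/netflow/date={day}/"

# A flow observed on 2016-11-17 (UTC) lands in that day's partition:
print(partition_path(1479340800))
# → s3://analysis-bucket/netflow/date=2016-11-17/
```

Partitioning by day keeps Spark scans limited to the time range a query actually touches, which matters at 13TB.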
On Thu, 17 Nov 2016 at 07:21 Jonathan Haddad wrote:
> If you're only doing this for spark, you'll be much better off using
> parquet and HDFS or S3. While you *can* do analytics with cassandra, it's
> not all that great at it.
On Thu, Nov 17, 2016 at 6:05 AM Joe Olson wrote:
> I received a grant to do some analysis on netflow data (Local IP address, Local
> Port, Remote IP address, Remote Port, time, # of packets, etc) using Cassandra
> and Spark. The de-normalized data set is about 13TB out the door. I plan on
> using 9 Cassandra nodes (replication factor=3) to store the
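Back-of-the-envelope arithmetic on the numbers in the thread (13TB de-normalized, replication factor 3, 9 nodes) shows what the proposed cluster would carry per node; this ignores compression, compaction headroom, and indexes, so it is a lower bound on disk needed:

```python
# Rough per-node storage from the numbers quoted above.
raw_tb = 13        # de-normalized data set, TB
rf = 3             # replication factor
nodes = 9          # Cassandra nodes

total_tb = raw_tb * rf          # every row is stored on 3 nodes
per_node_tb = total_tb / nodes  # assuming even token distribution

print(f"total on disk: {total_tb} TB, per node: {per_node_tb:.2f} TB")
# → total on disk: 39 TB, per node: 4.33 TB
```

At ~4.3TB of data per node before any overhead, the cluster is on the dense side, which is part of why the replies steer toward Parquet on S3 instead.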