Spark scales out to as many nodes as you want and can be co-located with the data nodes. sstableloader won't be as performant for larger datasets. Although it can be run in parallel on different nodes, I don't believe it to be as fault tolerant.
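For the Spark route, a minimal PySpark sketch of a Hive-to-Cassandra copy could look roughly like the following. It assumes the DataStax Spark Cassandra Connector is on the job's classpath and that the target Cassandra table already exists. The host name, Hive table, keyspace, and table names are placeholders I made up for illustration, not anything from your setup.

```python
def cassandra_write_options(keyspace, table):
    """Build the options the connector's DataFrame source uses to find the target table."""
    return {"keyspace": keyspace, "table": table}


def build_session(cassandra_host):
    """Create a Hive-enabled SparkSession pointed at a Cassandra contact point."""
    # Imported lazily so the helpers in this file can be read without Spark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("hive-to-cassandra")
        .config("spark.cassandra.connection.host", cassandra_host)
        .enableHiveSupport()
        .getOrCreate()
    )


def copy_hive_table_to_cassandra(spark, hive_table, keyspace, table):
    """Read a Hive table and append its rows to an existing Cassandra table."""
    df = spark.table(hive_table)
    (
        df.write
        .format("org.apache.spark.sql.cassandra")
        .options(**cassandra_write_options(keyspace, table))
        .mode("append")
        .save()
    )


# Example driver code (all names below are hypothetical placeholders):
#   spark = build_session("cassandra-host")
#   copy_hive_table_to_cassandra(spark, "analytics.events", "analytics", "events")
#   spark.stop()
```

Since the load happens every few hours, the same job could be scheduled per batch, with the Hive read filtered to the newest partition before writing.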
If you have to do it continuously I would even think about leveraging Kafka as the transport layer and using Kafka Connect. It brings other tooling to get data into Cassandra from a variety of sources.

Rahul

On Aug 6, 2018, 3:16 PM -0400, srimugunthan dhandapani <srimugunthan.dhandap...@gmail.com>, wrote:
> Hi all,
> We have data that gets filled into Hive/ presto every few hours.
> We want that data to be transferred to cassandra tables.
> What are some of the high performance ETL options for transferring data
> between hive or presto into cassandra?
>
> Also does anybody have any performance numbers comparing
> - loading data from S3 to cassandra using SStableloader
> - and loading data from S3 to cassandra using other means (like spark-api)?
>
> Thanks,
> mugunthan