Interesting Results - Cassandra Benchmarks over Time Series Data for IoT Use Case I

2018-05-17 Thread Arbab Khalil
 We have been exploring IoT-specific C* schema design over the past few
months and want to share our benchmarking results with the wider
community, both a) to bring some rigor to the discussion, and b) to start a
conversation about better designs.

First, the use case: we have time series of data from devices on several
sites, where each device (with a unique dev_id) can have several sensors
attached to it. Most queries, however, are both time-limited and over a
range of dev_ids, even for a single sensor (multi-sensor joins are a whole
different beast for another day!). We want a schema where a query completes
in time linear in the query ranges for both devices and time, and largely
independent of the total data size.
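To make that query shape concrete, here is a minimal PySpark sketch of the
kind of read we want to serve quickly. This is only an illustration: the
keyspace/table names, device ids, time bounds, and the temperature column are
made-up placeholders, not our actual schema.

# Hypothetical sketch of the target query shape (placeholder names/values).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("iot-query-shape").getOrCreate()

readings = spark.read.format("org.apache.spark.sql.cassandra")\
    .options(keyspace="iot", table="sensor_readings").load()

result = readings.filter(
    col("dev_id").isin([101, 102, 103]) &           # a list/range of devices
    (col("rec_time") >= "2018-05-01 00:00:00") &    # bounded in time ...
    (col("rec_time") < "2018-05-08 00:00:00")       # ... on both ends
).select("dev_id", "rec_time", "temperature")       # a single sensor column

Ideally the time to compute result should grow with the number of devices and
the length of the time window, not with the size of the whole table.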


So we explored several different primary key definitions, learning from the
best practices communicated on this mailing list and around the interwebs.
Details about the setup (Spark over C*) and the schema are in a companion
blog/site [1]; here we just mention the primary keys and the key points.


   1. PRIMARY KEY (dev_id, day, rec_time)
   2. PRIMARY KEY ((dev_id, rec_time))
   3. PRIMARY KEY (day, dev_id, rec_time)
   4. PRIMARY KEY ((day, dev_id), rec_time)
   5. PRIMARY KEY ((dev_id, day), rec_time)
   6. Combinations of the above with a year field added to the schema.
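To make a couple of these concrete, here is a minimal sketch of what options
1 and 5 look like as full table definitions. It uses the DataStax Python
driver purely for illustration (our benchmarks ran through Spark), and the
host, keyspace, and non-key columns are made up.

# Hypothetical sketch only: host, keyspace, and sensor columns are placeholders.
# Assumes an 'iot' keyspace already exists.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("iot")

# Option 1: dev_id alone is the partition key; day and rec_time are clustering
# columns, so each device accumulates one ever-growing partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS readings_by_dev (
        dev_id int, day date, rec_time timestamp, temperature double,
        PRIMARY KEY (dev_id, day, rec_time)
    )""")

# Option 5: (dev_id, day) is a composite partition key, so partitions stay
# bounded to one device-day, at the cost of touching more partitions per query.
session.execute("""
    CREATE TABLE IF NOT EXISTS readings_by_dev_day (
        dev_id int, day date, rec_time timestamp, temperature double,
        PRIMARY KEY ((dev_id, day), rec_time)
    )""")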


The main takeaway (again, please read through the details at [1]) is that
we really don't have a single schema that answers the use case above without
some drawback. While ((day, dev_id), rec_time) gives a response time that is
constant with respect to the query ranges, that time depends entirely on the
total data size (it is effectively a full scan). On the other hand, (dev_id,
day, rec_time) and its counterpart (day, dev_id, rec_time) give acceptable
query times, but the first suffers from very large partitions (everything for
a device lands in one partition) and the latter from write hotspots (all
writes for a day go to one partition).

We also observed that a multi-field partition key allows for fast querying
only if "=" is used on its fields going left to right. If an IN() (for
specifying e.g. a set of days or a list of devices) is used once in that
order, then any further use of IN() removes the benefit (i.e. we get a near
full table scan).
Another useful learning was that using IN() to query for days is less
effective than a range query.
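As an illustration of those two access patterns, here is a sketch in the same
PySpark style as above, over a hypothetical table keyed by ((day, dev_id),
rec_time); the table name, devices, and dates are placeholders, and the
comments only restate what we observed in our runs.

# Placeholder names/values; table assumed keyed by ((day, dev_id), rec_time).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("iot-in-vs-range").getOrCreate()

df = spark.read.format("org.apache.spark.sql.cassandra")\
    .options(keyspace="iot", table="readings_by_day_dev").load()

# Fast pattern in our runs: "=" on the leading partition key field, a single
# IN()-style predicate on the next field, and a range on the clustering column.
fast = df.filter(
    (col("day") == "2018-05-01") &
    col("dev_id").isin([101, 102, 103]) &
    (col("rec_time") >= "2018-05-01 00:00:00") &
    (col("rec_time") < "2018-05-02 00:00:00"))

# Slow pattern in our runs: stacking IN()-style predicates on both partition
# key fields degraded towards a near full table scan; a span of days was
# better served by a range (or one query per day) than by an IN() list.
slow = df.filter(
    col("day").isin(["2018-05-01", "2018-05-02"]) &
    col("dev_id").isin([101, 102, 103]))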

Currently, it seems we are in a bind: should we use a different data store
for our use case (which seems quite typical for IoT)? Something like HDFS or
Parquet? We would love feedback on the benchmarking results and on how we
can possibly improve this and share it widely.
[1] Cassandra Benchmarks over Time Series Data for IoT Use Case
<https://sites.google.com/an10.io/timeseries-results>


-- 
Regards,
Arbab Khalil
Software Design Engineer


Re: Suggestions for migrating data from cassandra

2018-05-15 Thread Arbab Khalil
Both C* and MySQL are supported in Spark. For C*, the
datastax:spark-cassandra-connector package is needed. Reading and writing
data in Spark is very simple.
To read a C* table, use:

df = spark.read.format("org.apache.spark.sql.cassandra")\
    .options(keyspace = 'test', table = 'test_table').load()

and to write the data to a MySQL table, use:

df.write.format('jdbc').options(
  url='jdbc:mysql://localhost/database_name',
  driver='com.mysql.jdbc.Driver',
  dbtable='DestinationTableName',
  user='your_user_name',
  password='your_password').mode('append').save()

When submitting the Spark <http://spark.apache.org/> program, use the
following command:

bin/spark-submit --packages datastax:spark-cassandra-connector:2.0.7-s_2.11 \
    --jars external/mysql-connector-java-5.1.40-bin.jar \
    /path_to_your_program/spark_database.py
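
Putting the pieces together, a minimal end-to-end sketch of what
spark_database.py could look like follows; the Cassandra host, keyspace/table
names, and MySQL credentials are placeholders to adapt to your cluster, and
the snippets above assume the spark session created here.

# Hypothetical end-to-end migration sketch; hosts, names, and credentials
# are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .appName("cassandra-to-mysql")\
    .config("spark.cassandra.connection.host", "127.0.0.1")\
    .getOrCreate()

# Read the whole C* table (the connector splits the scan by token ranges).
df = spark.read.format("org.apache.spark.sql.cassandra")\
    .options(keyspace = 'test', table = 'test_table').load()

# Append the rows into the destination MySQL table over JDBC.
df.write.format('jdbc').options(
  url='jdbc:mysql://localhost/database_name',
  driver='com.mysql.jdbc.Driver',
  dbtable='DestinationTableName',
  user='your_user_name',
  password='your_password').mode('append').save()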

This should solve your problem and save you time.


On Tue, May 15, 2018 at 11:04 PM, kurt greaves <k...@instaclustr.com> wrote:

> COPY might work but over hundreds of gigabytes you'll probably run into
> issues if you're overloaded. If you've got access to Spark that would be an
> efficient way to pull down an entire table and dump it out using the
> spark-cassandra-connector.
>
> On 15 May 2018 at 10:59, Jing Meng <self.rel...@gmail.com> wrote:
>
>> Hi guys, for some historical reason, our cassandra cluster is currently
>> overloaded and operating it has somehow become a nightmare. Anyway,
>> (sadly) we're planning to migrate cassandra data back to mysql...
>>
>> So we're not quite clear on how to migrate the historical data from
>> cassandra.
>>
>> As far as I know there is the COPY command, but I wonder if it works in a
>> production env where more than hundreds of gigabytes of data are present.
>> And, if it does, would it impact server performance significantly?
>>
>> Apart from that, I know the spark-connector can be used to scan data from a
>> c* cluster, but I'm not that familiar with spark and still not sure whether
>> writing data to a mysql database can be done naturally with the spark-connector.
>>
>> Are there any suggestions/best practices/reading materials for doing this?
>>
>> Thanks!
>>
>
>


-- 
Regards,
Arbab Khalil
Software Design Engineer