Re: Suggestions for migrating data from cassandra

2018-05-15 Thread Joseph Arriola
Hi Jing.

How much data do you need to migrate, in terms of volume and number of tables?

With Spark you could do the following (a sketch of the second option follows this list):

   - Read the data and export it directly to MySQL.
   - Read the data, export it to CSV files, and then load those files into MySQL.
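
For the second option, a minimal PySpark sketch could look like this (assuming an existing SparkSession named spark with the spark-cassandra-connector available; the keyspace, table, and output path below are placeholders):

df = spark.read.format("org.apache.spark.sql.cassandra") \
    .options(keyspace='my_keyspace', table='my_table') \
    .load()

# Dump the table to CSV files that can later be loaded into MySQL
df.write.option('header', 'true').csv('/data/export/my_table')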


You could also take other paths, such as:

   - StreamSets
   - Talend Open Studio
   - Kafka Streams






Re: Suggestions for migrating data from cassandra

2018-05-15 Thread Arbab Khalil
Spark supports both C* and MySQL. For C*, the datastax:spark-cassandra-connector
package is needed. Reading and writing data in Spark is very simple.
To read a C* table, use:

df = spark.read.format("org.apache.spark.sql.cassandra") \
    .options(keyspace='test', table='test_table').load()

and to write data to a MySQL table, use:

df.write.format('jdbc').options(
  url='jdbc:mysql://localhost/database_name',
  driver='com.mysql.jdbc.Driver',
  dbtable='DestinationTableName',
  user='your_user_name',
  password='your_password').mode('append').save()
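
Note that the snippets above assume a SparkSession named spark already exists with the Cassandra contact point configured. For a standalone script, a minimal setup sketch (the app name and host are placeholders) would be:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('cassandra_to_mysql') \
    .config('spark.cassandra.connection.host', '10.0.0.1') \
    .getOrCreate()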

When submitting the Spark program, use the following command:

bin/spark-submit --packages datastax:spark-cassandra-connector:2.0.7-s_2.11 \
    --jars external/mysql-connector-java-5.1.40-bin.jar \
    /path_to_your_program/spark_database.py

This should solve your problem and save you time.




-- 
Regards,
Arbab Khalil
Software Design Engineer


Re: Suggestions for migrating data from cassandra

2018-05-15 Thread kurt greaves
COPY might work, but over hundreds of gigabytes you'll probably run into
issues if you're overloaded. If you've got access to Spark, that would be an
efficient way to pull down an entire table and dump it out using the
spark-cassandra-connector.



Re: Suggestions for migrating data from cassandra

2018-05-15 Thread Michael Dykman
I don't know of any projects out there addressing this, but I advise you to
study LOAD DATA INFILE in the MySQL manual for your target version. It
basically describes a CSV format, where a given file represents a subset of
data for a specific table. It is far and away the fastest method for loading
huge amounts of data into MySQL non-transactionally.

On the downside, you are likely going to have to author your own Cassandra
client tool to generate those files.
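
As a rough illustration, such a tool could be a small script built on the DataStax Python driver that pages through the table and writes CSV (the contact point, keyspace, table, column names, and file path below are placeholders):

import csv
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['10.0.0.1'])
session = cluster.connect('my_keyspace')

# fetch_size keeps memory bounded by paging through the table
query = SimpleStatement("SELECT id, ts, value FROM my_table", fetch_size=5000)

with open('/data/my_table.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in session.execute(query):
        writer.writerow([row.id, row.ts, row.value])

# On the MySQL side, something along the lines of:
#   LOAD DATA LOCAL INFILE '/data/my_table.csv'
#   INTO TABLE my_table
#   FIELDS TERMINATED BY ',' ENCLOSED BY '"';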



Suggestions for migrating data from cassandra

2018-05-15 Thread Jing Meng
Hi guys, for some historical reasons our Cassandra cluster is currently
overloaded, and operating it has somehow become a nightmare. Anyway,
(sadly) we're planning to migrate the Cassandra data back to MySQL...

So we're not quite clear on how to migrate the historical data from
Cassandra.

As far as I know there is the COPY command, but I wonder whether it works in
a production environment where hundreds of gigabytes of data are present.
And if it does, would it impact server performance significantly?

Apart from that, I know the spark-cassandra-connector can be used to scan
data from the C* cluster, but I'm not that familiar with Spark and I'm still
not sure whether writing data to a MySQL database can be done naturally with
the connector.

Are there any suggestions/best practices/reading materials for doing this?

Thanks!


Re: Academic paper about Cassandra database compaction

2018-05-15 Thread Jeff Jirsa
On Mon, May 14, 2018 at 11:04 AM, Lucas Benevides <
lu...@maurobenevides.com.br> wrote:

> Thank you Jeff Jirsa for your comments,
>
> How can we do this:  "fix this by not scheduling the major compaction
> until we know all of the sstables in the window are available to be
> compacted"?
>
>
It would require a change to TWCS itself. Right here, where we grab the
not-currently-compacting sstables (
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/compaction/TimeWindowCompactionStrategy.java#L110
), we'd also grab the compacting set, and if the candidate sstables for the
task overlapped with the same window as the compacting sstables (respecting
the repaired/unrepaired/pending-repaired sets), then we'd skip compacting until
the previous compactions had finished.


> About the column-family schema, I had to customize the cassandra-stress
> tool so that it could create a reasonable number of rows per partition. In
> its default behavior it keeps creating repeated clustering keys for each
> partition, so most data gets updated instead of inserted.
>

A similar customization may be useful for creating partitions that are
narrowly bucketed into fixed-size time windows (which is a common and
typical schema in IoT use cases).
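
As a rough sketch of that idea (the window length, table, and column names here are only illustrative, not the stress tool's actual schema), the extra partition key component can simply be the write time truncated to the TWCS window:

WINDOW_SECONDS = 24 * 60 * 60  # e.g. one-day TWCS windows

def time_bucket(epoch_seconds):
    # Truncate a write time to the start of its TWCS window
    return epoch_seconds - (epoch_seconds % WINDOW_SECONDS)

# The partition key then becomes (sensor_id, bucket), so each partition
# maps onto exactly one TWCS window, e.g.:
#   INSERT INTO readings (sensor_id, bucket, ts, value) VALUES (?, ?, ?, ?)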

- Jeff





>
> Lucas B. Dias
>
> 2018-05-14 14:03 GMT-03:00 Jeff Jirsa :
>
>> Interesting!
>>
>> I suspect I know what causes the increased disk usage in TWCS, and it's a
>> solvable problem. The problem is roughly this:
>> - Window 1 has sstables 1, 2, 3, 4, 5, 6
>> - We start compacting 1, 2, 3, 4 (using STCS-in-TWCS for the first window)
>> - The TWCS window rolls over
>> - We flush (sstable 7) and trigger the TWCS window major compaction,
>> which starts compacting 5, 6, 7 + any other sstable from that window
>> - If the first compaction (1, 2, 3, 4) has finished by the time sstable 7 is
>> flushed, we'll include its result in that compaction; if it hasn't, we'll
>> have to do the major compaction twice to guarantee we have exactly one
>> sstable per window, which will temporarily increase disk space
>>
>> We can likely fix this by not scheduling the major compaction until we
>> know all of the sstables in the window are available to be compacted.
>>
>> Also, your data model is probably typical but not well suited for time
>> series cases - if you find my 2016 Cassandra Summit TWCS talk (it's on
>> YouTube), I mention aligning partition keys to TWCS windows, which involves
>> adding a second component to the partition key. This is hugely important in
>> terms of making sure TWCS data expires quickly and avoiding having to read
>> from more than one TWCS window at a time.
>>
>>
>> - Jeff
>>
>>
>>
>> On Mon, May 14, 2018 at 7:12 AM, Lucas Benevides <
>> lu...@maurobenevides.com.br> wrote:
>>
>>> Dear community,
>>>
>>> I want to tell you about my paper, published at a conference in March.
>>> The title is "NoSQL Database Performance Tuning for IoT Data - Cassandra
>>> Case Study" and it is available (not for free) at
>>> http://www.scitepress.org/DigitalLibrary/Link.aspx?doi=10.5220/0006782702770284
>>>
>>> TWCS is used and compared with DTCS.
>>>
>>> I hope you can download it; unfortunately, I cannot send copies as the
>>> publisher holds the copyright.
>>>
>>> Lucas B. Dias
>>>
>>>
>>>
>>
>