Re: Suggestions for migrating data from cassandra

2018-05-16 Thread Jing Meng
We would try migrating some small keyspaces (with several gigabytes of data
across a DC) first, but ultimately migrations for several large keyspaces,
with data sizes ranging from 100G to 5T and some tables holding >1T of data,
would be scheduled too.

As for StreamSets/Talend, I personally doubt they would be appropriate at our
company, as manpower for this migration is pretty limited.

Arbab's answer actually resolved my initial concern; I'm now playing with the
spark-connector.

Thanks for all your replies, much appreciated!



Re: Suggestions for migrating data from cassandra

2018-05-15 Thread Joseph Arriola
Hi Jing.

How much data do you need to migrate, in terms of volume and number of tables?

With Spark you could do the following:

   - Read the data and export it directly to MySQL.
   - Read the data, export it to CSV files, and then load those into MySQL (a rough sketch of this follows below).
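For the CSV route, a minimal sketch with the spark-cassandra-connector could look
like this (keyspace, table, and output path are placeholders, and it assumes a
SparkSession that can already reach the cluster):

# read the Cassandra table into a DataFrame
df = spark.read.format("org.apache.spark.sql.cassandra") \
    .options(keyspace='my_keyspace', table='my_table') \
    .load()

# dump it to CSV files, which can then be bulk-loaded into MySQL
df.write.csv('/data/export/my_table', header=True, mode='overwrite')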


You could also use other paths such as:

   - StreamSets
   - Talend Open Studio
   - Kafka Streams.






Re: Suggestions for migrating data from cassandra

2018-05-15 Thread Arbab Khalil
Both C* and MySQL support are available in Spark. For C*, the
datastax:spark-cassandra-connector package is needed. Reading and writing data
in Spark is very simple.
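Both snippets below assume a SparkSession that can reach the Cassandra cluster;
a minimal setup might look like this (the host is a placeholder):

from pyspark.sql import SparkSession

# build a session pointing at the Cassandra cluster
spark = SparkSession.builder \
    .appName('cassandra_to_mysql') \
    .config('spark.cassandra.connection.host', '10.0.0.1') \
    .getOrCreate()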
To read a C* table, use:

# load the Cassandra table into a DataFrame
df = spark.read.format("org.apache.spark.sql.cassandra") \
    .options(keyspace='test', table='test_table') \
    .load()

and to write the data to a MySQL table, use:

# write the DataFrame out over JDBC
df.write.format('jdbc').options(
    url='jdbc:mysql://localhost/database_name',
    driver='com.mysql.jdbc.Driver',
    dbtable='DestinationTableName',
    user='your_user_name',
    password='your_password'
).mode('append').save()

When submitting the Spark program, use the following command:

bin/spark-submit \
    --packages datastax:spark-cassandra-connector:2.0.7-s_2.11 \
    --jars external/mysql-connector-java-5.1.40-bin.jar \
    /path_to_your_program/spark_database.py

It should solve your problem and save you time.




-- 
Regards,
Arbab Khalil
Software Design Engineer


Re: Suggestions for migrating data from cassandra

2018-05-15 Thread kurt greaves
COPY might work, but over hundreds of gigabytes you'll probably run into
issues if you're overloaded. If you've got access to Spark, that would be an
efficient way to pull down an entire table and dump it out using the
spark-cassandra-connector.
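If load on the cluster is the worry, the connector's read settings can be
dialled down when the session is built; a rough example (the values are
illustrative, not recommendations):

from pyspark.sql import SparkSession

# smaller splits and fetch sizes spread the reads out and keep the
# per-request load on an already busy cluster modest
spark = SparkSession.builder \
    .config('spark.cassandra.connection.host', '10.0.0.1') \
    .config('spark.cassandra.input.split.size_in_mb', '32') \
    .config('spark.cassandra.input.fetch.size_in_rows', '500') \
    .getOrCreate()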



Re: Suggestions for migrating data from cassandra

2018-05-15 Thread Michael Dykman
I don't know that there are any projects out there addressing this, but I
advise you to study LOAD ... INFILE in the MySQL manual specific to your
target version. It basically describes a CSV format where a given file
represents a subset of data for a specific table. It is far and away the
fastest method for loading huge amounts of data into MySQL
non-transactionally.

On the downside, you are likely going to have to author your own Cassandra
client tool to generate those files.
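A bare-bones sketch of such a tool, using the Python cassandra-driver and the
standard csv module (keyspace, table, column names, and file paths are all
placeholders):

import csv
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['10.0.0.1'])
session = cluster.connect('my_keyspace')

# page through the table so the whole result never sits in memory
query = SimpleStatement('SELECT id, col_a, col_b FROM my_table', fetch_size=1000)

with open('my_table.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in session.execute(query):
        writer.writerow([row.id, row.col_a, row.col_b])

cluster.shutdown()

# the resulting file can then be bulk-loaded on the MySQL side, e.g.:
#   LOAD DATA LOCAL INFILE 'my_table.csv' INTO TABLE my_table
#   FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
#   LINES TERMINATED BY '\n';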



Suggestions for migrating data from cassandra

2018-05-15 Thread Jing Meng
Hi guys, for some historical reasons our cassandra cluster is currently
overloaded, and operating it has become something of a nightmare. Anyway,
(sadly) we're planning to migrate the cassandra data back to mysql...

We're not quite clear on how to migrate the historical data from cassandra.

I know there is the COPY command, but I wonder whether it works in a production
environment where hundreds of gigabytes of data are present. And if it does,
would it impact server performance significantly?

Apart from that, I know the spark-connector can be used to scan data from a c*
cluster, but I'm not that familiar with spark and still not sure whether
writing data to a mysql database can be done naturally with the spark-connector.

Are there any suggestions/best practices/reading materials for doing this?

Thanks!