Re: migration to a new model

2014-06-05 Thread Marcelo Elias Del Valle
Michael,

I will try to test it by tomorrow and will let you know all the
results.

Thanks a lot!

Best regards,
Marcelo.


2014-06-04 22:28 GMT-03:00 Laing, Michael michael.la...@nytimes.com:

 BTW you might want to put a LIMIT clause on your SELECT for testing. -ml


 On Wed, Jun 4, 2014 at 6:04 PM, Laing, Michael michael.la...@nytimes.com
 wrote:

 Marcelo,

 Here is a link to the preview of the python fast copy program:

 https://gist.github.com/michaelplaing/37d89c8f5f09ae779e47

 It will copy a table from one cluster to another, with some
 transformation; the source and destination can be the same cluster.

 It has 3 main throttles to experiment with:

1. fetch_size: size of source pages in rows
2. worker_count: number of worker subprocesses
3. concurrency: number of async callback chains per worker subprocess

 It is easy to overrun Cassandra and the python driver, so I recommend
 starting with the defaults: fetch_size: 1000; worker_count: 2; concurrency:
 10.
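
 Inside a single worker, fetch_size and concurrency boil down to something
 like the sketch below - a simplified stand-in for what the script does (the
 script uses explicit async callback chains; this sketch uses the driver's
 execute_concurrent_with_args helper instead, and the contact point and
 keyspace are placeholders):

 from cassandra.cluster import Cluster
 from cassandra.concurrent import execute_concurrent_with_args
 from cassandra.query import SimpleStatement

 cluster = Cluster(['127.0.0.1'])          # placeholder contact point
 session = cluster.connect('mykeyspace')   # placeholder keyspace

 # fetch_size: how many source rows the driver pulls per page
 select = SimpleStatement(
     "SELECT name, value, entity_id FROM entitylookup", fetch_size=1000)

 # every column is part of the destination key, so the write is an INSERT
 insert = session.prepare(
     "INSERT INTO entity_lookup (name, value, entity_id) VALUES (?, ?, ?)")

 batch = []
 for row in session.execute(select):        # pages through the source table
     batch.append((row.name, row.value, row.entity_id))
     if len(batch) >= 1000:
         # concurrency: how many async writes are kept in flight at once
         execute_concurrent_with_args(session, insert, batch, concurrency=10)
         batch = []
 if batch:
     execute_concurrent_with_args(session, insert, batch, concurrency=10)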

 Additionally there are switches to set 'policies' by source and
 destination: retry (downgrade consistency), dc_aware, and token_aware.
 retry is useful if you are getting timeouts. For the others YMMV.
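
 Those switches correspond roughly to the python driver's policy classes;
 wired up by hand it would look something like this (not the script's exact
 code - 'DC1' and the contact point are placeholders):

 from cassandra.cluster import Cluster
 from cassandra.policies import (DCAwareRoundRobinPolicy,
                                 DowngradingConsistencyRetryPolicy,
                                 TokenAwarePolicy)

 cluster = Cluster(
     contact_points=['10.0.0.1'],
     load_balancing_policy=TokenAwarePolicy(
         DCAwareRoundRobinPolicy(local_dc='DC1')),   # dc_aware + token_aware
     default_retry_policy=DowngradingConsistencyRetryPolicy(),  # retry
 )
 session = cluster.connect()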

 To use it you need to define the SELECT and UPDATE CQL statements as well
 as the 'map_fields' method.
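
 Combined with the SELECT/INSERT pair from the earlier sketch, 'map_fields'
 is just the per-row transform; for this table pair an identity mapping
 would do (the exact signature the gist expects may differ):

 def map_fields(row):
     # any per-row transformation goes here; identity for
     # entitylookup -> entity_lookup since the columns are the same
     return (row.name, row.value, row.entity_id)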

 The worker subprocesses divide up the token range among themselves and
 proceed quasi-independently. Each worker opens a connection to each cluster
 and the driver sets up connection pools to the nodes in the cluster. In any
 case there are a lot of processes, threads, and callbacks going at once, so
 it is fun to watch.
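
 A minimal sketch of the token-range division, assuming the default
 Murmur3Partitioner (token range -2**63 .. 2**63 - 1):

 MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1

 def worker_ranges(worker_count):
     span = (MAX_TOKEN - MIN_TOKEN) // worker_count
     for i in range(worker_count):
         lo = MIN_TOKEN + i * span
         hi = MAX_TOKEN if i == worker_count - 1 else lo + span
         yield lo, hi   # the first slice should treat lo as inclusive

 # each worker then pages through its slice with a query like:
 #   SELECT name, value, entity_id FROM entitylookup
 #   WHERE token(name) > ? AND token(name) <= ?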

 On my regional cluster of small nodes in AWS I got about 3000 rows per
 second transferred after things warmed up a bit - each row about 6 KB.

 ml


 On Wed, Jun 4, 2014 at 11:49 AM, Laing, Michael 
 michael.la...@nytimes.com wrote:

 OK Marcelo, I'll work on it today. -ml


 On Tue, Jun 3, 2014 at 8:24 PM, Marcelo Elias Del Valle 
 marc...@s1mbi0se.com.br wrote:

 Hi Michael,

 For sure I would be interested in this program!

 I am new both to Python and to CQL. I started creating this copier,
 but was having problems with timeouts. Alex solved my problem here on the
 list, but I think I will still have a lot of trouble making the copy work
 well.

 I open sourced my version here:
 https://github.com/s1mbi0se/cql_record_processor

 Just in case it's useful for anything.

 However, I saw the CQL driver has support for concurrency itself, and
 something made by someone who knows the Python CQL driver better would be
 very helpful.

 My two servers today are at OVH (ovh.com); we have servers at AWS, but in
 several cases we prefer other hosts. Both servers have SSD and 64 GB RAM,
 and I could use the script as a benchmark for you if you want. Besides, we
 have some bigger clusters; I could run it on those just to test the speed
 if that would help.

 Regards
 Marcelo.


 2014-06-03 11:40 GMT-03:00 Laing, Michael michael.la...@nytimes.com:

 Hi Marcelo,

 I could create a fast copy program by repurposing some python apps
 that I am using for benchmarking the python driver - do you still need 
 this?

 With high levels of concurrency and multiple subprocess workers, based
 on my current actual benchmarks, I think I can get well over 1,000
 rows/second on my Mac and significantly more in AWS. I'm using
 variable-size rows averaging 5 KB.

 This would be the initial version of a piece of the benchmark suite we
 will release as part of our nyt⨍aбrik project on 21 June for my
 Cassandra Day NYC talk re the python driver.

 ml


 On Mon, Jun 2, 2014 at 2:15 PM, Marcelo Elias Del Valle 
 marc...@s1mbi0se.com.br wrote:

 Hi Jens,

 Thanks for trying to help.

 Indeed, I know I can't do it using just CQL. But what would you use
 to migrate data manually? I tried to create a Python program using auto
 paging, but I am getting timeouts. I also tried Hive, but had no success.
 I only have two nodes and less than 200 GB in this cluster; any simple
 way to extract the data quickly would be good enough for me.

 Best regards,
 Marcelo.



 2014-06-02 15:08 GMT-03:00 Jens Rantil jens.ran...@tink.se:

 Hi Marcelo,

 Looks like you can't do this without migrating your data manually:
 https://stackoverflow.com/questions/18421668/alter-cassandra-column-family-primary-key-using-cassandra-cli-or-cql

 Cheers,
 Jens


 On Mon, Jun 2, 2014 at 7:48 PM, Marcelo Elias Del Valle 
 marc...@s1mbi0se.com.br wrote:

 Hi,

 I have some CQL CFs in a 2-node Cassandra 2.0.8 cluster.

 I realized I created my column family with the wrong partition key.
 Instead of:

 CREATE TABLE IF NOT EXISTS entity_lookup (
   name varchar,
   value varchar,
   entity_id uuid,
   PRIMARY KEY ((name, value), entity_id))
 WITH caching = 'all';

 I used:

 CREATE TABLE IF NOT EXISTS entitylookup (
   name varchar,
   value varchar,
   entity_id uuid,
   PRIMARY KEY (name, value, entity_id))
 WITH caching = 'all';


 Now I need to migrate the data from the second CF to the first one.
 I am using DataStax Community Edition.

 What would be the best way to convert data from one CF to the other?

 Best regards,
 Marcelo.

Re: migration to a new model

2014-06-04 Thread Laing, Michael
OK Marcelo, I'll work on it today. -ml


On Tue, Jun 3, 2014 at 8:24 PM, Marcelo Elias Del Valle 
marc...@s1mbi0se.com.br wrote:

 Hi Michael,

 For sure I would be interested in this program!

 I am new both to Python and to CQL. I started creating this copier, but
 was having problems with timeouts. Alex solved my problem here on the list,
 but I think I will still have a lot of trouble making the copy work well.

 I open sourced my version here:
 https://github.com/s1mbi0se/cql_record_processor

 Just in case it's useful for anything.

 However, I saw the CQL driver has support for concurrency itself, and
 something made by someone who knows the Python CQL driver better would be
 very helpful.

 My two servers today are at OVH (ovh.com); we have servers at AWS, but in
 several cases we prefer other hosts. Both servers have SSD and 64 GB RAM,
 and I could use the script as a benchmark for you if you want. Besides, we
 have some bigger clusters; I could run it on those just to test the speed
 if that would help.

 Regards
 Marcelo.


 2014-06-03 11:40 GMT-03:00 Laing, Michael michael.la...@nytimes.com:

 Hi Marcelo,

 I could create a fast copy program by repurposing some python apps that I
 am using for benchmarking the python driver - do you still need this?

 With high levels of concurrency and multiple subprocess workers, based on
 my current actual benchmarks, I think I can get well over 1,000 rows/second
 on my Mac and significantly more in AWS. I'm using variable-size rows
 averaging 5 KB.

 This would be the initial version of a piece of the benchmark suite we
 will release as part of our nyt⨍aбrik project on 21 June for my
 Cassandra Day NYC talk re the python driver.

 ml


 On Mon, Jun 2, 2014 at 2:15 PM, Marcelo Elias Del Valle 
 marc...@s1mbi0se.com.br wrote:

 Hi Jens,

 Thanks for trying to help.

 Indeed, I know I can't do it using just CQL. But what would you use to
 migrate data manually? I tried to create a Python program using auto
 paging, but I am getting timeouts. I also tried Hive, but had no success.
 I only have two nodes and less than 200 GB in this cluster; any simple
 way to extract the data quickly would be good enough for me.

 Best regards,
 Marcelo.



 2014-06-02 15:08 GMT-03:00 Jens Rantil jens.ran...@tink.se:

 Hi Marcelo,

 Looks like you can't do this without migrating your data manually:
 https://stackoverflow.com/questions/18421668/alter-cassandra-column-family-primary-key-using-cassandra-cli-or-cql

 Cheers,
 Jens


 On Mon, Jun 2, 2014 at 7:48 PM, Marcelo Elias Del Valle 
 marc...@s1mbi0se.com.br wrote:

 Hi,

 I have some CQL CFs in a 2-node Cassandra 2.0.8 cluster.

 I realized I created my column family with the wrong partition key.
 Instead of:

 CREATE TABLE IF NOT EXISTS entity_lookup (
   name varchar,
   value varchar,
   entity_id uuid,
   PRIMARY KEY ((name, value), entity_id))
 WITH caching = 'all';

 I used:

 CREATE TABLE IF NOT EXISTS entitylookup (
   name varchar,
   value varchar,
   entity_id uuid,
   PRIMARY KEY (name, value, entity_id))
 WITH caching = 'all';


 Now I need to migrate the data from the second CF to the first one.
 I am using DataStax Community Edition.

 What would be the best way to convert data from one CF to the other?

 Best regards,
 Marcelo.








Re: migration to a new model

2014-06-04 Thread Laing, Michael
BTW you might want to put a LIMIT clause on your SELECT for testing. -ml


On Wed, Jun 4, 2014 at 6:04 PM, Laing, Michael michael.la...@nytimes.com
wrote:

 Marcelo,

 Here is a link to the preview of the python fast copy program:

 https://gist.github.com/michaelplaing/37d89c8f5f09ae779e47

 It will copy a table from one cluster to another, with some transformation;
 the source and destination can be the same cluster.

 It has 3 main throttles to experiment with:

1. fetch_size: size of source pages in rows
2. worker_count: number of worker subprocesses
3. concurrency: number of async callback chains per worker subprocess

 It is easy to overrun Cassandra and the python driver, so I recommend
 starting with the defaults: fetch_size: 1000; worker_count: 2; concurrency:
 10.

 Additionally there are switches to set 'policies' by source and
 destination: retry (downgrade consistency), dc_aware, and token_aware.
 retry is useful if you are getting timeouts. For the others YMMV.

 To use it you need to define the SELECT and UPDATE CQL statements as well
 as the 'map_fields' method.

 The worker subprocesses divide up the token range among themselves and
 proceed quasi-independently. Each worker opens a connection to each cluster
 and the driver sets up connection pools to the nodes in the cluster. In any
 case there are a lot of processes, threads, and callbacks going at once, so
 it is fun to watch.

 On my regional cluster of small nodes in AWS I got about 3000 rows per
 second transferred after things warmed up a bit - each row about 6 KB.

 ml


 On Wed, Jun 4, 2014 at 11:49 AM, Laing, Michael michael.la...@nytimes.com
  wrote:

 OK Marcelo, I'll work on it today. -ml


 On Tue, Jun 3, 2014 at 8:24 PM, Marcelo Elias Del Valle 
 marc...@s1mbi0se.com.br wrote:

 Hi Michael,

 For sure I would be interested in this program!

 I am new both to Python and to CQL. I started creating this copier, but
 was having problems with timeouts. Alex solved my problem here on the list,
 but I think I will still have a lot of trouble making the copy work well.

 I open sourced my version here:
 https://github.com/s1mbi0se/cql_record_processor

 Just in case it's useful for anything.

 However, I saw the CQL driver has support for concurrency itself, and
 something made by someone who knows the Python CQL driver better would be
 very helpful.

 My two servers today are at OVH (ovh.com); we have servers at AWS, but in
 several cases we prefer other hosts. Both servers have SSD and 64 GB RAM,
 and I could use the script as a benchmark for you if you want. Besides, we
 have some bigger clusters; I could run it on those just to test the speed
 if that would help.

 Regards
 Marcelo.


 2014-06-03 11:40 GMT-03:00 Laing, Michael michael.la...@nytimes.com:

 Hi Marcelo,

 I could create a fast copy program by repurposing some python apps that
 I am using for benchmarking the python driver - do you still need this?

 With high levels of concurrency and multiple subprocess workers, based
 on my current actual benchmarks, I think I can get well over 1,000
 rows/second on my Mac and significantly more in AWS. I'm using
 variable-size rows averaging 5 KB.

 This would be the initial version of a piece of the benchmark suite we
 will release as part of our nyt⨍aбrik project on 21 June for my
 Cassandra Day NYC talk re the python driver.

 ml


 On Mon, Jun 2, 2014 at 2:15 PM, Marcelo Elias Del Valle 
 marc...@s1mbi0se.com.br wrote:

 Hi Jens,

 Thanks for trying to help.

 Indeed, I know I can't do it using just CQL. But what would you use to
 migrate data manually? I tried to create a Python program using auto
 paging, but I am getting timeouts. I also tried Hive, but had no success.
 I only have two nodes and less than 200 GB in this cluster; any simple
 way to extract the data quickly would be good enough for me.

 Best regards,
 Marcelo.



 2014-06-02 15:08 GMT-03:00 Jens Rantil jens.ran...@tink.se:

 Hi Marcelo,

 Looks like you can't do this without migrating your data manually:
 https://stackoverflow.com/questions/18421668/alter-cassandra-column-family-primary-key-using-cassandra-cli-or-cql

 Cheers,
 Jens


 On Mon, Jun 2, 2014 at 7:48 PM, Marcelo Elias Del Valle 
 marc...@s1mbi0se.com.br wrote:

 Hi,

 I have some CQL CFs in a 2-node Cassandra 2.0.8 cluster.

 I realized I created my column family with the wrong partition key.
 Instead of:

 CREATE TABLE IF NOT EXISTS entity_lookup (
   name varchar,
   value varchar,
   entity_id uuid,
   PRIMARY KEY ((name, value), entity_id))
 WITH caching = 'all';

 I used:

 CREATE TABLE IF NOT EXISTS entitylookup (
   name varchar,
   value varchar,
   entity_id uuid,
   PRIMARY KEY (name, value, entity_id))
 WITH caching = 'all';


 Now I need to migrate the data from the second CF to the first one.
 I am using DataStax Community Edition.

 What would be the best way to convert data from one CF to the other?

 Best regards,
 Marcelo.










Re: migration to a new model

2014-06-03 Thread Laing, Michael
Hi Marcelo,

I could create a fast copy program by repurposing some python apps that I
am using for benchmarking the python driver - do you still need this?

With high levels of concurrency and multiple subprocess workers, based on
my current actual benchmarks, I think I can get well over 1,000 rows/second
on my Mac and significantly more in AWS. I'm using variable-size rows
averaging 5 KB.

This would be the initial version of a piece of the benchmark suite we will
release as part of our nyt⨍aбrik project on 21 June for my Cassandra Day
NYC talk re the python driver.

ml


On Mon, Jun 2, 2014 at 2:15 PM, Marcelo Elias Del Valle 
marc...@s1mbi0se.com.br wrote:

 Hi Jens,

 Thanks for trying to help.

 Indeed, I know I can't do it using just CQL. But what would you use to
 migrate data manually? I tried to create a Python program using auto
 paging, but I am getting timeouts. I also tried Hive, but had no success.
 I only have two nodes and less than 200 GB in this cluster; any simple way
 to extract the data quickly would be good enough for me.

 Best regards,
 Marcelo.



 2014-06-02 15:08 GMT-03:00 Jens Rantil jens.ran...@tink.se:

 Hi Marcelo,

 Looks like you can't do this without migrating your data manually:
 https://stackoverflow.com/questions/18421668/alter-cassandra-column-family-primary-key-using-cassandra-cli-or-cql

 Cheers,
 Jens


 On Mon, Jun 2, 2014 at 7:48 PM, Marcelo Elias Del Valle 
 marc...@s1mbi0se.com.br wrote:

 Hi,

 I have some CQL CFs in a 2-node Cassandra 2.0.8 cluster.

 I realized I created my column family with the wrong partition key.
 Instead of:

 CREATE TABLE IF NOT EXISTS entity_lookup (
   name varchar,
   value varchar,
   entity_id uuid,
   PRIMARY KEY ((name, value), entity_id))
 WITH caching = 'all';

 I used:

 CREATE TABLE IF NOT EXISTS entitylookup (
   name varchar,
   value varchar,
   entity_id uuid,
   PRIMARY KEY (name, value, entity_id))
 WITH caching = 'all';


 Now I need to migrate the data from the second CF to the first one.
 I am using DataStax Community Edition.

 What would be the best way to convert data from one CF to the other?

 Best regards,
 Marcelo.






Re: migration to a new model

2014-06-02 Thread Jens Rantil
Hi Marcelo,

Looks like you can't do this without migrating your data manually:
https://stackoverflow.com/questions/18421668/alter-cassandra-column-family-primary-key-using-cassandra-cli-or-cql

Cheers,
Jens


On Mon, Jun 2, 2014 at 7:48 PM, Marcelo Elias Del Valle 
marc...@s1mbi0se.com.br wrote:

 Hi,

 I have some CQL CFs in a 2-node Cassandra 2.0.8 cluster.

 I realized I created my column family with the wrong partition key.
 Instead of:

 CREATE TABLE IF NOT EXISTS entity_lookup (
   name varchar,
   value varchar,
   entity_id uuid,
   PRIMARY KEY ((name, value), entity_id))
 WITH caching = 'all';

 I used:

 CREATE TABLE IF NOT EXISTS entitylookup (
   name varchar,
   value varchar,
   entity_id uuid,
   PRIMARY KEY (name, value, entity_id))
 WITH caching = 'all';


 Now I need to migrate the data from the second CF to the first one.
 I am using DataStax Community Edition.

 What would be the best way to convert data from one CF to the other?

 Best regards,
 Marcelo.



Re: migration to a new model

2014-06-02 Thread Marcelo Elias Del Valle
Hi Jens,

Thanks for trying to help.

Indeed, I know I can't do it using just CQL. But what would you use to
migrate data manually? I tried to create a Python program using auto
paging, but I am getting timeouts. I also tried Hive, but had no success.
I only have two nodes and less than 200 GB in this cluster; any simple way
to extract the data quickly would be good enough for me.
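
Roughly, what I am trying looks like the sketch below (simplified, not my
exact code) - auto paging via fetch_size on the statement, with the contact
point and keyspace as placeholders:

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])         # placeholder contact point
session = cluster.connect('mykeyspace')  # placeholder keyspace
session.default_timeout = 60.0           # client-side timeout in seconds (driver default is 10)

stmt = SimpleStatement(
    "SELECT name, value, entity_id FROM entitylookup",
    fetch_size=1000)

for row in session.execute(stmt):  # the driver fetches further pages as needed
    print(row.name, row.value, row.entity_id)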

Best regards,
Marcelo.



2014-06-02 15:08 GMT-03:00 Jens Rantil jens.ran...@tink.se:

 Hi Marcelo,

 Looks like you can't do this without migrating your data manually:
 https://stackoverflow.com/questions/18421668/alter-cassandra-column-family-primary-key-using-cassandra-cli-or-cql

 Cheers,
 Jens


 On Mon, Jun 2, 2014 at 7:48 PM, Marcelo Elias Del Valle 
 marc...@s1mbi0se.com.br wrote:

 Hi,

 I have some CQL CFs in a 2-node Cassandra 2.0.8 cluster.

 I realized I created my column family with the wrong partition key.
 Instead of:

 CREATE TABLE IF NOT EXISTS entity_lookup (
   name varchar,
   value varchar,
   entity_id uuid,
   PRIMARY KEY ((name, value), entity_id))
 WITH caching = 'all';

 I used:

 CREATE TABLE IF NOT EXISTS entitylookup (
   name varchar,
   value varchar,
   entity_id uuid,
   PRIMARY KEY (name, value, entity_id))
 WITH caching = 'all';


 Now I need to migrate the data from the second CF to the first one.
 I am using DataStax Community Edition.

 What would be the best way to convert data from one CF to the other?

 Best regards,
 Marcelo.