Re: migration to a new model
Michael, I will try to test it by tomorrow and will let you know all the results. Thanks a lot! Best regards, Marcelo.

2014-06-04 22:28 GMT-03:00 Laing, Michael michael.la...@nytimes.com: BTW you might want to put a LIMIT clause on your SELECT for testing. -ml
Re: migration to a new model
OK Marcelo, I'll work on it today. -ml
Re: migration to a new model
BTW you might want to put a LIMIT clause on your SELECT for testing. -ml

On Wed, Jun 4, 2014 at 6:04 PM, Laing, Michael michael.la...@nytimes.com wrote:

Marcelo,

Here is a link to the preview of the Python fast copy program: https://gist.github.com/michaelplaing/37d89c8f5f09ae779e47

It will copy a table from one cluster to another with some transformation; the source and destination can be the same cluster.

It has 3 main throttles to experiment with:

1. fetch_size: size of source pages in rows
2. worker_count: number of worker subprocesses
3. concurrency: number of async callback chains per worker subprocess

It is easy to overrun Cassandra and the Python driver, so I recommend starting with the defaults: fetch_size: 1000; worker_count: 2; concurrency: 10.

Additionally there are switches to set 'policies' by source and destination: retry (downgrade consistency), dc_aware, and token_aware. retry is useful if you are getting timeouts. For the others YMMV.

To use it you need to define the SELECT and UPDATE CQL statements as well as the 'map_fields' method.

The worker subprocesses divide up the token range among themselves and proceed quasi-independently. Each worker opens a connection to each cluster and the driver sets up connection pools to the nodes in the cluster. Anyway there are a lot of processes, threads, and callbacks going at once so it is fun to watch.

On my regional cluster of small nodes in AWS I got about 3000 rows per second transferred after things warmed up a bit - each row about 6 KB.

ml

On Wed, Jun 4, 2014 at 11:49 AM, Laing, Michael michael.la...@nytimes.com wrote:

OK Marcelo, I'll work on it today. -ml

On Tue, Jun 3, 2014 at 8:24 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote:

Hi Michael,

For sure I would be interested in this program! I am new to both Python and CQL. I started creating this copier, but was having problems with timeouts. Alex solved my problem here on the list, but I think I will still have a lot of trouble making the copy work well.

I open sourced my version here: https://github.com/s1mbi0se/cql_record_processor Just in case it's useful for anything. However, I saw CQL has support for concurrency itself, and having something made by someone who knows the Python CQL driver better would be very helpful.

My two servers today are at OVH (ovh.com); we have servers at AWS, but in several cases we prefer other hosts. Both servers have SSDs and 64 GB RAM; I could use the script as a benchmark for you if you want. Besides, we have some bigger clusters; I could run it on them just to test the speed if this is going to help.

Regards, Marcelo.

2014-06-03 11:40 GMT-03:00 Laing, Michael michael.la...@nytimes.com:

Hi Marcelo,

I could create a fast copy program by repurposing some Python apps that I am using for benchmarking the Python driver - do you still need this?

With high levels of concurrency and multiple subprocess workers, based on my current benchmarks, I think I can get well over 1,000 rows/second on my Mac and significantly more in AWS. I'm using variable-size rows averaging 5 KB.

This would be the initial version of a piece of the benchmark suite we will release as part of our nyt⨍aбrik project on 21 June for my Cassandra Day NYC talk on the Python driver.

ml

On Mon, Jun 2, 2014 at 2:15 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote:

Hi Jens,

Thanks for trying to help. Indeed, I know I can't do it using just CQL. But what would you use to migrate data manually? I tried to create a Python program using auto paging, but I am getting timeouts. I also tried Hive, but with no success. I only have two nodes and less than 200 GB in this cluster, so any simple way to extract the data quickly would be good enough for me.

Best regards, Marcelo.

2014-06-02 15:08 GMT-03:00 Jens Rantil jens.ran...@tink.se:

Hi Marcelo,

Looks like you can't do this without migrating your data manually: https://stackoverflow.com/questions/18421668/alter-cassandra-column-family-primary-key-using-cassandra-cli-or-cql

Cheers, Jens

On Mon, Jun 2, 2014 at 7:48 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote:

Hi,

I have some CQL CFs in a 2-node Cassandra 2.0.8 cluster. I realized I created my column family with the wrong partition key. Instead of:

    CREATE TABLE IF NOT EXISTS entity_lookup (
      name varchar,
      value varchar,
      entity_id uuid,
      PRIMARY KEY ((name, value), entity_id))
    WITH caching=all;

I used:

    CREATE TABLE IF NOT EXISTS entitylookup (
      name varchar,
      value varchar,
      entity_id uuid,
      PRIMARY KEY (name, value, entity_id))
    WITH caching=all;

Now I need to migrate the data from the second CF to the first one. I am using DataStax Community Edition. What would be the best way to convert data from one CF to the other?

Best regards, Marcelo.
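The fast copy description above says the worker subprocesses divide up the token range among themselves, and the follow-up suggests a LIMIT clause while testing. A minimal sketch of how such a split could look is below. It is not taken from the gist: it assumes the Murmur3Partitioner token range (-2^63 to 2^63 - 1), and the names `split_token_range` and `select_for_range` are hypothetical helpers introduced only for illustration.

```python
# Assumption: the cluster uses Murmur3Partitioner, whose tokens are signed 64-bit integers.
MIN_TOKEN = -(2 ** 63)
MAX_TOKEN = 2 ** 63 - 1


def split_token_range(worker_count, min_token=MIN_TOKEN, max_token=MAX_TOKEN):
    """Return worker_count contiguous, inclusive (start, end) token ranges."""
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    total = max_token - min_token + 1
    ranges = []
    start = min_token
    for i in range(worker_count):
        # Spread any remainder over the first few workers.
        size = total // worker_count + (1 if i < total % worker_count else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def select_for_range(table, partition_key_columns, start, end, limit=None):
    """Build a SELECT restricted to one worker's token range (hypothetical helper)."""
    key = ", ".join(partition_key_columns)
    cql = "SELECT * FROM {0} WHERE token({1}) >= {2} AND token({1}) <= {3}".format(
        table, key, start, end
    )
    if limit is not None:
        # A LIMIT keeps test runs small, as suggested in the thread.
        cql += " LIMIT {0}".format(limit)
    return cql


if __name__ == "__main__":
    # The source table 'entitylookup' has 'name' as its partition key.
    for start, end in split_token_range(2):
        print(select_for_range("entitylookup", ["name"], start, end, limit=100))
```

Each worker would then page through its own range independently, which is one way to read "proceed quasi-independently" in the description.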
Re: migration to a new model
Hi Marcelo, I could create a fast copy program by repurposing some Python apps that I am using for benchmarking the Python driver - do you still need this? With high levels of concurrency and multiple subprocess workers, based on my current benchmarks, I think I can get well over 1,000 rows/second on my Mac and significantly more in AWS. I'm using variable-size rows averaging 5 KB. This would be the initial version of a piece of the benchmark suite we will release as part of our nyt⨍aбrik project on 21 June for my Cassandra Day NYC talk on the Python driver. ml
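The message above relies on high levels of concurrency inside each worker, and elsewhere the thread describes this as a number of async callback chains. The sketch below illustrates that general pattern only; it is not the gist's code. A `ThreadPoolExecutor` stands in for the driver's asynchronous execution, and `copy_with_chains` and `write_row` are hypothetical names. Each chain writes one row and starts its next write from the completion callback, so at most `concurrency` writes are in flight at a time.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def copy_with_chains(rows, write_row, concurrency=10):
    """Write every row with at most `concurrency` writes in flight.

    Returns (rows_written, list_of_exceptions). The thread pool is only a
    stand-in for an async driver call; this is an illustrative assumption.
    """
    row_iter = iter(rows)
    lock = threading.Lock()
    finished = threading.Event()
    state = {"active_chains": concurrency, "written": 0, "errors": []}

    def next_step():
        # Take the next row, or retire this chain once the source is drained.
        with lock:
            row = next(row_iter, None)
            if row is None:
                state["active_chains"] -= 1
                if state["active_chains"] == 0:
                    finished.set()
                return
        pool.submit(write_row, row).add_done_callback(on_done)

    def on_done(future):
        # Record the outcome, then continue the chain with the next row.
        with lock:
            error = future.exception()
            if error is not None:
                state["errors"].append(error)
            else:
                state["written"] += 1
        next_step()

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for _ in range(concurrency):
            next_step()
        finished.wait()
    return state["written"], state["errors"]


if __name__ == "__main__":
    copied = []
    print(copy_with_chains(range(20), copied.append, concurrency=4))
```

Because each callback starts the next write directly, writes that complete instantly can nest callbacks deeply; a production version would need to guard against that and decide how to retry failed rows.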
Re: migration to a new model
Hi Marcelo,

Looks like you can't do this without migrating your data manually: https://stackoverflow.com/questions/18421668/alter-cassandra-column-family-primary-key-using-cassandra-cli-or-cql

Cheers, Jens

On Mon, Jun 2, 2014 at 7:48 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote:

Hi,

I have some CQL CFs in a 2-node Cassandra 2.0.8 cluster. I realized I created my column family with the wrong partition key. Instead of:

    CREATE TABLE IF NOT EXISTS entity_lookup (
      name varchar,
      value varchar,
      entity_id uuid,
      PRIMARY KEY ((name, value), entity_id))
    WITH caching=all;

I used:

    CREATE TABLE IF NOT EXISTS entitylookup (
      name varchar,
      value varchar,
      entity_id uuid,
      PRIMARY KEY (name, value, entity_id))
    WITH caching=all;

Now I need to migrate the data from the second CF to the first one. I am using DataStax Community Edition. What would be the best way to convert data from one CF to the other?

Best regards, Marcelo.
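The two definitions above differ only in the partition key: `entity_lookup` partitions on the composite `(name, value)`, while `entitylookup` partitions on `name` alone, with `value` and `entity_id` as clustering columns. Since the columns themselves are identical, a manual migration only has to read each row and write it unchanged into the new table. A minimal sketch of the statements and a field-mapping function follows. The names `SOURCE_SELECT`, `DEST_INSERT` and the `map_fields` signature are assumptions for illustration and may not match the gist. An INSERT is used for the write because, as far as I know, every column of `entity_lookup` is part of the primary key, which leaves an UPDATE nothing to SET.

```python
import uuid

# Read from the table that was created with the wrong partition key.
SOURCE_SELECT = "SELECT name, value, entity_id FROM entitylookup"

# Write into the table with the intended composite partition key.
# '?' placeholders assume the statement will be prepared by the driver.
DEST_INSERT = "INSERT INTO entity_lookup (name, value, entity_id) VALUES (?, ?, ?)"


def map_fields(source_row):
    """Map one source row (a dict-like object) to bind values for DEST_INSERT.

    The signature is hypothetical; no transformation is needed here because
    both tables have the same columns.
    """
    return (source_row["name"], source_row["value"], source_row["entity_id"])


if __name__ == "__main__":
    sample = {"name": "email", "value": "a@example.com", "entity_id": uuid.uuid4()}
    print(DEST_INSERT, map_fields(sample))
```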
Re: migration to a new model
Hi Jens, Thanks for trying to help. Indeed, I know I can't do it using just CQL. But what would you use to migrate data manually? I tried to create a Python program using auto paging, but I am getting timeouts. I also tried Hive, but with no success. I only have two nodes and less than 200 GB in this cluster, so any simple way to extract the data quickly would be good enough for me. Best regards, Marcelo.