BTW you might want to put a LIMIT clause on your SELECT for testing. -ml
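The fast-copy program described in the quoted thread below divides the full token range among worker subprocesses, each paging through its own slice. A minimal sketch of that splitting, assuming the Murmur3Partitioner's token range; the function name is illustrative, not taken from the actual gist:

```python
# Hypothetical sketch of dividing the Murmur3Partitioner token range
# among worker subprocesses. Names here are illustrative only.

MIN_TOKEN = -2**63       # Murmur3Partitioner minimum token
MAX_TOKEN = 2**63 - 1    # Murmur3Partitioner maximum token

def split_token_range(worker_count, min_token=MIN_TOKEN, max_token=MAX_TOKEN):
    """Return worker_count contiguous (start, end) pairs covering the range."""
    total = max_token - min_token + 1
    step = total // worker_count
    ranges = []
    start = min_token
    for i in range(worker_count):
        # the last worker absorbs the rounding remainder
        end = max_token if i == worker_count - 1 else start + step - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

# Each worker would then page through its slice with a query such as:
#   SELECT ... FROM ks.tbl
#   WHERE token(partition_key) >= ? AND token(partition_key) <= ?
```

Because the slices are disjoint, the workers can proceed quasi-independently, as the quoted message describes.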
On Wed, Jun 4, 2014 at 6:04 PM, Laing, Michael <michael.la...@nytimes.com> wrote:

> Marcelo,
>
> Here is a link to the preview of the python fast copy program:
>
> https://gist.github.com/michaelplaing/37d89c8f5f09ae779e47
>
> It will copy a table from one cluster to another with some transformation - the source and destination can be the same cluster.
>
> It has 3 main throttles to experiment with:
>
> 1. fetch_size: size of source pages in rows
> 2. worker_count: number of worker subprocesses
> 3. concurrency: number of async callback chains per worker subprocess
>
> It is easy to overrun Cassandra and the python driver, so I recommend starting with the defaults: fetch_size: 1000; worker_count: 2; concurrency: 10.
>
> Additionally there are switches to set 'policies' by source and destination: retry (downgrade consistency), dc_aware, and token_aware. retry is useful if you are getting timeouts. For the others YMMV.
>
> To use it you need to define the SELECT and UPDATE cql statements as well as the 'map_fields' method.
>
> The worker subprocesses divide up the token range among themselves and proceed quasi-independently. Each worker opens a connection to each cluster and the driver sets up connection pools to the nodes in the cluster. Anyway, there are a lot of processes, threads, and callbacks going at once, so it is fun to watch.
>
> On my regional cluster of small nodes in AWS I got about 3000 rows per second transferred after things warmed up a bit - each row about 6kb.
>
> ml
>
>
> On Wed, Jun 4, 2014 at 11:49 AM, Laing, Michael <michael.la...@nytimes.com> wrote:
>
>> OK Marcelo, I'll work on it today. -ml
>>
>>
>> On Tue, Jun 3, 2014 at 8:24 PM, Marcelo Elias Del Valle <marc...@s1mbi0se.com.br> wrote:
>>
>>> Hi Michael,
>>>
>>> For sure I would be interested in this program!
>>>
>>> I am new both to python and to cql. I started creating this copier, but was having problems with timeouts.
>>> Alex solved my problem here on the list, but I think I will still have a lot of trouble making the copy work well.
>>>
>>> I open sourced my version here:
>>> https://github.com/s1mbi0se/cql_record_processor
>>>
>>> Just in case it's useful for anything.
>>>
>>> However, I saw the Python CQL driver has support for concurrency itself, and having something made by someone who knows it better would be very helpful.
>>>
>>> My two servers today are at OVH (ovh.com); we have servers at AWS, but in several cases we prefer other hosts. Both servers have SSDs and 64 GB RAM, so I could use the script as a benchmark for you if you want. Besides, we have some bigger clusters; I could run it on those just to test the speed, if that would help.
>>>
>>> Regards,
>>> Marcelo.
>>>
>>>
>>> 2014-06-03 11:40 GMT-03:00 Laing, Michael <michael.la...@nytimes.com>:
>>>
>>>> Hi Marcelo,
>>>>
>>>> I could create a fast copy program by repurposing some python apps that I am using for benchmarking the python driver - do you still need this?
>>>>
>>>> With high levels of concurrency and multiple subprocess workers, based on my current actual benchmarks, I think I can get well over 1,000 rows/second on my mac and significantly more in AWS. I'm using variable size rows averaging 5kb.
>>>>
>>>> This would be the initial version of a piece of the benchmark suite we will release as part of our nyt⨍aбrik project on 21 June for my Cassandra Day NYC talk re the python driver.
>>>>
>>>> ml
>>>>
>>>>
>>>> On Mon, Jun 2, 2014 at 2:15 PM, Marcelo Elias Del Valle <marc...@s1mbi0se.com.br> wrote:
>>>>
>>>>> Hi Jens,
>>>>>
>>>>> Thanks for trying to help.
>>>>>
>>>>> Indeed, I know I can't do it using just CQL. But what would you use to migrate the data manually? I tried to create a python program using auto paging, but I am getting timeouts. I also tried Hive, but no success.
>>>>> I only have two nodes and less than 200 GB in this cluster; any simple way to extract the data quickly would be good enough for me.
>>>>>
>>>>> Best regards,
>>>>> Marcelo.
>>>>>
>>>>>
>>>>> 2014-06-02 15:08 GMT-03:00 Jens Rantil <jens.ran...@tink.se>:
>>>>>
>>>>>> Hi Marcelo,
>>>>>>
>>>>>> Looks like you can't do this without migrating your data manually:
>>>>>> https://stackoverflow.com/questions/18421668/alter-cassandra-column-family-primary-key-using-cassandra-cli-or-cql
>>>>>>
>>>>>> Cheers,
>>>>>> Jens
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 2, 2014 at 7:48 PM, Marcelo Elias Del Valle <marc...@s1mbi0se.com.br> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have some cql CFs in a 2 node Cassandra 2.0.8 cluster.
>>>>>>>
>>>>>>> I realized I created my column family with the wrong partition key. Instead of:
>>>>>>>
>>>>>>> CREATE TABLE IF NOT EXISTS entity_lookup (
>>>>>>>   name varchar,
>>>>>>>   value varchar,
>>>>>>>   entity_id uuid,
>>>>>>>   PRIMARY KEY ((name, value), entity_id))
>>>>>>> WITH caching='all';
>>>>>>>
>>>>>>> I used:
>>>>>>>
>>>>>>> CREATE TABLE IF NOT EXISTS entitylookup (
>>>>>>>   name varchar,
>>>>>>>   value varchar,
>>>>>>>   entity_id uuid,
>>>>>>>   PRIMARY KEY (name, value, entity_id))
>>>>>>> WITH caching='all';
>>>>>>>
>>>>>>> Now I need to migrate the data from the second CF (entitylookup) to the first one (entity_lookup). I am using DataStax Community Edition.
>>>>>>>
>>>>>>> What would be the best way to convert data from one CF to the other?
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Marcelo.