BTW you might want to put a LIMIT clause on your SELECT for testing. -ml
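
A minimal sketch of that, assuming the SELECT is kept as a plain string. The helper name `with_test_limit` is hypothetical (it is not in the gist); the table and column names are from Marcelo's schema further down the thread.

```python
def with_test_limit(select_cql, limit=None):
    """Return select_cql with a LIMIT clause appended when limit is given.

    Assumes the statement does not already end in LIMIT or ALLOW FILTERING.
    """
    # Drop surrounding whitespace and a trailing semicolon, if any.
    stmt = select_cql.strip().rstrip(";").rstrip()
    if limit is None:
        return stmt  # full copy: statement unchanged
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    return f"{stmt} LIMIT {limit}"


SELECT_CQL = "SELECT name, value, entity_id FROM entitylookup;"
test_select = with_test_limit(SELECT_CQL, 1000)   # trial run
full_select = with_test_limit(SELECT_CQL)         # real run
```

Start with a small limit, check the destination rows, then drop the limit for the full copy.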

On Wed, Jun 4, 2014 at 6:04 PM, Laing, Michael <michael.la...@nytimes.com>
wrote:

> Marcelo,
>
> Here is a link to the preview of the python fast copy program:
>
> https://gist.github.com/michaelplaing/37d89c8f5f09ae779e47
>
> It will copy a table from one cluster to another with some transformation;
> the source and destination can be the same cluster.
>
> It has 3 main throttles to experiment with:
>
>    1. fetch_size: size of source pages in rows
>    2. worker_count: number of worker subprocesses
>    3. concurrency: number of async callback chains per worker subprocess
>
> It is easy to overrun Cassandra and the python driver, so I recommend
> starting with the defaults: fetch_size: 1000; worker_count: 2; concurrency:
> 10.
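
For reference, the three throttles and their suggested defaults could be collected as below. The `Throttles` class and its method are hypothetical names, not taken from the gist. The sketch only shows that the number of callback chains in flight is worker_count times concurrency, so raising either one multiplies the load on both clusters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Throttles:
    fetch_size: int = 1000   # size of source pages, in rows
    worker_count: int = 2    # number of worker subprocesses
    concurrency: int = 10    # async callback chains per worker subprocess

    def total_callback_chains(self):
        """Callback chains running at once across all workers."""
        return self.worker_count * self.concurrency


defaults = Throttles()
chains = defaults.total_callback_chains()
```

With the defaults that is 20 chains; changing one throttle at a time makes it easier to see which one causes timeouts.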
>
> Additionally there are switches to set 'policies' by source and
> destination: retry (downgrade consistency), dc_aware, and token_aware.
> retry is useful if you are getting timeouts. For the others YMMV.
>
> To use it you need to define the SELECT and UPDATE cql statements as well
> as the 'map_fields' method.
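
As a rough illustration for the entitylookup to entity_lookup case at the bottom of this thread, the three pieces might look like the following. This is a sketch under assumptions: the real 'map_fields' signature in the gist may differ, and rows are modelled here as a namedtuple. Because every column of entity_lookup is part of the primary key, the write is shown as an INSERT rather than an UPDATE (an UPDATE needs at least one non-key column to SET).

```python
from collections import namedtuple
from uuid import UUID

# Source and destination statements; "?" marks are bind parameters.
SELECT_CQL = "SELECT name, value, entity_id FROM entitylookup"
INSERT_CQL = (
    "INSERT INTO entity_lookup (name, value, entity_id) VALUES (?, ?, ?)"
)

# Stand-in for a row returned by the driver (assumption for this sketch).
SourceRow = namedtuple("SourceRow", ["name", "value", "entity_id"])


def map_fields(source_row):
    """Map one source row to the bind parameters of INSERT_CQL, in order."""
    return (source_row.name, source_row.value, source_row.entity_id)


row = SourceRow(
    "email", "a@example.com", UUID("12345678-1234-5678-1234-567812345678")
)
params = map_fields(row)
```

Here the mapping is the identity, since only the partition key changes between the two tables; any per-column transformation would go inside `map_fields`.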
>
> The worker subprocesses divide up the token range among themselves and
> proceed quasi-independently. Each worker opens a connection to each cluster
> and the driver sets up connection pools to the nodes in the cluster. Anyway
> there are a lot of processes, threads, callbacks going at once so it is fun
> to watch.
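
The token-range split can be sketched in a few lines. This assumes the Murmur3Partitioner, whose tokens are signed 64-bit integers; `split_token_range` and `range_predicate` are hypothetical helpers, not code from the gist.

```python
MIN_TOKEN = -(2 ** 63)
MAX_TOKEN = 2 ** 63 - 1


def split_token_range(worker_count):
    """Split the full token range into worker_count contiguous pieces.

    Each piece is a (start, end) pair meant to be read as start-exclusive,
    end-inclusive. The first piece can be queried with >= on its lower
    bound if the minimum token must be included.
    """
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [
        MIN_TOKEN + (span * i) // worker_count
        for i in range(worker_count + 1)
    ]
    return list(zip(bounds[:-1], bounds[1:]))


def range_predicate(partition_key_columns):
    """Build the WHERE predicate a worker binds its (start, end) pair to."""
    cols = ", ".join(partition_key_columns)
    return f"token({cols}) > ? AND token({cols}) <= ?"


ranges = split_token_range(2)
predicate = range_predicate(["name"])
```

Each worker then runs the SELECT with its own pair bound to the predicate, so no two workers read the same partition.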
>
> On my regional cluster of small nodes in AWS I got about 3000 rows per
> second transferred after things warmed up a bit, with each row about 6 KB.
>
> ml
>
>
> On Wed, Jun 4, 2014 at 11:49 AM, Laing, Michael
> <michael.la...@nytimes.com> wrote:
>
>> OK Marcelo, I'll work on it today. -ml
>>
>>
>> On Tue, Jun 3, 2014 at 8:24 PM, Marcelo Elias Del Valle <
>> marc...@s1mbi0se.com.br> wrote:
>>
>>> Hi Michael,
>>>
>>> For sure I would be interested in this program!
>>>
>>> I am new to both Python and CQL. I started creating this copier, but
>>> was having problems with timeouts. Alex solved my problem here on the list,
>>> but I think I will still have a lot of trouble making the copy work well.
>>>
>>> I open sourced my version here:
>>> https://github.com/s1mbi0se/cql_record_processor
>>>
>>> Just in case it's useful for anything.
>>>
>>> However, I saw that CQL has concurrency support of its own, and having
>>> something made by someone who knows the Python CQL driver better would be
>>> very helpful.
>>>
>>> My two servers today are at OVH (ovh.com). We have servers at AWS, but in
>>> several cases we prefer other hosts. Both servers have SSDs and 64 GB of
>>> RAM, so I could use the script as a benchmark for you if you want. Besides,
>>> we have some bigger clusters; I could run it on them just to test the speed
>>> if this is going to help.
>>>
>>> Regards
>>> Marcelo.
>>>
>>>
>>> 2014-06-03 11:40 GMT-03:00 Laing, Michael <michael.la...@nytimes.com>:
>>>
>>>> Hi Marcelo,
>>>>
>>>> I could create a fast copy program by repurposing some python apps that
>>>> I am using for benchmarking the python driver - do you still need this?
>>>>
>>>> With high levels of concurrency and multiple subprocess workers, based
>>>> on my current actual benchmarks, I think I can get well over 1,000
>>>> rows/second on my Mac and significantly more in AWS. I'm using
>>>> variable-size rows averaging 5 KB.
>>>>
>>>> This would be the initial version of a piece of the benchmark suite we
>>>> will release as part of our nyt⨍aбrik project on 21 June for my
>>>> Cassandra Day NYC talk re the python driver.
>>>>
>>>> ml
>>>>
>>>>
>>>> On Mon, Jun 2, 2014 at 2:15 PM, Marcelo Elias Del Valle <
>>>> marc...@s1mbi0se.com.br> wrote:
>>>>
>>>>> Hi Jens,
>>>>>
>>>>> Thanks for trying to help.
>>>>>
>>>>> Indeed, I know I can't do it using just CQL. But what would you use to
>>>>> migrate data manually? I tried to create a python program using auto
>>>>> paging, but I am getting timeouts. I also tried Hive, but no success.
>>>>> I only have two nodes and less than 200 GB in this cluster, so any simple
>>>>> way to extract the data quickly would be good enough for me.
>>>>>
>>>>> Best regards,
>>>>> Marcelo.
>>>>>
>>>>>
>>>>>
>>>>> 2014-06-02 15:08 GMT-03:00 Jens Rantil <jens.ran...@tink.se>:
>>>>>
>>>>>> Hi Marcelo,
>>>>>>
>>>>>> Looks like you can't do this without migrating your data manually:
>>>>>> https://stackoverflow.com/questions/18421668/alter-cassandra-column-family-primary-key-using-cassandra-cli-or-cql
>>>>>>
>>>>>> Cheers,
>>>>>> Jens
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 2, 2014 at 7:48 PM, Marcelo Elias Del Valle <
>>>>>> marc...@s1mbi0se.com.br> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have some CQL CFs in a 2-node Cassandra 2.0.8 cluster.
>>>>>>>
>>>>>>> I realized I created my column family with the wrong partition key.
>>>>>>> Instead of:
>>>>>>>
>>>>>>> CREATE TABLE IF NOT EXISTS entity_lookup (
>>>>>>>   name varchar,
>>>>>>>   value varchar,
>>>>>>>   entity_id uuid,
>>>>>>>   PRIMARY KEY ((name, value), entity_id))
>>>>>>> WITH
>>>>>>>     caching=all;
>>>>>>>
>>>>>>> I used:
>>>>>>>
>>>>>>> CREATE TABLE IF NOT EXISTS entitylookup (
>>>>>>>   name varchar,
>>>>>>>   value varchar,
>>>>>>>   entity_id uuid,
>>>>>>>   PRIMARY KEY (name, value, entity_id))
>>>>>>> WITH
>>>>>>>     caching=all;
>>>>>>>
>>>>>>>
>>>>>>> Now I need to migrate the data from the second CF to the first one.
>>>>>>> I am using DataStax Community Edition.
>>>>>>>
>>>>>>> What would be the best way to convert data from one CF to the other?
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Marcelo.
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
