Wondering how the CQL3 DISTINCT query is implemented

2018-10-22 Thread Jing Meng
Hi, we built a simple system to migrate live Cassandra data to other
databases, mainly by using these two kinds of queries:

1. SELECT DISTINCT TOKEN(partition_key) FROM table WHERE
TOKEN(partition_key) > current_offset AND TOKEN(partition_key) <=
upper_bound LIMIT token_fetch_size
2. Any CQL query that retrieves all rows, given a set of tokens (see the
sketch below)
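
To illustrate the overall pattern, here is a stripped-down sketch with a
made-up table and partition key (the numeric bounds are the Murmur3 token
limits, and 1000 stands in for token_fetch_size):

SELECT DISTINCT TOKEN(id) FROM ks.tbl
  WHERE TOKEN(id) > -9223372036854775808 AND TOKEN(id) <= 9223372036854775807
  LIMIT 1000;

-- if the last token returned was 123456789, the next batch resumes from there
SELECT DISTINCT TOKEN(id) FROM ks.tbl
  WHERE TOKEN(id) > 123456789 AND TOKEN(id) <= 9223372036854775807
  LIMIT 1000;

-- and for each token returned, query 2 fetches that partition's rows, e.g.
SELECT * FROM ks.tbl WHERE TOKEN(id) = 123456789;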

And we observed that the "SELECT DISTINCT TOKEN" query takes much longer when
the table has wide partitions (about 200+ rows per partition on average); it
looks like the underlying operation is not linear.

Does the query scan every row of every partition it finds until
token_fetch_size is met? Or is it due to some low-level operation that is
naturally more time-consuming when dealing with wide-partitioned data?

Any advice on this question, or a pointer to the relevant code, would be
appreciated.


Re: Suggestions for migrating data from Cassandra

2018-05-16 Thread Jing Meng
We will first try migrating some small keyspaces (with several gigabytes of
data across a DC), but ultimately migration of several large keyspaces, with
data sizes ranging from 100 GB to 5 TB and some tables holding more than 1 TB,
will be scheduled too.

As for StreamSets/Talend, I personally doubt they would be appropriate at our
company, as the manpower available for this migration is quite limited.

Arbab's answer actually resolved my initial concern; we're now playing with
the spark-connector.

Thanks for all your replies, much appreciated!

2018-05-16 5:35 GMT+08:00 Joseph Arriola <jcarrio...@gmail.com>:

> Hi Jing.
>
> How much data do you need to migrate, in volume and number of
> tables?
>
> With Spark, could you do the following:
>
>- Read the data and export directly to MySQL.
>- Read the data, export to CSV files, and then load them into MySQL.
>
>
> Could you use other paths such as:
>
>- StreamSets
>- Talend Open Studio
>- Kafka Streams.
>
>
>
>
> 2018-05-15 4:59 GMT-06:00 Jing Meng <self.rel...@gmail.com>:
>
>> Hi guys, for some historical reasons, our Cassandra cluster is currently
>> overloaded and operating it has somehow become a nightmare. Anyway,
>> (sadly) we're planning to migrate the Cassandra data back to MySQL...
>>
>> So we're not quite clear on how to migrate the historical data out of
>> Cassandra.
>>
>> As far as I know there is the COPY command, but I wonder whether it works in
>> a production environment where hundreds of gigabytes of data are present.
>> And, if it does, would it impact server performance significantly?
>>
>> Apart from that, I know the spark-connector can be used to scan data from a
>> C* cluster, but I'm not that familiar with Spark and am still not sure
>> whether writing data to a MySQL database can be done naturally with the
>> spark-connector.
>>
>> Are there any suggestions/best practices/reading materials for doing this?
>>
>> Thanks!
>>
>
>


Suggestions for migrating data from Cassandra

2018-05-15 Thread Jing Meng
Hi guys, for some historical reasons, our Cassandra cluster is currently
overloaded and operating it has somehow become a nightmare. Anyway, (sadly)
we're planning to migrate the Cassandra data back to MySQL...

So we're not quite clear on how to migrate the historical data out of
Cassandra.

As far as I know there is the COPY command, but I wonder whether it works in a
production environment where hundreds of gigabytes of data are present. And,
if it does, would it impact server performance significantly?
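
For reference, what we would naively run from cqlsh is something like the
following (keyspace, table, and output path are made up):

COPY my_keyspace.my_table TO '/data/export/my_table.csv' WITH HEADER = true;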

Apart from that, I know the spark-connector can be used to scan data from a C*
cluster, but I'm not that familiar with Spark and am still not sure whether
writing data to a MySQL database can be done naturally with the
spark-connector.

Are there any suggestions/best practices/reading materials for doing this?

Thanks!


Question about gracefully restarting C* node(s)

2018-01-01 Thread Jing Meng
Hi all.

Recently we made a change to our production C* cluster (2.1.18) - moving the
commit log onto the same SSD where the data is stored - which required
restarting all nodes.

Before restarting a Cassandra node, we ran the following nodetool commands:
$ nodetool disablethrift && sleep 5   # stop serving Thrift clients
$ nodetool disablebinary && sleep 5   # stop serving native-protocol (CQL) clients
$ nodetool disablegossip && sleep 5   # stop gossip so peers mark this node down
$ nodetool drain && sleep 5           # flush memtables and stop accepting writes

It was "graceful" as expected (no significant errors found), but the
process is still a myth to us: are those commands used above "sufficient",
and/or why? The offical doc (docs.datastax.com) did not help with this
operation detail, though "nodetool drain" is apparently essential.


Re: How to minimize side effects induced by tombstones when using deletion?

2017-08-01 Thread Jing Meng
Thanks, we'll try deleting ranges of rows, as that seems to fit our scenario.
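
Just to double-check our reading of that ordering, with a made-up table
(PRIMARY KEY (user_id, day, seq) - the schema here is purely hypothetical), we
understand it roughly as:

-- one partition tombstone: drops the whole partition
DELETE FROM ks.events WHERE user_id = 42;

-- one range tombstone: drops every row under the clustering prefix
-- (on 2.1 this prefix form is, as far as we know, the way to delete a range;
--  slice deletes using < / > need a newer Cassandra)
DELETE FROM ks.events WHERE user_id = 42 AND day = '2017-06-01';

-- one tombstone per row: the pattern to avoid when many rows are involved
DELETE FROM ks.events WHERE user_id = 42 AND day = '2017-06-01' AND seq = 17;

-- and, per the second suggestion, lowering gc_grace_seconds - which we assume
-- is only safe if repairs reliably finish within that window
ALTER TABLE ks.events WITH gc_grace_seconds = 259200;
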
One more question: you mentioned "repair often", and we have seen that advice
several times in the official docs, presentations, blogs, etc.

But when we repair a column family sized in terabytes on a cluster of ~30
nodes, it takes almost a week and almost always ends with some unexpected
failure. Are we missing something here, or is that reasonable at this
magnitude? Also, if a repair completes successfully once, will the next repair
take a more reasonable amount of time?




2017-08-01 14:08 GMT+08:00 Jeff Jirsa <jji...@gmail.com>:

> Delete using as few tombstones as possible (deleting the whole partition
> is better than deleting a row; deleting a range of rows is better than
> deleting many rows in a range).
>
> Repair often and lower gc_grace_seconds so the tombstones can be collected
> more frequently
>
>
> --
> Jeff Jirsa
>
>
> On Jul 31, 2017, at 11:02 PM, Jing Meng <self.rel...@gmail.com> wrote:
>
> Hi there.
>
>
> We have a keyspace containing tons of records, and deletions are used as
> required by its business logic.
>
> As the data accumulates, we are suffering a performance penalty due to
> tombstones, and are still confused about what could be done to minimize the
> harm - or should we avoid deletions altogether and adapt our code?
>
> FYI, in case it matters, we are using C* 2.1.18.
>
>
> Thanks in advance for your reply.
>
>
>


How to minimize side effects induced by tombstones when using deletion?

2017-08-01 Thread Jing Meng
Hi there.


We have a keyspace containing tons of records, and deletions are used as
required by its business logic.

As the data accumulates, we are suffering a performance penalty due to
tombstones, and are still confused about what could be done to minimize the
harm - or should we avoid deletions altogether and adapt our code?

FYI, in case it matters, we are using C* 2.1.18.


Thanks in advance for your reply.