RE: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.

Durity, Sean R Tue, 12 Mar 2019 12:58:07 -0700

If there are 2 access patterns, I would consider having 2 tables. The first one 
with the ID, which you say is the majority use case.  Then have a second table 
that uses a time-bucket approach as others have suggested:
(time bucket, id) as primary key
Choose a time bucket (day, week, hour, month, whatever) that would hold less 
than 100 MB of data in the time-bucket partition.


You could include all relevant data in the second table to meet your query. OR, 
if that data seems too large or too volatile to duplicate, just include your 
primary key and look-up the data in the primary table as needed.

If you use allow filtering, you are setting yourself up for failure to scale. I 
tell my developers, "if you use allow filtering, you are doing it wrong." In 
fact, I think the Cassandra admin should be able to set a flag in 
cassandra.yaml to not allow filtering at all. The cluster should be able to 
protect itself from bad queries.



From: Leena Ghatpande <lghatpa...@hotmail.com>
Sent: Tuesday, March 12, 2019 9:02 AM
To: Stefan Miklosovic <stefan.mikloso...@instaclustr.com>; 
user@cassandra.apache.org
Subject: [EXTERNAL] Re: Migrate large volume of data from one table to another 
table within the same cluster when COPY is not an option.

Our data model cannot be like below as you have recommended as majority of the 
reads need to select the data by the partition key (id) only, not by date.
You could remodel your data in such way that you would make primary key like 
this
((date), hour-minute, id)
or
((date, hour-minute), id)


By adding the date as clustering column, yes the idea was to use the Allow 
Filtering on the date and pull the records. Understand that it is not 
recommended to do this, but we have been doing this on another existing large 
table and have not run into any issue so far. But want to understand if there 
is a better approach to this?

Thanks

________________________________
From: Stefan Miklosovic 
<stefan.mikloso...@instaclustr.com<mailto:stefan.mikloso...@instaclustr.com>>
Sent: Monday, March 11, 2019 7:12 PM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: Re: Migrate large volume of data from one table to another table 
within the same cluster when COPY is not an option.

The query which does not work should be like this, I made a mistake there

cqlsh> SELECT * from my_keyspace.my_table where  number > 2;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot 
execute this query as it might involve data filtering and thus may have 
unpredictable performance. If you want to execute this query despite the 
performance unpredictability, use ALLOW FILTERING"


On Tue, 12 Mar 2019 at 10:10, Stefan Miklosovic 
<stefan.mikloso...@instaclustr.com<mailto:stefan.mikloso...@instaclustr.com>> 
wrote:
Hi Leena,

"We are thinking of creating a new table with a date field as a clustering 
column to be able to query for date ranges, but partition key to clustering key 
will be 1-1. Is this a good approach?"

If you want to select by some time range here, I am wondering how would making 
datetime a clustering column help you here? You still have to provide primary 
key, right?

E.g. select * from your_keyspace.your_table where id=123 and my_date > 
yesterday and my_date < tomorrow (you got the idea)

If you make my_date clustering column, you cant not do this below, because you 
still have to specify partition key fully and then clustering key (optionally) 
where you can further order and do ranges. But you cant do a query without 
specifying partition key. Well, you can use ALLOW FILTERING but you do not want 
to do this at all in your situation as it would scan everything.

select * from your_keyspace.your_table where my_date > yesterday and my_date < 
tomorrow

cqlsh> create KEYSPACE my_keyspace WITH replication = {'class': 
'NetworkTopologyStrategy', 'dc1': '1'};
cqlsh> CREATE TABLE my_keyspace.my_table (id uuid, number int, PRIMARY KEY 
((id), number));

cqlsh> SELECT * from my_keyspace.my_table ;

 id                                   | number
--------------------------------------+--------
 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3
 abdc0184-a695-427d-b63b-57cdf7a45f00 |      1
 90fe112e-0f74-4cbc-8767-67bdc9c8c3b0 |      4
 8cff3eb7-1aff-4dc7-9969-60190c7e4675 |      2

cqlsh> SELECT * from my_keyspace.my_table where id = 
'6e23f79a-8b67-47e0-b8e0-50be78bb1c7f' and  number > 2;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Invalid 
STRING constant (6e23f79a-8b67-47e0-b8e0-50be78bb1c7f) for "id" of type uuid"

cqlsh> SELECT * from my_keyspace.my_table where id = 
6e23f79a-8b67-47e0-b8e0-50be78bb1c7f and  number > 2;

 id                                   | number
--------------------------------------+--------
 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3

You could remodel your data in such way that you would make primary key like 
this

((date), hour-minute, id)

or

((date, hour-minute), id)

I would prefer the second one because if you expect a lot of data per day, they 
would all end up on same set of replicas as hash of partition key would be same 
whole day if you have same date all day so I think you would end up with 
hotspots. You want to have your data spread more evenly so the second one seems 
to be better to me.

You can also investigate how to do this with materialized view but I am not 
sure about the performance here.

If you want to copy data you can do this e.g. by Cassandra Spark connector, you 
would just read table and as you read it you would write to another one. That 
is imho the fastest approach and the least error prone. You can do that on live 
production data and you can just make a "switch" afterwards. Not sure about 
ttls but that should be transparent while copying that.

On Tue, 12 Mar 2019 at 03:04, Leena Ghatpande 
<lghatpa...@hotmail.com<mailto:lghatpa...@hotmail.com>> wrote:

We have a table with over 70M rows with a partition key that is unique.  We 
have a  created datetime stamp on each record, and we have a need to select all 
rows created for a date range. Secondary index is not an option as its high 
cardinality and could slow performance doing a full scan on 70M rows.



We are thinking of creating a new table with a date field as a clustering 
column to be able to query for date ranges, but partition key to clustering key 
will be 1-1. Is this a good approach?

To do this, we need to copy this large volume of data from table1 to table2 
within the same cluster, while updates are still happening to table1. We need 
to do this real time without impacting our customers. COPY is not an option, as 
we have ttl's on each row on table1 that need to be applied to table2 as well.



So what would be the best approach

  1.  To be able select data using date range without impacting performance. 
This operation will be needed only on adhoc basis and it wont be as frequent .
  2.  Best way to migrate large volume of data with ttl from one table to 
another within the same cluster.



Any other suggestions also will be greatly appreciated.



Stefan Miklosovic

Stefan Miklosovic

________________________________

The information in this Internet Email is confidential and may be legally 
privileged. It is intended solely for the addressee. Access to this Email by 
anyone else is unauthorized. If you are not the intended recipient, any 
disclosure, copying, distribution or any action taken or omitted to be taken in 
reliance on it, is prohibited and may be unlawful. When addressed to our 
clients any opinions or advice contained in this Email are subject to the terms 
and conditions expressed in any applicable governing The Home Depot terms of 
business or client engagement letter. The Home Depot disclaims all 
responsibility and liability for the accuracy and content of this attachment 
and for any damages or losses arising from any inaccuracies, errors, viruses, 
e.g., worms, trojan horses, etc., or other items of a destructive nature, which 
may be contained in this attachment and shall not be liable for direct, 
indirect, consequential or special damages in connection with this e-mail 
message or its attachment.

RE: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.

Reply via email to