Batches are for atomicity, not performance.

I would do single deletes with a prepared statement. An IN clause causes extra 
work for the coordinator because multiple partitions are being impacted. So, 
the coordinator has to coordinate all nodes involved in those writes (up to the 
whole cluster). Availability and performance are compromised for multiple 
partition operations. I do not allow them.
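
As an illustration of the single-delete approach, here is a minimal Python sketch. It assumes a `session` object shaped like the DataStax Python driver's `Session` (with `prepare` and `execute_async`). The `key_value` table and `id` column come from the query quoted later in this thread. The function name `delete_each` and the `max_in_flight` window of 32 are hypothetical and would need tuning.

```python
from typing import Any, Iterable, List

# One partition per statement; the table and column names are taken from the thread.
DELETE_CQL = "DELETE FROM key_value WHERE id = ?"


def delete_each(session: Any, ids: Iterable[Any], max_in_flight: int = 32) -> int:
    """Issue one single-partition delete per id and return how many were issued.

    `session` is assumed to expose prepare() and execute_async() like the
    DataStax Python driver's Session. `max_in_flight` is an illustrative cap.
    """
    prepared = session.prepare(DELETE_CQL)  # prepared once, reused for every id
    in_flight: List[Any] = []
    issued = 0
    for row_id in ids:
        in_flight.append(session.execute_async(prepared, (row_id,)))
        issued += 1
        if len(in_flight) >= max_in_flight:
            # Wait for the current window to finish before sending more.
            for future in in_flight:
                future.result()
            in_flight.clear()
    # Drain whatever is left from the last, partially filled window.
    for future in in_flight:
        future.result()
    return issued
```

Each statement touches exactly one partition, so with token-aware routing in the driver it can be sent straight to a replica rather than fanned out by one coordinator.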

Also – TTL at insert (or update) is a much better solution than large purge 
strategies. As someone who spent a month wrangling hundreds of billions of 
deletes, I am an ardent preacher of TTL during design time.
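
A minimal sketch of setting the TTL at write time, so that rows expire on their own instead of needing a later purge. It assumes the `key_value` table from this thread has a `value` column (an assumption, as the schema is not shown) and a `session` shaped like the DataStax Python driver's `Session`. The helper names are hypothetical.

```python
from typing import Any

# USING TTL takes a number of seconds; the `value` column is assumed for illustration.
INSERT_WITH_TTL_CQL = "INSERT INTO key_value (id, value) VALUES (?, ?) USING TTL ?"


def prepare_ttl_insert(session: Any) -> Any:
    """Prepare the insert once so it can be reused for every write."""
    return session.prepare(INSERT_WITH_TTL_CQL)


def insert_expiring(session: Any, prepared: Any, row_id: Any, value: Any,
                    ttl_seconds: int) -> None:
    """Write one row that Cassandra will expire after ttl_seconds."""
    if ttl_seconds <= 0:
        # A TTL of 0 means "never expire" in CQL, which defeats the purpose here.
        raise ValueError("ttl_seconds must be positive")
    session.execute(prepared, (row_id, value, ttl_seconds))
```

A table-level `default_time_to_live` is another way to get the same effect without passing a TTL on every write.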

Sean Durity

From: Attila Wind <attilaw@swf.technology>
Sent: Friday, February 21, 2020 2:52 AM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: IN OPERATOR VS BATCH QUERY

Hi Sergio,

AFAIK you use batches when you want an "all or nothing" approach from 
Cassandra, i.e. turning multiple statements into one atomic operation.

One very typical use case for this is when you have denormalized data in 
multiple tables (optimized for different queries) but you need to modify all of 
them the same way, as if they were just one entity.

This means that if any of your delete statements failed for whatever reason, 
then all of your delete statements would be rolled back.

I think you don't want that overhead here, for sure...

We are not there yet with our development but we will need similar "cleanup" 
functionality soon.
I was also thinking about the IN operator for similar cases, but I am curious if 
anyone here has a better idea...
Why does the IN operator blow up the coordinator? I do not entirely get it...

Thanks
Attila

Sergio <lapostadiser...@gmail.com> wrote (Fri, Feb 21, 2020, 3:44):
The current approach is "delete from key_value where id = whatever", and it is 
performed asynchronously from the client.
I was thinking of at least reducing the network round-trips between client and 
coordinator with that batch approach. :)

In any case, I will test whether it improves things or not. So when do you use 
batches then?

Best,

Sergio

On Thu, Feb 20, 2020, 6:18 PM Erick Ramirez <erick.rami...@datastax.com> wrote:
Batches aren't really meant for optimisation in the same way as in an RDBMS. If 
anything, a batch will just put pressure on the coordinator, which has to fire 
off multiple requests to lots of replicas. The IN operator falls into the same 
category, and I personally wouldn't use it with more than 2 or 3 partitions 
because then the coordinator will suffer from the same problem.

If it were me, I'd just issue single-partition deletes and throttle it to a 
"reasonable" throughput that your cluster can handle. The word "reasonable" is 
in quotes because only you can determine that magic number for your cluster 
through testing. Cheers!
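
One way to sketch the throttling idea in Python is a small pacing generator that hands out ids no faster than a chosen rate. The name `paced` and its parameters are hypothetical, and the rate itself is exactly the "magic number" that has to come from testing. The `clock` and `sleep` arguments exist only so the pacing can be exercised without real waiting.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def paced(items: Iterable[T], per_second: float,
          clock: Callable[[], float] = time.monotonic,
          sleep: Callable[[float], None] = time.sleep) -> Iterator[T]:
    """Yield items no faster than per_second, without bursting to catch up."""
    if per_second <= 0:
        raise ValueError("per_second must be positive")
    interval = 1.0 / per_second
    next_slot = clock()
    for item in items:
        now = clock()
        if now < next_slot:
            # Ahead of schedule: wait until the next slot opens.
            sleep(next_slot - now)
        else:
            # Running late: restart the schedule from now instead of bursting.
            next_slot = now
        next_slot += interval
        yield item
```

A caller would wrap the id stream, for example `for row_id in paced(ids, per_second=500): session.execute(prepared, (row_id,))`, where 500 is only a placeholder and `session` and `prepared` are assumed to come from the driver in use.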
