RE: [EXTERNAL] n00b q re UPDATE v. INSERT in CQL

2019-10-25 Thread Durity, Sean R
Everything in Cassandra is an insert, so an UPDATE and an INSERT are 
functionally equivalent. An UPDATE doesn't modify the existing data on disk; 
it is a new write of just the columns involved. So, the difference in your 
scenario is that with the "targeted" UPDATE, you are rewriting fewer columns.

So, instead of inserting 10 values (for example), you are inserting 3 (pk1, 
pk2, and col1). This would mean less disk space used during your data cleanup. 
Once Cassandra runs compaction across the data (with the simplifying 
assumption that all of it gets compacted), there would be no difference in 
disk space for the final result.
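A toy sketch of that last-write-wins behavior (plain Python, not Cassandra code; the names and the single-timestamp-per-cell model are simplifications): each write stores (value, timestamp) cells, and "compaction" here just keeps the newest cell per column, so a targeted update and a full re-insert converge to the same row.

```python
# Toy model of last-write-wins cell merging (illustrative, not Cassandra code).
def merge(*writes):
    """Merge column->(value, ts) dicts, keeping the highest timestamp per column."""
    row = {}
    for w in writes:
        for col, (val, ts) in w.items():
            if col not in row or ts > row[col][1]:
                row[col] = (val, ts)
    return row

# Original row written at t=1 with 10 columns.
original = {f"col{i}": (f"v{i}", 1) for i in range(10)}

# A targeted UPDATE at t=2 rewrites only the one column being fixed...
targeted = {"col1": ("lowercased", 2)}

# ...while a full INSERT at t=2 rewrites every column.
full = {f"col{i}": (f"v{i}", 2) for i in range(10)}
full["col1"] = ("lowercased", 2)

after_update = merge(original, targeted)
after_insert = merge(original, full)

# Same logical row either way; the UPDATE simply wrote far fewer cells.
values = lambda row: {c: v for c, (v, _) in row.items()}
assert values(after_update) == values(after_insert)
print(len(targeted), "cells written vs", len(full))
```

Either path yields the same logical row after compaction; the targeted update just writes (and later compacts away) far fewer cells.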

I would do the updates, if the size/scope of the data involved is significant.


Sean Durity – Staff Systems Engineer, Cassandra

-----Original Message-----
From: James A. Robinson 
Sent: Friday, October 25, 2019 10:49 AM
To: user@cassandra.apache.org
Subject: [EXTERNAL] n00b q re UPDATE v. INSERT in CQL

Hi folks,

I'm working on a clean-up task for some bad data in a cassandra db.
The bad data in this case are values with mixed case that will need to
be lowercased.  In some tables the value that needs to be changed is a
primary key, in other cases it is not.

From the reading I've done, the situations where I need to change a
primary key column to lowercase will mean I need to perform an INSERT
of the entire row using the new primary key values merged with the old
non-primary-key values, followed by a DELETE of the old primary key
row.
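For example (a hypothetical helper with placeholder names, just to sketch the two statements involved):

```python
# Hypothetical helper (illustrative names) that builds the two CQL statements
# for "re-keying" a row: INSERT the full row under the new primary key,
# then DELETE the row under the old primary key.
def rekey_cql(table, pk_cols, other_cols):
    cols = pk_cols + other_cols
    insert = (f"INSERT INTO {table} ({', '.join(cols)}) "
              f"VALUES ({', '.join('?' for _ in cols)})")
    where = " AND ".join(f"{c} = ?" for c in pk_cols)
    delete = f"DELETE FROM {table} WHERE {where}"
    return insert, delete

ins, dele = rekey_cql("ks.table", ["pk1", "pk2"], ["col1", "col2"])
print(ins)   # INSERT INTO ks.table (pk1, pk2, col1, col2) VALUES (?, ?, ?, ?)
print(dele)  # DELETE FROM ks.table WHERE pk1 = ? AND pk2 = ?
```

Note the two statements are not atomic on their own; if all-or-nothing behavior matters, they could be wrapped in a logged batch.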

My question is, on a table where I need to update a column that isn't
primary key, should I perform a limited UPDATE in that situation like
I would in SQL:

UPDATE ks.table SET col1 = ? WHERE pk1 = ? AND pk2 = ?

or will there be any downsides to that over an INSERT where I specify
all columns?

INSERT INTO ks.table (pk1, pk2, col1, col2, ...) VALUES (?,?,?,?, ...)

In SQL I'd never question just using the update but my impression
reading the blogosphere is that Cassandra has subtleties that I might
not be grasping when it comes to UPDATE v. INSERT behavior...

Jim

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org






RE: Cassandra Rack - Datacenter Load Balancing relations

2019-10-25 Thread Durity, Sean R
+1 for removing complexity to be able to create (and maintain!) “reasoned” 
systems!


Sean Durity – Staff Systems Engineer, Cassandra

From: Reid Pinchback 
Sent: Thursday, October 24, 2019 10:28 AM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Cassandra Rack - Datacenter Load Balancing relations

Hey Sergio,

Forgive me, but I'm at work and had to skim the info quickly.

When in doubt, simplify.  So 1 rack per DC.  Distributed systems get rapidly 
harder to reason about the more complicated you make them.  There’s more than 
enough to learn about C* without jumping into the complexity too soon.

To deal with the unbalancing issue, pay attention to Jon Haddad’s advice on 
vnode count and how to fairly distribute tokens with a small vnode count.  I’d 
rather point you to his information, as I haven’t dug into vnode counts and 
token distribution in detail; he’s got a lot more time in C* than I do.  I come 
at this more as a traditional RDBMS and Java guy who has slowly gotten up to 
speed on C* over the last few years, and dealt with DynamoDB a lot so have 
lived with a lot of similarity in data modelling concerns.  Detailed internals 
I only know in cases where I had reason to dig into C* source.

There are so many knobs to turn in C* that it can be very easy to overthink 
things.  Simplify where you can.  Remove GC pressure wherever you can.  
Negotiate with your consumers to have data models that make sense for C*.  If 
you have those three criteria foremost in mind, you’ll likely be fine for quite 
some time.  And in the times where something isn’t going well, simpler is 
easier to investigate.

R

From: Sergio <lapostadiser...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wednesday, October 23, 2019 at 3:34 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Cassandra Rack - Datacenter Load Balancing relations

Hi Reid,

Thank you very much for clearing these concepts for me.
https://community.datastax.com/comments/1133/view.html
I posted this question on the DataStax forum regarding our cluster being 
unbalanced, and the reply was that, to be balanced, the number of racks 
should be either 1 or a multiple of the replication factor. I thought then 
that if I have 3 availability zones I should have 3 racks in each datacenter, 
not 2 (us-east-1a, us-east-1b) as I have right now, or, in the simplest case, 
one rack per datacenter.
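A minimal toy model of why that rule holds (illustrative only, not the real NetworkTopologyStrategy logic): if replicas are spread one per rack, wrapping around when racks run out, then with RF=3 and only 2 racks one rack necessarily holds two of the three replicas of every range.

```python
from collections import Counter

# Toy placement (not NetworkTopologyStrategy itself): each token range
# spreads its rf replicas across racks round-robin, one replica per rack
# until the racks run out, then wrapping around.
def replica_racks(racks, rf):
    return Counter(racks[i % len(racks)] for i in range(rf))

print(replica_racks(["us-east-1a", "us-east-1b"], rf=3))
# one rack holds 2 of the 3 replicas of every range: cannot balance

print(replica_racks(["us-east-1a", "us-east-1b", "us-east-1c"], rf=3))
# exactly one replica per rack: balanced
```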



Datacenter: live

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load        Tokens  Owns  Host ID                               Rack
UN  10.1.20.49   289.75 GiB  256     ?     be5a0193-56e7-4d42-8cc8-5d2141ab4872  us-east-1a
UN  10.1.30.112  103.03 GiB  256     ?     e5108a8e-cc2f-4914-a86e-fccf770e3f0f  us-east-1b
UN  10.1.19.163  129.61 GiB  256     ?     3c2efdda-8dd4-4f08-b991-9aff062a5388  us-east-1a
UN  10.1.26.181  145.28 GiB  256     ?     0a8f07ba-a129-42b0-b73a-df649bd076ef  us-east-1b
UN  10.1.17.213  149.04 GiB  256     ?     71563e86-b2ae-4d2c-91c5-49aa08386f67  us-east-1a
DN  10.1.19.198  52.41 GiB   256     ?     613b43c0-0688-4b86-994c-dc772b6fb8d2  us-east-1b
UN  10.1.31.60   195.17 GiB  256     ?     3647fcca-688a-4851-ab15-df36819910f4  us-east-1b
UN  10.1.25.206  100.67 GiB  256     ?     f43532ad-7d2e-4480-a9ce-2529b47f823d  us-east-1b
So each rack label currently matches the availability zone, and we have 3 
datacenters and 2 availability zones, with 2 racks per DC, but the load above 
is clearly unbalanced.
If I have a keyspace with replication factor = 3, and I want to minimize the 
number of nodes needed to scale the cluster up and down while keeping it 
balanced, should I consider an approach like OPTION A)

Node  DC     RACK  AZ
1     read   ONE   us-east-1a
2     read   ONE   us-east-1a
3     read   ONE   us-east-1a
4     write  ONE   us-east-1b
5     write  ONE   us-east-1b
6     write  ONE   us-east-1b

OPTION B)

Node  DC     RACK  AZ
1     read   ONE   us-east-1a
2     read   ONE   us-east-1a
3     read   ONE   us-east-1a
4     write  TWO   us-east-1b
5     write  TWO   us-east-1b
6     write  TWO   us-east-1b
7     read   ONE   us-east-1c
8     write  TWO   us-east-1c
9     read   ONE   us-east-1c

Option B looks unbalanced and I would exclude it.

OPTION C)

Node  DC     RACK  AZ
1     read   ONE   us-east-1a
2     read   ONE   us-east-1b
3     read   ONE   us-east-1c
4     write  TWO   us-east-1a
5     write  TWO   us-east-1b
6     write  TWO   us-east-1c
so I am thinking of A if I have the 
