GitHub user my-ship-it edited a discussion: [Ideas] Should we support ON
CONFLICT to include the distribution keys?
### Description
Hackers,
Currently, when the ON CONFLICT statement involves distributed keys, an error
occurs, as shown below:
```
postgres=# create table tbl(v1 int, v2 int, v3 int, CONSTRAINT xx primary
key(v1, v2)) distributed by(v1, v2);
CREATE TABLE
postgres=# explain insert into tbl values(1, 2) ON CONFLICT(v1, v2) DO UPDATE
SET v2=EXCLUDED.v2;
ERROR: modification of distribution columns in OnConflictUpdate is not
supported
```
However, it can work if it does not include distribution keys
```
postgres=# create table tbl(v1 int, v2 int, v3 int, CONSTRAINT xx primary
key(v1, v2)) distributed by(v1);
CREATE TABLE
postgres=# explain insert into tbl values(1, 2) ON CONFLICT(v1, v2) DO UPDATE
SET v2=EXCLUDED.v2;
QUERY PLAN
-------------------------------------------------
Insert on tbl (cost=0.00..0.03 rows=0 width=0)
Conflict Resolution: UPDATE
Conflict Arbiter Indexes: xx
-> Result (cost=0.00..0.01 rows=1 width=12)
Optimizer: Postgres query optimizer
(5 rows)
It seems that the conflict operation of PG is implemented in the executor phase.
```
This is because we did some strict checks in the function
sanity_check_on_conflict_update, requiring that the columns updated on conflict
cannot include distribution keys.
At least, we could do some simple optimizations and more detailed checks, such
as if the distribution key is not updated,
error won't be thrown out.
Furthermore, if there is a modification of the distribution key, we can borrow
the way of SplitUpdate, introduce a new executor OnConflictSplitUpdate, and
convert the UPSERT operation to INSERT or DELETE + INSERT according to the
index check result. If the OnConflictSplitUpdate node checks the index and
finds a conflict, it will generate two operations, DELETE + INSERT, just like
the SplitUpdate node, otherwise only INSERT will be generated. The final update
is completed by Motion to the nodeModifyTable node (writer Gang) of the upper
layer
```
postgres=# explain insert into tbl values(1, 2) ON CONFLICT(v1, v2) DO UPDATE
SET v1 = EXCLUDED.v1 + 1, v2=EXCLUDED.v1 + EXCLUDED.v2;
QUERY PLAN
----------------------------------------------------------------------------------------------------
-----
Update on public.tbl (cost=0.00..1397.25 rows=0 width=0)
-> Explicit Redistribute Motion 3:3 (slice1; segments: 3)
(cost=0.00..1397.25 rows=51933 width
=22)
Output: ((1 + 1)), ((1 + 2)), ctid, gp_segment_id, (DMLAction)
-> OnConflictSplit (cost=0.00..358.58 rows=51933 width=22)
Conflict Resolution: UPDATE
Conflict Arbiter Indexes: xx
Output: ((1 + 1)), ((1 + 2)), ctid, gp_segment_id, DMLAction
-> Seq Scan on public.tbl (cost=0.00..358.58 rows=25967
width=22)
Output: 1, 2, ctid, gp_segment_id
(9 rows)
```
Additional Motion will introduce costs, and we don't need to generate
OnConflictSplit nodes every time, such as when the distribution key is not
updated.
We need to implement in both GPORCA and legacy planner.
Of course, like the SplitUpdate type, not all queries can support this method,
and we need to do some detailed filtering and processing.
Any ideas are welcome.
### Use case/motivation
_No response_
### Related issues
_No response_
### Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
GitHub link: https://github.com/apache/cloudberry/discussions/902
----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]