Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20760 )

Change subject: IMPALA-12313: (part 3) Add UPDATE support for Iceberg tables.
......................................................................


Patch Set 10:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/20760/5/be/src/exec/iceberg-delete-sink.cc
File be/src/exec/iceberg-delete-sink.cc:

http://gerrit.cloudera.org:8080/#/c/20760/5/be/src/exec/iceberg-delete-sink.cc@79
PS5, Line 79: VerifyRowsNotDuplicated
> file paths and positions are not sorted across partitions. So we would need
Ok, it can stay as it is.


http://gerrit.cloudera.org:8080/#/c/20760/10/be/src/exec/table-sink-base.h
File be/src/exec/table-sink-base.h:

http://gerrit.cloudera.org:8080/#/c/20760/10/be/src/exec/table-sink-base.h@90
PS10, Line 90: must already fill
Nit: "must already have filled".


http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java
File fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java:

http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@115
PS6, Line 115: If there are duplicates in the JOIN operator
> I'm not sure what is the point here. Duplicates are only possible in the co
It is a bit nit-picky, I meant that in the sentence "If there are duplicates 
[...] then we cannot do duplicate checking in the SINK if ..." the condition at 
the beginning is not necessary - if it happens that there are actually no 
duplicates we still can't check for them if the rows are shuffled 
independently. I'd suggest something like this:

"""
In case of a JOIN, if duplicated rows can be shuffled independently, we cannot 
do duplicate checking in the SINK.
This is the case when ...
"""


http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@126
PS6, Line 126: via UPDATE FROM
> There will be always at least one tableRef because of the target table.
I wanted to ask if it is possible that modifyStmt_.fromClause_.size() == 1.

1. If it is possible, then in that case the exception (currently) won't be 
thrown.
 1a) If it should be thrown we should remove that condition.
 1b) Otherwise, the error message lists the conditions that were needed to 
trigger the error:
  - partition column,
  - non-constant RHS
  -> in this case we should include "more than one table ref in the FROM 
clause" as well

2. If modifyStmt_.fromClause_.size() == 1 is not possible, we should remove the 
relevant part of the condition on L123 and add a precondition check instead.


http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/analysis/OptimizeStmt.java
File fe/src/main/java/org/apache/impala/analysis/OptimizeStmt.java:

http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/analysis/OptimizeStmt.java@101
PS6, Line 101:   public TSortingOrder getSortingOrder() {
> There's good chance we will need it later, e.g. optimizing a table that has
Ok, it should stay then.


http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/planner/IcebergBufferedDeleteSink.java
File fe/src/main/java/org/apache/impala/planner/IcebergBufferedDeleteSink.java:

http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/planner/IcebergBufferedDeleteSink.java@34
PS6, Line 34: TableSink
> It may have some value now, as there are some common fields/methods, but I'
Ok, if IcebergDeleteSink will probably be deleted we can leave it as it is now. 
But we should open a Jira about it then.


http://gerrit.cloudera.org:8080/#/c/20760/7/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/20760/7/testdata/datasets/functional/functional_schema_template.sql@3407
PS7, Line 3407: E TA
> Makes sense, I never really thought about this as I usually re-load my tabl
I agree, let's not make this patch even bigger.


http://gerrit.cloudera.org:8080/#/c/20760/10/testdata/workloads/functional-planner/queries/PlannerTest/insert-sort-by-zorder.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/insert-sort-by-zorder.test:

http://gerrit.cloudera.org:8080/#/c/20760/10/testdata/workloads/functional-planner/queries/PlannerTest/insert-sort-by-zorder.test@400
PS10, Line 400: 1
Are these changes compared to PS7 because of a rebase?


http://gerrit.cloudera.org:8080/#/c/20760/7/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test
File 
testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test:

http://gerrit.cloudera.org:8080/#/c/20760/7/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test@252
PS7, Line 252: FROM clause
> I think yes, otherwise you cannot have a join that produces duplicates.
I asked because I'm unsure whether we should add "multiple tables" to the 
comment. This is related to the comment at 
https://gerrit.cloudera.org/#/c/20760/6/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@126



--
To view, visit http://gerrit.cloudera.org:8080/20760
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2bb97b4454165a292975d88dc9c23adb22ff7315
Gerrit-Change-Number: 20760
Gerrit-PatchSet: 10
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Tamas Mate <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Wed, 13 Dec 2023 13:24:34 +0000
Gerrit-HasComments: Yes

Reply via email to