Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/20760 )
Change subject: IMPALA-12313: (part 3) Add UPDATE support for Iceberg tables. ...................................................................... Patch Set 10: (9 comments) http://gerrit.cloudera.org:8080/#/c/20760/5/be/src/exec/iceberg-delete-sink.cc File be/src/exec/iceberg-delete-sink.cc: http://gerrit.cloudera.org:8080/#/c/20760/5/be/src/exec/iceberg-delete-sink.cc@79 PS5, Line 79: VerifyRowsNotDuplicated > file paths and positions are not sorted across partitions. So we would need Ok, it can stay as it is. http://gerrit.cloudera.org:8080/#/c/20760/10/be/src/exec/table-sink-base.h File be/src/exec/table-sink-base.h: http://gerrit.cloudera.org:8080/#/c/20760/10/be/src/exec/table-sink-base.h@90 PS10, Line 90: must already fill Nit: "must already have filled". http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java File fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java: http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@115 PS6, Line 115: If there are duplicates in the JOIN operator > I'm not sure what is the point here. Duplicates are only possible in the co It is a bit nit-picky, I meant that in the sentence "If there are duplicates [...] then we cannot do duplicate checking in the SINK if ..." the condition at the beginning is not necessary - if it happens that there are actually no duplicates we still can't check for them if the rows are shuffled independently. I'd suggest something like this: """ In case of a JOIN, if duplicated rows can be shuffled independently, we cannot do duplicate checking in the SINK. This is the case when ... """ http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@126 PS6, Line 126: via UPDATE FROM > There will be always at least one tableRef because of the target table. I wanted to ask if it is possible that modifyStmt_.fromClause_.size() == 1. 1. If it is possible, then in that case the exception (currently) won't be thrown. 1a) If it should be thrown we should remove that condition. 1b) Otherwise, the error message lists the conditions that were needed to trigger the error: - partition column, - non-constant RHS -> in this case we should include "more than one table ref in the FROM clause" as well 2. If modifyStmt_.fromClause_.size() == 1 is not possible, we should remove the relevant part of the condition on L123 and add a precondition check instead. http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/analysis/OptimizeStmt.java File fe/src/main/java/org/apache/impala/analysis/OptimizeStmt.java: http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/analysis/OptimizeStmt.java@101 PS6, Line 101: public TSortingOrder getSortingOrder() { > There's good chance we will need it later, e.g. optimizing a table that has Ok, it should stay then. http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/planner/IcebergBufferedDeleteSink.java File fe/src/main/java/org/apache/impala/planner/IcebergBufferedDeleteSink.java: http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/planner/IcebergBufferedDeleteSink.java@34 PS6, Line 34: TableSink > It may have some value now, as there are some common fields/methods, but I' Ok, if IcebergDeleteSink will probably be deleted we can leave it as it is now. But we should open a Jira about it then. http://gerrit.cloudera.org:8080/#/c/20760/7/testdata/datasets/functional/functional_schema_template.sql File testdata/datasets/functional/functional_schema_template.sql: http://gerrit.cloudera.org:8080/#/c/20760/7/testdata/datasets/functional/functional_schema_template.sql@3407 PS7, Line 3407: E TA > Makes sense, I never really thought about this as I usually re-load my tabl I agree, let's not make this patch even bigger. http://gerrit.cloudera.org:8080/#/c/20760/10/testdata/workloads/functional-planner/queries/PlannerTest/insert-sort-by-zorder.test File testdata/workloads/functional-planner/queries/PlannerTest/insert-sort-by-zorder.test: http://gerrit.cloudera.org:8080/#/c/20760/10/testdata/workloads/functional-planner/queries/PlannerTest/insert-sort-by-zorder.test@400 PS10, Line 400: 1 Are these changes compared to PS7 because of a rebase? http://gerrit.cloudera.org:8080/#/c/20760/7/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test File testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test: http://gerrit.cloudera.org:8080/#/c/20760/7/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test@252 PS7, Line 252: FROM clause > I think yes, otherwise you cannot have a join that produces duplicates. I asked because I'm unsure whether we should add "multiple tables" to the comment. This is related to the comment at https://gerrit.cloudera.org/#/c/20760/6/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@126 -- To view, visit http://gerrit.cloudera.org:8080/20760 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2bb97b4454165a292975d88dc9c23adb22ff7315 Gerrit-Change-Number: 20760 Gerrit-PatchSet: 10 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Tamas Mate <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Wed, 13 Dec 2023 13:24:34 +0000 Gerrit-HasComments: Yes
