Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/20677 )
Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables ...................................................................... Patch Set 3: (50 comments) http://gerrit.cloudera.org:8080/#/c/20677/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/20677/3//COMMIT_MSG@10 PS3, Line 10: any of : the followings are Nit: "any of the following is". http://gerrit.cloudera.org:8080/#/c/20677/3//COMMIT_MSG@46 PS3, Line 46: Insert sink (i, s, 5) Just an idea: How difficult / expensive would it be to check the rows we update whether the new row value differs from the old one? We could skip rows that already have the desired value. Even if it's worth the effort it should probably be in a different change. http://gerrit.cloudera.org:8080/#/c/20677/3//COMMIT_MSG@66 PS3, Line 66: Why this patch has Nit: "Why does this patch have..." would be better. http://gerrit.cloudera.org:8080/#/c/20677/3//COMMIT_MSG@80 PS3, Line 80: smaller problem : then inefficient data files Why is it a smaller problem? Is it because we expect that there is more data in data files compared to delete files? http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/iceberg-delete-sink.cc File be/src/exec/iceberg-delete-sink.cc: http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/iceberg-delete-sink.cc@117 PS3, Line 117: existing_partition Not modified in this change, but does this variable do anything? http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/iceberg-delete-sink.cc@137 PS3, Line 137: dml_exec_state_ Does 'dml_exec_state_' need to be the same as 'state->dml_exec_state()', which was used before this patch? If so, we could add a DCHECK. http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/multi-table-sink.h File be/src/exec/multi-table-sink.h: http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/multi-table-sink.h@69 PS3, Line 69: /// END. I'd consider writing the explanatory comment of BEGIN also here ("Following methods just delegate...") because in case other methods are added it could become difficult to understand what END refers to. http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/multi-table-sink.cc File be/src/exec/multi-table-sink.cc: http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/multi-table-sink.cc@52 PS3, Line 52: (TableSinkBase*) Instead of a C-style cast we could use static_cast, maybe with a DCHECK containing a dynamic cast. http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/multi-table-sink.cc@94 PS3, Line 94: closed_ = true; Shouldn't DataSink::Close() already set 'closed_' to true? We could also replace it with DCHECK(closed_). http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/table-sink-base.cc File be/src/exec/table-sink-base.cc: http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/table-sink-base.cc@419 PS3, Line 419: if (dml_exec_state == nullptr) dml_exec_state = state->dml_exec_state(); When is it possible that 'dml_exec_state' is not null and is not the same as 'state->dml_exec_state()'? For example, is it possible if 'is_delete' is false? http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/table-sink-base.cc@430 PS3, Line 430: IsIceberg() If 'is_delete' is true, shouldn't this always be true also? We could add a DCHECK for it. http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/runtime/dml-exec-state.cc File be/src/runtime/dml-exec-state.cc: http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/runtime/dml-exec-state.cc@483 PS3, Line 483: ASDF What does ASDF mean? http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/runtime/dml-exec-state.cc@565 PS3, Line 565: is_iceberg Shouldn't 'is_iceberg' always be true for delete files? http://gerrit.cloudera.org:8080/#/c/20677/3/common/protobuf/control_service.proto File common/protobuf/control_service.proto: http://gerrit.cloudera.org:8080/#/c/20677/3/common/protobuf/control_service.proto@95 PS3, Line 95: 8 Did you intentionally skip 7? http://gerrit.cloudera.org:8080/#/c/20677/3/common/thrift/DataSinks.thrift File common/thrift/DataSinks.thrift: http://gerrit.cloudera.org:8080/#/c/20677/3/common/thrift/DataSinks.thrift@187 PS3, Line 187: 10: optional list<TDataSink> data_sinks Maybe 'child_data_sinks' would be better. Could also add a comment that it is used for the children of MULTI_DATA_SINKs. http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java File fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java: http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java@65 PS3, Line 65: private final Map<FeTable, Integer> additionalTargetTableIds_ = new HashMap<>(); Shouldn't this come directly after 'targetTable_'? http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/IcebergDeleteImpl.java File fe/src/main/java/org/apache/impala/analysis/IcebergDeleteImpl.java: http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/IcebergDeleteImpl.java@20 PS3, Line 20: import java.util.HashMap; : import java.util.HashSet; Unused imports. http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/IcebergDeleteImpl.java@23 PS3, Line 23: import java.util.Map; : import java.util.Set; : import java.util.stream.Collectors; Unused imports. http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/IcebergDeleteImpl.java@27 PS3, Line 27: import org.apache.impala.catalog.Column; Unused import. http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java File fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java: http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java@36 PS3, Line 36: createSourceStmt I can't find the 'createSourceStmt()' that sets these variables. Aren't they set in 'buildAndValidateSelectExprs()'? http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java@53 PS3, Line 53: // Make the virtual position delete table the new target table. This comment seems stale, the following line was deleted. http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java File fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java: http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@48 PS3, Line 48: private int deleteTableId_ = -1; Optional: could add a comment that it is set in 'analyze()'. http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@51 PS3, Line 51: createSourceStmt I can't find the 'createSourceStmt()' that sets these variables. Aren't they set in 'buildAndValidateSelectExprs()'? http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@75 PS3, Line 75: delete "update mode"? http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@91 PS3, Line 91: being I don't think "being" is needed here. http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@123 PS3, Line 123: String.format("Left-hand side in assignment '%s = %s' refers to a " + This error message is incorrect, it should be about 'c' being a complex type. http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/KuduModifyImpl.java File fe/src/main/java/org/apache/impala/analysis/KuduModifyImpl.java: http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/KuduModifyImpl.java@40 PS3, Line 40: createSourceStmt I can't find the 'createSourceStmt()' that sets these variables. Aren't they set in 'buildAndValidateSelectExprs()'? http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/KuduModifyImpl.java@51 PS3, Line 51: partitionKeyExprs_ Is this set anywhere? http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/KuduModifyImpl.java@110 PS3, Line 110: // TODO(Kudu) Add test for this code-path when Kudu supports nested types Maybe we should open a Jira for this? Even if Kudu supports nested types in the future, Impala doesn't support complex types in Iceberg, right? Shouldn't we return an error? http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/ModifyImpl.java File fe/src/main/java/org/apache/impala/analysis/ModifyImpl.java: http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/ModifyImpl.java@69 PS3, Line 69: sortExprs_ I can't see that it is set in createSourceStmt(), or anywhere else in this class. It is set in {IcebergDeleteImpl,IcebergUpdateImpl}.buildAndValidateSelectExprs(). I can't see it used in KuduModifyImpl either. We could consider moving it to IcebergModifyImpl (or the two leaf classes) and making getSortExprs() abstract, or have it return an empty list by default. http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/ModifyImpl.java@101 PS3, Line 101: sourceStmt_.analyze(analyzer); Is 'sourceStmt_.analyze(analyzer)' moved here because 'source Stmt_' should already be analyzed when addCastsToAssignmentsInSourceStmt() is called? http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/ModifyImpl.java@124 PS3, Line 124: Superfluous spaces, we have an indentation of 4 before continuation lines. http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/planner/IcebergDeleteSink.java File fe/src/main/java/org/apache/impala/planner/IcebergDeleteSink.java: http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/planner/IcebergDeleteSink.java@47 PS3, Line 47: = 0 It is set in the constructor, I don't think it needs to be set here too. Also, it could be final. http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java File fe/src/main/java/org/apache/impala/planner/MultiDataSink.java: http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java@35 PS3, Line 35: * subset of columns than other data sinks. We could mention UPDATE as an example. http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java@41 PS3, Line 41: maxNumberOfSinks_ If it is the max number of MULTI sinks, the name should be 'maxNumberOfMultiSinks_'. http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java@50 PS3, Line 50: public void setFragment(PlanFragment fragment) { We could add a comment that this should be called after all child sinks have been added. http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java@129 PS3, Line 129: */ Shouldn't we mention 'maxNumberOfSinks_' here? Or is it connected to MAX_HDFS_WRITER? http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java@134 PS3, Line 134: // planned to run then, then the instances will be distributed evenly across them. "there are more nodes than instances" - can it only happen if 'maxNumberOfSinks_' is set? If not, the check is not enough. http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java@143 PS3, Line 143: * and has an upper limit set by the MAX_HDFS_WRITER query option. Shouldn't we mention 'maxNumberOfSinks_' here? Or is it connected to MAX_HDFS_WRITER? http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/service/Frontend.java File fe/src/main/java/org/apache/impala/service/Frontend.java: http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/service/Frontend.java@2613 PS3, Line 2613: TIcebergDmlFinalizeParams iceFinalizeParams = new TIcebergDmlFinalizeParams(); >From this line the function is the same as the corresponding part of >addFinalizationParamsForDelete(), except for "iceFinalizeParams.operation". >This could be extracted into a function where 'operation' is a parameter. http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/util/IcebergUtil.java File fe/src/main/java/org/apache/impala/util/IcebergUtil.java: http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/util/IcebergUtil.java@1134 PS3, Line 1134: String tableName = targetTable.getFullName(); Nit: 'tableName' could come after 1135. http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-update.test File testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-update.test: http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-update.test@1 PS3, Line 1: UPDATE iceberg_v2_no_deletes set s = concat(s,s) where i = 3 Shouldn't we also check that the updates have really been applied? Or is that the responsibility of the other tests? http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test File testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test: http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test@176 PS3, Line 176: update ice_alltypes set bigint_col = 33, int_col = 3, string_col = 'three'; Is this query intentionally duplicated? http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test@200 PS3, Line 200: update ice_alltypes set bigint_col=bi, string_col=s, date_col=d from ice_alltypes, ref_table where int_col = i; What would/should happen if in the join, for a row in ice_alltypes, there were multiple matches from the other table? Is it an error? http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test@270 PS3, Line 270: ice_time_transforms Nit: this could be 'ice_time_transforms_t[ime]s[tamp]' to have the same logic as 'ice_time_transforms_date'. http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking.test File testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking.test: http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking.test@750 PS3, Line 750: Updates This is a delete query, not an update. http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking_and_row_filtering.test File testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking_and_row_filtering.test: http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking_and_row_filtering.test@189 PS3, Line 189: Updates This is a delete query, not an update. http://gerrit.cloudera.org:8080/#/c/20677/3/tests/query_test/test_iceberg.py File tests/query_test/test_iceberg.py: http://gerrit.cloudera.org:8080/#/c/20677/3/tests/query_test/test_iceberg.py@1193 PS3, Line 1193: if Instead of an IF, this could be "return vector.get_value('exec_option')['disable_optimized_iceberg_v2_read'] == 0". http://gerrit.cloudera.org:8080/#/c/20677/3/tests/stress/test_update_stress.py File tests/stress/test_update_stress.py: http://gerrit.cloudera.org:8080/#/c/20677/3/tests/stress/test_update_stress.py@96 PS3, Line 96: updates Nit could be 'updates_per_col'. http://gerrit.cloudera.org:8080/#/c/20677/3/tests/stress/test_update_stress.py@97 PS3, Line 97: 90 This could be written as 'updates[_per_col]' * 3, where 3 is the number of columns. -- To view, visit http://gerrit.cloudera.org:8080/20677 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08 Gerrit-Change-Number: 20677 Gerrit-PatchSet: 3 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Andrew Sherman <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Gabor Kaszab <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]> Gerrit-Reviewer: Tamas Mate <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Thu, 30 Nov 2023 13:31:34 +0000 Gerrit-HasComments: Yes
