Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
......................................................................


Patch Set 3:

(50 comments)

http://gerrit.cloudera.org:8080/#/c/20677/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20677/3//COMMIT_MSG@10
PS3, Line 10: any of
            : the followings are
Nit: "any of the following is".


http://gerrit.cloudera.org:8080/#/c/20677/3//COMMIT_MSG@46
PS3, Line 46:   Insert sink (i, s, 5)
Just an idea: How difficult / expensive would it be to check the rows we update 
whether the new row value differs from the old one? We could skip rows that 
already have the desired value.

Even if it's worth the effort it should probably be in a different change.


http://gerrit.cloudera.org:8080/#/c/20677/3//COMMIT_MSG@66
PS3, Line 66: Why this patch has
Nit: "Why does this patch have..." would be better.


http://gerrit.cloudera.org:8080/#/c/20677/3//COMMIT_MSG@80
PS3, Line 80: smaller problem
            :    then inefficient data files
Why is it a smaller problem? Is it because we expect that there is more data in 
data files compared to delete files?


http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/iceberg-delete-sink.cc
File be/src/exec/iceberg-delete-sink.cc:

http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/iceberg-delete-sink.cc@117
PS3, Line 117: existing_partition
Not modified in this change, but does this variable do anything?


http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/iceberg-delete-sink.cc@137
PS3, Line 137: dml_exec_state_
Does 'dml_exec_state_' need to be the same as 'state->dml_exec_state()', which 
was used before this patch? If so, we could add a DCHECK.


http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/multi-table-sink.h
File be/src/exec/multi-table-sink.h:

http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/multi-table-sink.h@69
PS3, Line 69:   /// END.
I'd consider writing the explanatory comment of BEGIN also here ("Following 
methods just delegate...") because in case other methods are added it could 
become difficult to understand what END refers to.


http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/multi-table-sink.cc
File be/src/exec/multi-table-sink.cc:

http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/multi-table-sink.cc@52
PS3, Line 52: (TableSinkBase*)
Instead of a C-style cast we could use static_cast, maybe with a DCHECK 
containing a dynamic cast.


http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/multi-table-sink.cc@94
PS3, Line 94:   closed_ = true;
Shouldn't DataSink::Close() already set 'closed_' to true? We could also 
replace it with DCHECK(closed_).


http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/table-sink-base.cc
File be/src/exec/table-sink-base.cc:

http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/table-sink-base.cc@419
PS3, Line 419:   if (dml_exec_state == nullptr) dml_exec_state = 
state->dml_exec_state();
When is it possible that 'dml_exec_state' is not null and is not the same as 
'state->dml_exec_state()'? For example, is it possible if 'is_delete' is false?


http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/table-sink-base.cc@430
PS3, Line 430: IsIceberg()
If 'is_delete' is true, shouldn't this always be true also? We could add a 
DCHECK for it.


http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/runtime/dml-exec-state.cc
File be/src/runtime/dml-exec-state.cc:

http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/runtime/dml-exec-state.cc@483
PS3, Line 483: ASDF
What does ASDF mean?


http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/runtime/dml-exec-state.cc@565
PS3, Line 565: is_iceberg
Shouldn't 'is_iceberg' always be true for delete files?


http://gerrit.cloudera.org:8080/#/c/20677/3/common/protobuf/control_service.proto
File common/protobuf/control_service.proto:

http://gerrit.cloudera.org:8080/#/c/20677/3/common/protobuf/control_service.proto@95
PS3, Line 95: 8
Did you intentionally skip 7?


http://gerrit.cloudera.org:8080/#/c/20677/3/common/thrift/DataSinks.thrift
File common/thrift/DataSinks.thrift:

http://gerrit.cloudera.org:8080/#/c/20677/3/common/thrift/DataSinks.thrift@187
PS3, Line 187:   10: optional list<TDataSink> data_sinks
Maybe 'child_data_sinks' would be better.
Could also add a comment that it is used for the children of MULTI_DATA_SINKs.


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java
File fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java:

http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java@65
PS3, Line 65:   private final Map<FeTable, Integer> additionalTargetTableIds_ = 
new HashMap<>();
Shouldn't this come directly after 'targetTable_'?


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/IcebergDeleteImpl.java
File fe/src/main/java/org/apache/impala/analysis/IcebergDeleteImpl.java:

http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/IcebergDeleteImpl.java@20
PS3, Line 20: import java.util.HashMap;
            : import java.util.HashSet;
Unused imports.


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/IcebergDeleteImpl.java@23
PS3, Line 23: import java.util.Map;
            : import java.util.Set;
            : import java.util.stream.Collectors;
Unused imports.


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/IcebergDeleteImpl.java@27
PS3, Line 27: import org.apache.impala.catalog.Column;
Unused import.


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java
File fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java:

http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java@36
PS3, Line 36: createSourceStmt
I can't find the 'createSourceStmt()' that sets these variables. Aren't they 
set in 'buildAndValidateSelectExprs()'?


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java@53
PS3, Line 53:     // Make the virtual position delete table the new target 
table.
This comment seems stale, the following line was deleted.


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java
File fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java:

http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@48
PS3, Line 48:   private int deleteTableId_ = -1;
Optional: could add a comment that it is set in 'analyze()'.


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@51
PS3, Line 51: createSourceStmt
I can't find the 'createSourceStmt()' that sets these variables. Aren't they 
set in 'buildAndValidateSelectExprs()'?


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@75
PS3, Line 75: delete
"update mode"?


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@91
PS3, Line 91: being
I don't think "being" is needed here.


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@123
PS3, Line 123:             String.format("Left-hand side in assignment '%s = 
%s' refers to a " +
This error message is incorrect, it should be about 'c' being a complex type.


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/KuduModifyImpl.java
File fe/src/main/java/org/apache/impala/analysis/KuduModifyImpl.java:

http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/KuduModifyImpl.java@40
PS3, Line 40: createSourceStmt
I can't find the 'createSourceStmt()' that sets these variables. Aren't they 
set in 'buildAndValidateSelectExprs()'?


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/KuduModifyImpl.java@51
PS3, Line 51: partitionKeyExprs_
Is this set anywhere?


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/KuduModifyImpl.java@110
PS3, Line 110:       // TODO(Kudu) Add test for this code-path when Kudu 
supports nested types
Maybe we should open a Jira for this? Even if Kudu supports nested types in the 
future, Impala doesn't support complex types in Iceberg, right? Shouldn't we 
return an error?


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/ModifyImpl.java
File fe/src/main/java/org/apache/impala/analysis/ModifyImpl.java:

http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/ModifyImpl.java@69
PS3, Line 69: sortExprs_
I can't see that it is set in createSourceStmt(), or anywhere else in this 
class. It is set in 
{IcebergDeleteImpl,IcebergUpdateImpl}.buildAndValidateSelectExprs(). I can't 
see it used in KuduModifyImpl either.

We could consider moving it to IcebergModifyImpl (or the two leaf classes) and 
making getSortExprs() abstract, or have it return an empty list by default.


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/ModifyImpl.java@101
PS3, Line 101:       sourceStmt_.analyze(analyzer);
Is 'sourceStmt_.analyze(analyzer)' moved here because 'source Stmt_' should 
already be analyzed when addCastsToAssignmentsInSourceStmt() is called?


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/ModifyImpl.java@124
PS3, Line 124:
Superfluous spaces, we have an indentation of 4 before continuation lines.


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/planner/IcebergDeleteSink.java
File fe/src/main/java/org/apache/impala/planner/IcebergDeleteSink.java:

http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/planner/IcebergDeleteSink.java@47
PS3, Line 47:  = 0
It is set in the constructor, I don't think it needs to be set here too.
Also, it could be final.


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java
File fe/src/main/java/org/apache/impala/planner/MultiDataSink.java:

http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java@35
PS3, Line 35:  * subset of columns than other data sinks.
We could mention UPDATE as an example.


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java@41
PS3, Line 41: maxNumberOfSinks_
If it is the max number of MULTI sinks, the name should be 
'maxNumberOfMultiSinks_'.


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java@50
PS3, Line 50:   public void setFragment(PlanFragment fragment) {
We could add a comment that this should be called after all child sinks have 
been added.


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java@129
PS3, Line 129:    */
Shouldn't we mention 'maxNumberOfSinks_' here? Or is it connected to 
MAX_HDFS_WRITER?


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java@134
PS3, Line 134:       // planned to run then, then the instances will be 
distributed evenly across them.
"there are more nodes than instances" - can it only happen if 
'maxNumberOfSinks_' is set? If not, the check is not enough.


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java@143
PS3, Line 143:    * and has an upper limit set by the MAX_HDFS_WRITER query 
option.
Shouldn't we mention 'maxNumberOfSinks_' here? Or is it connected to 
MAX_HDFS_WRITER?


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/service/Frontend.java
File fe/src/main/java/org/apache/impala/service/Frontend.java:

http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/service/Frontend.java@2613
PS3, Line 2613:     TIcebergDmlFinalizeParams iceFinalizeParams = new 
TIcebergDmlFinalizeParams();
>From this line the function is the same as the corresponding part of 
>addFinalizationParamsForDelete(), except for "iceFinalizeParams.operation". 
>This could be extracted into a function where 'operation' is a parameter.


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/util/IcebergUtil.java
File fe/src/main/java/org/apache/impala/util/IcebergUtil.java:

http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/util/IcebergUtil.java@1134
PS3, Line 1134:     String tableName = targetTable.getFullName();
Nit: 'tableName' could come after 1135.


http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-update.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-update.test:

http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-update.test@1
PS3, Line 1: UPDATE iceberg_v2_no_deletes set s = concat(s,s) where i = 3
Shouldn't we also check that the updates have really been applied? Or is that 
the responsibility of the other tests?


http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test
File 
testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test:

http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test@176
PS3, Line 176: update ice_alltypes set bigint_col = 33, int_col = 3, string_col 
= 'three';
Is this query intentionally duplicated?


http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test@200
PS3, Line 200: update ice_alltypes set bigint_col=bi, string_col=s, date_col=d 
from ice_alltypes, ref_table where int_col = i;
What would/should happen if in the join, for a row in ice_alltypes, there were 
multiple matches from the other table? Is it an error?


http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test@270
PS3, Line 270: ice_time_transforms
Nit: this could be 'ice_time_transforms_t[ime]s[tamp]' to have the same logic 
as 'ice_time_transforms_date'.


http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking.test
File 
testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking.test:

http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking.test@750
PS3, Line 750: Updates
This is a delete query, not an update.


http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking_and_row_filtering.test
File 
testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking_and_row_filtering.test:

http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking_and_row_filtering.test@189
PS3, Line 189: Updates
This is a delete query, not an update.


http://gerrit.cloudera.org:8080/#/c/20677/3/tests/query_test/test_iceberg.py
File tests/query_test/test_iceberg.py:

http://gerrit.cloudera.org:8080/#/c/20677/3/tests/query_test/test_iceberg.py@1193
PS3, Line 1193: if
Instead of an IF, this could be "return 
vector.get_value('exec_option')['disable_optimized_iceberg_v2_read'] == 0".


http://gerrit.cloudera.org:8080/#/c/20677/3/tests/stress/test_update_stress.py
File tests/stress/test_update_stress.py:

http://gerrit.cloudera.org:8080/#/c/20677/3/tests/stress/test_update_stress.py@96
PS3, Line 96: updates
Nit could be 'updates_per_col'.


http://gerrit.cloudera.org:8080/#/c/20677/3/tests/stress/test_update_stress.py@97
PS3, Line 97: 90
This could be written as 'updates[_per_col]' * 3, where 3 is the number of 
columns.



--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Andrew Sherman <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]>
Gerrit-Reviewer: Tamas Mate <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Thu, 30 Nov 2023 13:31:34 +0000
Gerrit-HasComments: Yes

Reply via email to