Daniel Vanko has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/24137 )

Change subject: IMPALA-13810: Generate lineage records for UPDATE and MERGE 
statements
......................................................................


Patch Set 6:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/24137/5/fe/src/main/java/org/apache/impala/analysis/ColumnLineageGraph.java
File fe/src/main/java/org/apache/impala/analysis/ColumnLineageGraph.java:

http://gerrit.cloudera.org:8080/#/c/24137/5/fe/src/main/java/org/apache/impala/analysis/ColumnLineageGraph.java@482
PS5, Line 482: String.valueOf(tableName_))
> nit: String.valueOf(tableName_)
Done


http://gerrit.cloudera.org:8080/#/c/24137/5/fe/src/main/java/org/apache/impala/planner/IcebergBufferedDeleteSink.java
File fe/src/main/java/org/apache/impala/planner/IcebergBufferedDeleteSink.java:

http://gerrit.cloudera.org:8080/#/c/24137/5/fe/src/main/java/org/apache/impala/planner/IcebergBufferedDeleteSink.java@130
PS5, Line 130: collectExprs
> If this method is only used for lineage, then it should be renamed to somet
Done


http://gerrit.cloudera.org:8080/#/c/24137/4/fe/src/main/java/org/apache/impala/planner/Planner.java
File fe/src/main/java/org/apache/impala/planner/Planner.java:

http://gerrit.cloudera.org:8080/#/c/24137/4/fe/src/main/java/org/apache/impala/planner/Planner.java@297
PS4, Line 297: se {
> Instead of adding all columns, can we only add the target columns, just lik
Done


http://gerrit.cloudera.org:8080/#/c/24137/4/fe/src/main/java/org/apache/impala/planner/Planner.java@333
PS4, Line 333:
> Can these have different sizes? If not, can we precondition check?
No, I don't think they can have, changed to a precondition check.


http://gerrit.cloudera.org:8080/#/c/24137/5/fe/src/main/java/org/apache/impala/planner/Planner.java
File fe/src/main/java/org/apache/impala/planner/Planner.java:

http://gerrit.cloudera.org:8080/#/c/24137/5/fe/src/main/java/org/apache/impala/planner/Planner.java@232
PS5, Line 232:     ctx_.getRootAnalyzer().logCacheStats();
> is it possible to move out this whole section as a separate method?
Done


http://gerrit.cloudera.org:8080/#/c/24137/5/fe/src/main/java/org/apache/impala/planner/Planner.java@322
PS5, Line 322: or non-Kudu tables (e.g., Iceberg), extract target columns f
> this could be a constant for MERGE
Done


http://gerrit.cloudera.org:8080/#/c/24137/4/testdata/workloads/functional-query/queries/QueryTest/lineage-update-merge.test
File 
testdata/workloads/functional-query/queries/QueryTest/lineage-update-merge.test:

http://gerrit.cloudera.org:8080/#/c/24137/4/testdata/workloads/functional-query/queries/QueryTest/lineage-update-merge.test@1044
PS4, Line 1044:         "tableName": 
"functional_parquet.iceberg_non_partitioned",
> Can we have more MERGE tests with
Done


http://gerrit.cloudera.org:8080/#/c/24137/4/testdata/workloads/functional-query/queries/QueryTest/lineage-update-merge.test@1448
PS4, Line 1448:         "tableName": 
"lineage_dml_test_db.iceberg_non_partitioned",
> Impala also has UPDATE FROM statements, e.g.:
Thanks, added those as well, for Kudu and Iceberg also.


http://gerrit.cloudera.org:8080/#/c/24137/5/testdata/workloads/functional-query/queries/QueryTest/lineage-update-merge.test
File 
testdata/workloads/functional-query/queries/QueryTest/lineage-update-merge.test:

http://gerrit.cloudera.org:8080/#/c/24137/5/testdata/workloads/functional-query/queries/QueryTest/lineage-update-merge.test@1
PS5, Line 1: ====
           : ---- QUERY
           : drop database if exists lineage_dml_test_db cascade
           : ====
           : ---- QUERY
           : create database lineage_dml_test_db;
           : ====
> Using unique_database on run_test_case could solve this automatically, alth
Yeah, I think it's easier to handle and maintain like this, without the 
$DATABASE injections.


http://gerrit.cloudera.org:8080/#/c/24137/5/testdata/workloads/functional-query/queries/QueryTest/lineage-update-merge.test@157
PS5, Line 157: "operationType": "UPDATE"
> For the UPDATE queries, conceptually the source and destination tables corr
Not really, I fixed it, so we have only one table.


http://gerrit.cloudera.org:8080/#/c/24137/5/testdata/workloads/functional-query/queries/QueryTest/lineage-update-merge.test@198
PS5, Line 198:     "tableName": "lineage_dml_test_db.iceberg_non_partitioned",
             :         "tableType": "iceberg",
             :         "tableCreateTime": 1775747129
             :       }
             :     }
             :   ]
             : }
             : ---- QUERY
             : # test update statements on Iceberg tables
             : update lineage_dml_test_db.iceberg_non_partitioned set user = 
"Dani" where action = "click"
             : ====
             : ---- LINEAGE
             : {
> This is just for my own understanding since I do not know enough about the
This is an optimisation for update statements. We also consider the user 
column, because if it's already equals "Dani" in this case, we won't modify 
that row. I think technically we add a "where user != "Dani"" predicate that's 
why it will appear as a source as well.


http://gerrit.cloudera.org:8080/#/c/24137/5/testdata/workloads/functional-query/queries/QueryTest/lineage-update-merge.test@582
PS5, Line 582:     "edgeType": "PROJECTION"
> In this query, we have one source table, and one destination table. Thus a
Deduplicated those by using fully qualified column names.


http://gerrit.cloudera.org:8080/#/c/24137/5/testdata/workloads/functional-query/queries/QueryTest/lineage-update-merge.test@1448
PS5, Line 1448:       "tableName": 
"lineage_dml_test_db.iceberg_non_partitioned",
              :         "tableType": "icebe
> It looks like for the UPDATE queries against Kudu tables, we do not list al
Updated the Iceberg part to be consistent about not listing all the columns, 
only those which are referenced.


http://gerrit.cloudera.org:8080/#/c/24137/5/tests/custom_cluster/test_lineage.py
File tests/custom_cluster/test_lineage.py:

http://gerrit.cloudera.org:8080/#/c/24137/5/tests/custom_cluster/test_lineage.py@196
PS5, Line 196: @SkipIfFS.hbase
> Is this needed?
Well, I copied this from the test_lineage_output() and it was originally added 
by https://gerrit.cloudera.org/c/14230/
I'll try to test if this is needed.



--
To view, visit http://gerrit.cloudera.org:8080/24137
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icba79f756509438455bfbf3067733f6f29284220
Gerrit-Change-Number: 24137
Gerrit-PatchSet: 6
Gerrit-Owner: Daniel Vanko <[email protected]>
Gerrit-Reviewer: Daniel Vanko <[email protected]>
Gerrit-Reviewer: Fang-Yu Rao <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Peter Rozsa <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Mon, 08 Jun 2026 09:44:33 +0000
Gerrit-HasComments: Yes

Reply via email to