Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20951 )

Change subject: IMPALA-12598: Allow multiple equality field id lists for 
Iceberg tables
......................................................................


Patch Set 4:

(12 comments)

Thanks for taking a look Tamas and Zoli!

About your question, Tamas: the eq IDs were stored on a ContentFileStore level 
when there was this restriction that all the eq-delete files have to have the 
same eq ID lists. However, with this change I had to move them from the 
ContentFileStore to a per FileDescriptor level, and that is stored in FB. So 
it's not a question of Thrift vs FB rather on what abstraction level we want to 
store these IDs.

http://gerrit.cloudera.org:8080/#/c/20951/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20951/3//COMMIT_MSG@7
PS3, Line 7: d
> nit: d
Done


http://gerrit.cloudera.org:8080/#/c/20951/3//COMMIT_MSG@9
PS3, Line 9: hav
> nit: have
Done


http://gerrit.cloudera.org:8080/#/c/20951/3/common/fbs/IcebergObjects.fbs
File common/fbs/IcebergObjects.fbs:

http://gerrit.cloudera.org:8080/#/c/20951/3/common/fbs/IcebergObjects.fbs@53
PS3, Line 53: equality_field_ids
> I saw this in multiple places, the new term became equality_fiel_ids, in ma
Flink refers to these IDs as equality field IDs, while the Iceberg spec as 
equality IDs. Also in the vxmetadata.json in the Iceberg metadata this is 
called 'identifier-field-ids' so even Iceberg is not consistent with the naming.
I think they can be used as synonyms, and I believe I spelled the whole 
'equalityFiledID' out in most of the cases, while where I found a variable name 
too long I decided to omit the 'field' part. I feel that both naming are 
straightforward to understand.


http://gerrit.cloudera.org:8080/#/c/20951/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java:

http://gerrit.cloudera.org:8080/#/c/20951/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@126
PS3, Line 126:   private Set<Integer> allEqualityFieldIds_ = new HashSet<>();
             :   private Map<List<Integer>, Set<FileDescriptor>> 
equalityIdsToDeleteFiles_ =
             :       new HashMap<>();
> nit: Here we use both equality ids and equality field ids.
well, I found equalityFieldIdsToDeleteFiles_ too long for a member name so 
decided to shorten it a bit. It's still easy to understand what we mean by 
equalityIds I think.


http://gerrit.cloudera.org:8080/#/c/20951/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@136
PS3, Line 136: field
> typo: field
Done


http://gerrit.cloudera.org:8080/#/c/20951/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@510
PS3, Line 510:
> nit: D
Done


http://gerrit.cloudera.org:8080/#/c/20951/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@520
PS3, Line 520:
> This way [11, 12] would come earlier than [2, 3]. It shouldn't make much di
With this ordering I figured that the plan for a query should be deterministic 
and we shouldn't see a change in the order of the JOINs for the eq-deletes. For 
planner tests this would eliminate flakiness for example. So [11] coming before 
[2] isn't really an issue, only looks strange.

Anyway, you're right, a custom Comparator makes sense here and then I won't 
need to do these string conversions back and forth.


http://gerrit.cloudera.org:8080/#/c/20951/3/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/20951/3/testdata/data/README@884
PS3, Line 884: 4-Impala:
> You also mention NiFi in the commit message, but here I don't see any refer
When I created these table Nifi's PutIcebergCDC processor wrote all the columns 
into the eq-delete files so for these tests I only had Flink. However, since 
then there is a new version of the mentioned Nifi processor that writes onlt 
the identifier fields (PKs) into the eq-deletes so I can rewrite now one of the 
tables to use Nifi as the writer.


http://gerrit.cloudera.org:8080/#/c/20951/3/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/20951/3/testdata/datasets/functional/functional_schema_template.sql@3643
PS3, Line 3643:
> Do we need this? This is the default for Impala.
I don't think this is needed indeed. Some other tables has this as well, 
removed those too.


http://gerrit.cloudera.org:8080/#/c/20951/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test:

http://gerrit.cloudera.org:8080/#/c/20951/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test@1396
PS3, Line 1396: ns=1/1 files=
> All EQ delete group has cardinality=2, which means we don't have test for t
Reworked iceberg_v2_delete_equality_multi_eq_ids  to serve this purpose, also 
added a time travel for this table.


http://gerrit.cloudera.org:8080/#/c/20951/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test@1543
PS3, Line 1543: |
> Would be nice to see some time travel plans.
Done


http://gerrit.cloudera.org:8080/#/c/20951/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-v2-read-equality-deletes.test
File 
testdata/workloads/functional-query/queries/QueryTest/iceberg-v2-read-equality-deletes.test:

http://gerrit.cloudera.org:8080/#/c/20951/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-v2-read-equality-deletes.test@114
PS3, Line 114: g_v2_
> nit: field
Wow, how many mistakes I've made in this single sentence :)



--
To view, visit http://gerrit.cloudera.org:8080/20951
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3e52d7a5800bf1b479f0c234679be92442d09f79
Gerrit-Change-Number: 20951
Gerrit-PatchSet: 4
Gerrit-Owner: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tma...@apache.org>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Thu, 08 Feb 2024 15:09:47 +0000
Gerrit-HasComments: Yes

Reply via email to