Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/20951 )
Change subject: IMPALA-12598: Allow multiple equality field id lists for Iceberg tables ...................................................................... Patch Set 4: (12 comments) Thanks for taking a look Tamas and Zoli! About your question, Tamas: the eq IDs were stored on a ContentFileStore level when there was this restriction that all the eq-delete files have to have the same eq ID lists. However, with this change I had to move them from the ContentFileStore to a per FileDescriptor level, and that is stored in FB. So it's not a question of Thrift vs FB rather on what abstraction level we want to store these IDs. http://gerrit.cloudera.org:8080/#/c/20951/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/20951/3//COMMIT_MSG@7 PS3, Line 7: d > nit: d Done http://gerrit.cloudera.org:8080/#/c/20951/3//COMMIT_MSG@9 PS3, Line 9: hav > nit: have Done http://gerrit.cloudera.org:8080/#/c/20951/3/common/fbs/IcebergObjects.fbs File common/fbs/IcebergObjects.fbs: http://gerrit.cloudera.org:8080/#/c/20951/3/common/fbs/IcebergObjects.fbs@53 PS3, Line 53: equality_field_ids > I saw this in multiple places, the new term became equality_fiel_ids, in ma Flink refers to these IDs as equality field IDs, while the Iceberg spec as equality IDs. Also in the vxmetadata.json in the Iceberg metadata this is called 'identifier-field-ids' so even Iceberg is not consistent with the naming. I think they can be used as synonyms, and I believe I spelled the whole 'equalityFiledID' out in most of the cases, while where I found a variable name too long I decided to omit the 'field' part. I feel that both naming are straightforward to understand. http://gerrit.cloudera.org:8080/#/c/20951/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java: http://gerrit.cloudera.org:8080/#/c/20951/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@126 PS3, Line 126: private Set<Integer> allEqualityFieldIds_ = new HashSet<>(); : private Map<List<Integer>, Set<FileDescriptor>> equalityIdsToDeleteFiles_ = : new HashMap<>(); > nit: Here we use both equality ids and equality field ids. well, I found equalityFieldIdsToDeleteFiles_ too long for a member name so decided to shorten it a bit. It's still easy to understand what we mean by equalityIds I think. http://gerrit.cloudera.org:8080/#/c/20951/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@136 PS3, Line 136: field > typo: field Done http://gerrit.cloudera.org:8080/#/c/20951/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@510 PS3, Line 510: > nit: D Done http://gerrit.cloudera.org:8080/#/c/20951/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@520 PS3, Line 520: > This way [11, 12] would come earlier than [2, 3]. It shouldn't make much di With this ordering I figured that the plan for a query should be deterministic and we shouldn't see a change in the order of the JOINs for the eq-deletes. For planner tests this would eliminate flakiness for example. So [11] coming before [2] isn't really an issue, only looks strange. Anyway, you're right, a custom Comparator makes sense here and then I won't need to do these string conversions back and forth. http://gerrit.cloudera.org:8080/#/c/20951/3/testdata/data/README File testdata/data/README: http://gerrit.cloudera.org:8080/#/c/20951/3/testdata/data/README@884 PS3, Line 884: 4-Impala: > You also mention NiFi in the commit message, but here I don't see any refer When I created these table Nifi's PutIcebergCDC processor wrote all the columns into the eq-delete files so for these tests I only had Flink. However, since then there is a new version of the mentioned Nifi processor that writes onlt the identifier fields (PKs) into the eq-deletes so I can rewrite now one of the tables to use Nifi as the writer. http://gerrit.cloudera.org:8080/#/c/20951/3/testdata/datasets/functional/functional_schema_template.sql File testdata/datasets/functional/functional_schema_template.sql: http://gerrit.cloudera.org:8080/#/c/20951/3/testdata/datasets/functional/functional_schema_template.sql@3643 PS3, Line 3643: > Do we need this? This is the default for Impala. I don't think this is needed indeed. Some other tables has this as well, removed those too. http://gerrit.cloudera.org:8080/#/c/20951/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test File testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test: http://gerrit.cloudera.org:8080/#/c/20951/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test@1396 PS3, Line 1396: ns=1/1 files= > All EQ delete group has cardinality=2, which means we don't have test for t Reworked iceberg_v2_delete_equality_multi_eq_ids to serve this purpose, also added a time travel for this table. http://gerrit.cloudera.org:8080/#/c/20951/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test@1543 PS3, Line 1543: | > Would be nice to see some time travel plans. Done http://gerrit.cloudera.org:8080/#/c/20951/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-v2-read-equality-deletes.test File testdata/workloads/functional-query/queries/QueryTest/iceberg-v2-read-equality-deletes.test: http://gerrit.cloudera.org:8080/#/c/20951/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-v2-read-equality-deletes.test@114 PS3, Line 114: g_v2_ > nit: field Wow, how many mistakes I've made in this single sentence :) -- To view, visit http://gerrit.cloudera.org:8080/20951 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3e52d7a5800bf1b479f0c234679be92442d09f79 Gerrit-Change-Number: 20951 Gerrit-PatchSet: 4 Gerrit-Owner: Gabor Kaszab <gaborkas...@cloudera.com> Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Tamas Mate <tma...@apache.org> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Thu, 08 Feb 2024 15:09:47 +0000 Gerrit-HasComments: Yes