Zoltan Borok-Nagy has uploaded this change for review. ( http://gerrit.cloudera.org:8080/20711 )

Change subject: IMPALA-12557: DELETE throws DateTimeParseException when deleting from time-partitioned table
......................................................................

IMPALA-12557: DELETE throws DateTimeParseException when deleting from time-partitioned table

There's a bug in IcebergDeleteSink that prevents Impala from
successfully executing a DELETE operation on Iceberg tables.

During DELETE we retrieve partition values from the virtual column
ICEBERG__PARTITION__SERIALIZED. This contains the transformed values,
e.g. in case of DAY-partitioning it contains the number of days since
the UNIX epoch. Currently IcebergDeleteSink just uses these values as
they are. There are two problems with this.

First, we want to place the delete files under human-readable
partition directories like other engines do, and like our own INSERT
statement does. I.e. we want a partition directory /ts_day=2023-11-11/
and not /ts_day=19672/.

The other problem is that 'IcebergUtil.partitionDataFromDataFile()'
also expects the human-readable representation. This could be resolved
at the CatalogD side to just accept the integer values, but then we
would still need some logic in the IcebergDeleteSink to generate the
human-readable values for the file paths. Moreover, partition values
from INSERT statements are also received in the human-readable
representation at the Catalog.

This patch fixes the error by adding functions that transform the
partition values to their human-readable representations. This is done
in the IcebergDeleteSink, so the Catalog-side logic is not affected.

The above only affects the time-based transforms (YEAR, MONTH, DAY,
HOUR), as other partition transform values don't use different
representations.

Some notes on HOUR transform and daylight saving time:
There is no 1:1 mapping between an offset and the human-readable
representation in a timezone that has daylight saving time. This is
not an issue, as Impala's TIMESTAMP type is timezone-less. This also
won't be an issue for the TIMESTAMPTZ type as timestamp values are
normalized to UTC when stored, and UTC doesn't have daylight saving
time.

Testing:
 * C++ backend tests
 * E2E tests for all time-based transforms, and also partition
   evolution
 * Also added an extra test about TRUNCATEing numeric values which
   was untested

Change-Id: I1cfeaed6409289663eb0f65b1ee2ecebd93e6118
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/iceberg-delete-sink.cc
M be/src/exec/iceberg-delete-sink.h
M be/src/exec/table-sink-base.cc
M be/src/exec/table-sink-base.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/util/CMakeLists.txt
A be/src/util/iceberg-utility-functions-test.cc
A be/src/util/iceberg-utility-functions.cc
A be/src/util/iceberg-utility-functions.h
M testdata/workloads/functional-query/queries/QueryTest/iceberg-delete-partitioned.test
13 files changed, 689 insertions(+), 26 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/20711/1
--
To view, visit http://gerrit.cloudera.org:8080/20711
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I1cfeaed6409289663eb0f65b1ee2ecebd93e6118
Gerrit-Change-Number: 20711
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
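
Illustration for the commit message above: a minimal sketch of how transformed partition values (offsets from the UNIX epoch) could be turned into human-readable strings. This is not the code added in be/src/util/iceberg-utility-functions.cc. The function names (FloorDiv, CivilFromDays, HumanReadableYear, HumanReadableMonth, HumanReadableDay, HumanReadableHour) are hypothetical, and the output formats (yyyy, yyyy-MM, yyyy-MM-dd, yyyy-MM-dd-HH) are assumed to match the partition directory names other engines write. All arithmetic is done in UTC, so daylight saving time never enters the picture.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// NOTE: every name below is hypothetical and for illustration only.

// Floor division, so that negative offsets (values before 1970) round down.
int64_t FloorDiv(int64_t a, int64_t b) {
  int64_t q = a / b;
  if ((a % b != 0) && ((a < 0) != (b < 0))) --q;
  return q;
}

struct CivilDate {
  int64_t year;
  int month;  // 1..12
  int day;    // 1..31
};

// Converts days since 1970-01-01 to a proleptic Gregorian date, using the
// well-known "civil from days" arithmetic over 400-year eras.
CivilDate CivilFromDays(int64_t days_since_epoch) {
  const int64_t z = days_since_epoch + 719468;  // shift epoch to 0000-03-01
  const int64_t era = FloorDiv(z, 146097);      // 146097 days per 400 years
  const int64_t doe = z - era * 146097;         // day of era
  const int64_t yoe =
      (doe - doe / 1460 + doe / 36524 - doe / 146096) / 365;  // year of era
  const int64_t doy = doe - (365 * yoe + yoe / 4 - yoe / 100);  // from Mar 1
  const int64_t mp = (5 * doy + 2) / 153;       // month index, 0 = March
  const int day = static_cast<int>(doy - (153 * mp + 2) / 5 + 1);
  const int month = static_cast<int>(mp < 10 ? mp + 3 : mp - 9);
  const int64_t year = yoe + era * 400 + (month <= 2 ? 1 : 0);
  return CivilDate{year, month, day};
}

// YEAR transform: years since 1970 -> "yyyy".
std::string HumanReadableYear(int64_t years_since_epoch) {
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%04lld",
                static_cast<long long>(1970 + years_since_epoch));
  return buf;
}

// MONTH transform: months since 1970-01 -> "yyyy-MM".
std::string HumanReadableMonth(int64_t months_since_epoch) {
  const int64_t year_offset = FloorDiv(months_since_epoch, 12);
  const int month = static_cast<int>(months_since_epoch - year_offset * 12) + 1;
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%04lld-%02d",
                static_cast<long long>(1970 + year_offset), month);
  return buf;
}

// DAY transform: days since 1970-01-01 -> "yyyy-MM-dd".
std::string HumanReadableDay(int64_t days_since_epoch) {
  const CivilDate d = CivilFromDays(days_since_epoch);
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%04lld-%02d-%02d",
                static_cast<long long>(d.year), d.month, d.day);
  return buf;
}

// HOUR transform: hours since 1970-01-01 00:00 UTC -> "yyyy-MM-dd-HH".
std::string HumanReadableHour(int64_t hours_since_epoch) {
  const int64_t days = FloorDiv(hours_since_epoch, 24);
  const int hour = static_cast<int>(hours_since_epoch - days * 24);
  char buf[48];
  std::snprintf(buf, sizeof(buf), "%s-%02d", HumanReadableDay(days).c_str(),
                hour);
  return buf;
}
```

With these helpers, HumanReadableDay(19672) yields "2023-11-11", which corresponds to the /ts_day=2023-11-11/ directory mentioned in the commit message. Floor division is used rather than plain integer division so that values before the epoch map to the correct earlier unit instead of rounding toward zero.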
