Hello Gabor Kaszab, Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/20711

to look at the new patch set (#2).

Change subject: IMPALA-12557: DELETE throws DateTimeParseException when 
deleting from time-partitioned table
......................................................................

IMPALA-12557: DELETE throws DateTimeParseException when deleting from 
time-partitioned table

There's a bug in IcebergDeleteSink that prevents Impala from
successfully executing a DELETE operation on time-partitioned Iceberg
tables. During DELETE we retrieve partition values from the virtual
column ICEBERG__PARTITION__SERIALIZED. This contains the transformed
values, e.g. in case of DAY-partitioning it contains the number of days
since the UNIX epoch.

Currently IcebergDeleteSink just uses these values as they are.
There are two problems with this. First, we want to place the delete
files under human-readable partition directories like other engines
do, and like our own INSERT statement does. I.e. we want a partition
directory /ts_day=2023-11-11/ and not /ts_day=19672/. The other problem
is that 'IcebergUtil.partitionDataFromDataFile()' also expects the
human-readable representation. This could be resolved on the CatalogD
side by accepting the integer values, but then we would still need
some logic in the IcebergDeleteSink to generate the human-readable
values for the file paths.

Moreover, partition values from INSERT statements are also
received in the human-readable representation at the Catalog.

This patch fixes the error by adding functions that transform the
partition values to their human-readable representations. This is
done in the IcebergDeleteSink, so the Catalog-side logic is not
affected.

The above only affects the time-based transforms (YEAR, MONTH, DAY,
HOUR), as other partition transform values don't use different
representations.
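
To illustrate the conversion described above, here is a minimal,
self-contained sketch. It is not the code in this patch: the function
names are hypothetical, and the output formats (yyyy, yyyy-MM,
yyyy-MM-dd, yyyy-MM-dd-HH) are assumed to match what Iceberg uses for
partition directory names. The day-to-date step uses the well-known
civil_from_days algorithm by Howard Hinnant.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helpers, for illustration only (not the patch's actual API).

struct CivilDate {
  int64_t year;
  unsigned month;  // 1..12
  unsigned day;    // 1..31
};

// Floor division and non-negative modulo, so offsets before 1970 also work.
static int64_t FloorDiv(int64_t a, int64_t b) {
  int64_t q = a / b;
  return (a % b != 0 && ((a < 0) != (b < 0))) ? q - 1 : q;
}
static int64_t FloorMod(int64_t a, int64_t b) { return a - FloorDiv(a, b) * b; }

// Converts days since 1970-01-01 to a proleptic Gregorian date
// (civil_from_days by Howard Hinnant).
static CivilDate CivilFromDays(int64_t z) {
  z += 719468;  // shift the epoch to 0000-03-01
  const int64_t era = (z >= 0 ? z : z - 146096) / 146097;
  const unsigned doe = static_cast<unsigned>(z - era * 146097);  // [0, 146096]
  const unsigned yoe = (doe - doe / 1460 + doe / 36524 - doe / 146096) / 365;
  const unsigned doy = doe - (365 * yoe + yoe / 4 - yoe / 100);  // [0, 365]
  const unsigned mp = (5 * doy + 2) / 153;                       // [0, 11]
  const unsigned day = doy - (153 * mp + 2) / 5 + 1;
  const unsigned month = mp < 10 ? mp + 3 : mp - 9;
  const int64_t year = static_cast<int64_t>(yoe) + era * 400 + (month <= 2 ? 1 : 0);
  return {year, month, day};
}

// YEAR transform: the value is the number of years since 1970.
std::string HumanReadableYear(int32_t years) {
  char buf[32];
  snprintf(buf, sizeof(buf), "%04lld", 1970LL + years);
  return buf;
}

// MONTH transform: the value is the number of months since 1970-01.
std::string HumanReadableMonth(int32_t months) {
  char buf[32];
  snprintf(buf, sizeof(buf), "%04lld-%02u",
      static_cast<long long>(1970 + FloorDiv(months, 12)),
      static_cast<unsigned>(FloorMod(months, 12)) + 1);
  return buf;
}

// DAY transform: the value is the number of days since 1970-01-01.
std::string HumanReadableDay(int32_t days) {
  const CivilDate d = CivilFromDays(days);
  char buf[32];
  snprintf(buf, sizeof(buf), "%04lld-%02u-%02u",
      static_cast<long long>(d.year), d.month, d.day);
  return buf;
}

// HOUR transform: the value is the number of hours since 1970-01-01 00:00.
std::string HumanReadableHour(int32_t hours) {
  const CivilDate d = CivilFromDays(FloorDiv(hours, 24));
  char buf[32];
  snprintf(buf, sizeof(buf), "%04lld-%02u-%02u-%02u",
      static_cast<long long>(d.year), d.month, d.day,
      static_cast<unsigned>(FloorMod(hours, 24)));
  return buf;
}
```

With these helpers, HumanReadableDay(19672) yields "2023-11-11", which
corresponds to the /ts_day=2023-11-11/ directory mentioned earlier.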

Some notes on HOUR transform and daylight saving time:
There is no 1:1 mapping between an offset and the human-readable
representation in a timezone that has daylight saving time. This is not
an issue, as Impala's TIMESTAMP type is timezone-less. This also won't
be an issue for the TIMESTAMPTZ type as timestamp values are normalized
to UTC when stored, and UTC doesn't have daylight saving time.

Testing:
 * C++ backend tests
 * E2E tests for all time-based transforms, and also partition evolution
 * Also added an extra test for TRUNCATEing numeric values, which was
   previously untested

Change-Id: I1cfeaed6409289663eb0f65b1ee2ecebd93e6118
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/iceberg-delete-sink.cc
M be/src/exec/iceberg-delete-sink.h
M be/src/exec/table-sink-base.cc
M be/src/exec/table-sink-base.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/util/CMakeLists.txt
A be/src/util/iceberg-utility-functions-test.cc
A be/src/util/iceberg-utility-functions.cc
A be/src/util/iceberg-utility-functions.h
M testdata/workloads/functional-query/queries/QueryTest/iceberg-delete-partitioned.test
13 files changed, 690 insertions(+), 26 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/20711/2
--
To view, visit http://gerrit.cloudera.org:8080/20711
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1cfeaed6409289663eb0f65b1ee2ecebd93e6118
Gerrit-Change-Number: 20711
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
