zclllyybb commented on issue #63640:
URL: https://github.com/apache/doris/issues/63640#issuecomment-4535781333

   Initial triage:
   
   This looks like a real 4.0.x external-reader DATE decoding bug, not an 
Iceberg snapshot mismatch. The issue shows every Iceberg `DATE` value shifted 
exactly one day earlier while the `TIMESTAMP` column stays aligned with Spark. 
In the 4.0.5 and 4.0.2-rc02 code path, Iceberg Parquet files are scanned 
through `IcebergParquetReader`, which delegates the physical DATE conversion to 
the shared Parquet converter. That converter maps Parquet logical `DATE` to 
Doris `DATEV2`, but then applies a session-timezone-derived `offset_days` when 
converting the stored epoch-day integer:
   
   - `be/src/vec/exec/scan/file_scanner.cpp`: Iceberg Parquet ranges are wrapped by `IcebergParquetReader`.
   - `be/src/vec/exec/format/table/iceberg_reader.cpp`: `IcebergParquetReader` calls the shared `ParquetReader`.
   - `be/src/vec/exec/format/parquet/schema_desc.cpp`: Parquet logical `DATE` is mapped to `TYPE_DATEV2`.
   - `be/src/vec/exec/format/parquet/parquet_column_convert.h`: `ConvertParams::init()` derives `offset_days` from `from_unixtime(0, session_timezone)`, and `Int32ToDate` adds that offset to the Parquet DATE day count.
   
   For a west-of-UTC session time zone, `from_unixtime(0, tz)` is `1969-12-31 ...`,
so `offset_days = -1`; a stored Iceberg DATE day count of `0` (`1970-01-01`)
is then decoded as day `-1` (`1969-12-31`). That matches the reported
output exactly. A DATE is a logical calendar day and should not depend on the
query/session time zone.
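   
   The arithmetic above can be modelled in a few lines of Python. This is only a sketch of the described behaviour, not the Doris C++ code: the function names are hypothetical, and a fixed `-06:00` offset stands in for the session time zone.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH = date(1970, 1, 1)


def offset_days_for(tz):
    """Model of offset_days: the calendar day on which Unix time 0
    falls in `tz`, counted in days relative to 1970-01-01."""
    local_epoch = datetime.fromtimestamp(0, tz)
    return (local_epoch.date() - EPOCH).days


def decode_date_with_offset(stored_days, tz):
    """Model of the reported behaviour: the session offset is added
    to the stored Parquet DATE day count."""
    return EPOCH + timedelta(days=stored_days + offset_days_for(tz))


def decode_date_plain(stored_days):
    """Model of the expected behaviour: the day count is used as-is."""
    return EPOCH + timedelta(days=stored_days)


if __name__ == "__main__":
    west = timezone(timedelta(hours=-6))  # assumed stand-in for the session time zone
    print(offset_days_for(west))
    print(decode_date_with_offset(0, west))
    print(decode_date_plain(0))
```

   In this model the west-of-UTC zone yields an offset of `-1`, so day count `0` decodes to `1969-12-31` instead of `1970-01-01`, while UTC yields an offset of `0`.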
   
   There is already a closely related upstream fix on master / 4.1: `#61722` 
(`[fix](hive) Fix Hive DATE timezone shift in external readers`). Although the 
title says Hive, the Parquet part removes the same shared `offset_days` 
adjustment from `be/src/format/parquet/parquet_column_convert.h`, so the same 
principle should apply to Iceberg Parquet DATE reads. I checked locally that 
this fix is not an ancestor of the reported `4.0.5` or `4.0.2-rc02` tags.
   
   Recommended next steps:
   
   1. Backport/apply the Parquet DATE part of `#61722` to the 4.0 branch, and 
confirm it covers Iceberg as well as Hive because both use the shared Parquet 
physical-to-logical DATE converter.
   2. Add an Iceberg regression case for DATE reads under at least two Doris 
session time zones, e.g. `UTC` and a west-of-UTC zone such as `America/Mexico_City` 
or `-06:00`, using the repro rows from this issue.
   3. Ask the reporter to confirm `select @@time_zone;`, the Iceberg data file 
format (`parquet` vs `orc`), and whether `set time_zone = 'UTC'` makes Doris 
return the Spark dates. This is not needed to see the code bug, but it will 
confirm the exact runtime trigger in their deployment.
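   
   The property the regression case in step 2 should assert is that a decoded DATE is identical under every session time zone. A minimal Python sketch of that check follows. It is not a Doris regression suite: `find_timezone_dependent_days` and `decode_ignoring_timezone` are hypothetical names, and the decoder is a stand-in for a real query against the Iceberg table.

```python
from datetime import date, timedelta, timezone

EPOCH = date(1970, 1, 1)


def find_timezone_dependent_days(decode, day_counts, zones):
    """Return the stored day counts whose decoded DATE differs
    between the given session time zones."""
    failures = []
    for days in day_counts:
        decoded = {decode(days, tz) for tz in zones}
        if len(decoded) != 1:
            failures.append(days)
    return failures


def decode_ignoring_timezone(days, tz):
    """Hypothetical stand-in for a correct reader: the session
    time zone is accepted but not used."""
    return EPOCH + timedelta(days=days)


if __name__ == "__main__":
    zones = [timezone.utc, timezone(timedelta(hours=-6))]
    bad = find_timezone_dependent_days(decode_ignoring_timezone, [0, 1, 19000], zones)
    print("time-zone-dependent day counts:", bad)
```

   An empty result means every tested day count decodes to the same date in all tested zones. A reader that applies a session-derived offset would be reported for every day count.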
   
   No code was changed in this triage.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

