mbutrovich opened a new issue, #2299:
URL: https://github.com/apache/iceberg-rust/issues/2299

   ### Describe the bug
   
   iceberg-rust reads INT96 timestamps incorrectly, producing a ~1170-year offset for dates outside the range representable as nanoseconds in an i64 (roughly 1677-2262).
   
   **Example:**
   - Correct (Iceberg Java): `3332-12-14 11:33:10.965`
   - iceberg-rust: `2163-11-05 13:24:03.545896`
   
   This affects migrated tables where Parquet files were written with INT96 
timestamps (common for Spark/Hive migrations via `add_files` or 
`importSparkTable`).
   
   ### Root Cause
   
   #### INT96 in Parquet
   INT96 is 12 bytes, little-endian: the first 8 bytes hold nanoseconds within the day, and the last 4 bytes hold a Julian day number.
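
   A minimal sketch of that decoding in Rust, standard library only (the function name is illustrative, not iceberg-rust API):

   ```rust
   /// Decode a raw 12-byte INT96 value into its two components.
   /// Layout (little-endian): bytes 0..8 = nanoseconds within the day,
   /// bytes 8..12 = Julian day number.
   fn decode_int96(raw: &[u8; 12]) -> (i64, i32) {
       let nanos_of_day = i64::from_le_bytes(raw[0..8].try_into().unwrap());
       let julian_day = i32::from_le_bytes(raw[8..12].try_into().unwrap());
       (nanos_of_day, julian_day)
   }

   fn main() {
       // Julian day 2440588 is 1970-01-01 (the Unix epoch), at midnight.
       let mut raw = [0u8; 12];
       raw[8..12].copy_from_slice(&2_440_588i32.to_le_bytes());
       assert_eq!(decode_int96(&raw), (0, 2_440_588));
   }
   ```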
   
   #### What happens today
   
   1. arrow-rs defaults INT96 to `Timestamp(Nanosecond, None)` 
([`parquet/src/arrow/schema/primitive.rs:122`](https://github.com/apache/arrow-rs/blob/main/parquet/src/arrow/schema/primitive.rs#L122))
   2. For dates outside ~1677-2262, nanoseconds-since-epoch overflows i64, 
producing garbage values
   3. iceberg-rust's `RecordBatchTransformer` later casts to 
`Timestamp(Microsecond)` to match the Iceberg schema, but the data is already 
corrupted by overflow
   4. [arrow-rs PR #7285](https://github.com/apache/arrow-rs/pull/7285) added 
support for reading INT96 as other TimeUnits — if you pass 
`Timestamp(Microsecond)` via `ArrowReaderOptions::with_schema()`, arrow-rs 
converts correctly without overflow
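
   The overflow in step 2 is easy to reproduce with plain arithmetic (day counts below are approximate, for illustration only):

   ```rust
   const NANOS_PER_DAY: i64 = 86_400 * 1_000_000_000;
   const MICROS_PER_DAY: i64 = 86_400 * 1_000_000;

   /// Can `days` since the epoch be represented as i64 nanoseconds?
   fn fits_as_nanos(days: i64) -> bool {
       days.checked_mul(NANOS_PER_DAY).is_some()
   }

   /// Can `days` since the epoch be represented as i64 microseconds?
   fn fits_as_micros(days: i64) -> bool {
       days.checked_mul(MICROS_PER_DAY).is_some()
   }

   fn main() {
       // i64 nanoseconds cover only ~292 years either side of 1970 (~1677-2262).
       // A date in 3332 is roughly 497,800 days past the epoch: as nanoseconds
       // it overflows, as microseconds it fits with huge headroom (~292,000 years).
       let days_to_3332: i64 = 497_800;
       assert!(!fits_as_nanos(days_to_3332));
       assert!(fits_as_micros(days_to_3332));
   }
   ```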
   
   #### Why iceberg-rust doesn't pass the right schema hint
   
   In `reader.rs`, the schema is only overridden via `ArrowReaderOptions::with_schema()` when Parquet files lack field IDs (branches 2/3 of the schema resolution strategy). Even then, the overridden schema is derived from the Parquet file metadata, which has `Timestamp(Nanosecond)` for INT96 columns, rather than from the Iceberg table schema, which correctly specifies `Timestamp(Microsecond)`.
   
   For files with embedded field IDs (branch 1), no schema override is passed 
at all.
   
   ### How Iceberg Java handles this
   
   Iceberg Java avoids this entirely by using a **custom INT96 column reader** 
that bypasses parquet-mr's default decoding. The reader factory receives the 
Iceberg expected schema as the authority via 
`readerFuncWithSchema.apply(expectedSchema, fileType)` 
([`Parquet.java:1366-1371`](https://github.com/apache/iceberg/blob/main/parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java#L1366-L1371)).
   
   When `BaseParquetReaders.primitive()` encounters INT96, it dispatches to a 
`TimestampInt96Reader` that reads the raw 12 bytes and converts safely:
   
   ```java
   // GenericParquetReaders.java:172-191
   final ByteBuffer byteBuffer =
       column.nextBinary().toByteBuffer().order(ByteOrder.LITTLE_ENDIAN);
   final long timeOfDayNanos = byteBuffer.getLong();
   final int julianDay = byteBuffer.getInt();

   return Instant.ofEpochMilli(TimeUnit.DAYS.toMillis(julianDay - UNIX_EPOCH_JULIAN))
       .plusNanos(timeOfDayNanos)
       .atOffset(ZoneOffset.UTC);
   ```
   
   This avoids overflow by keeping days and nanos separate — it never tries to 
cram the full value into a single i64 nanoseconds-since-epoch.
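
   A Rust equivalent of the same days-plus-nanos arithmetic, converting straight to microseconds (a sketch; the function name is illustrative, `UNIX_EPOCH_JULIAN` matches the constant used in Iceberg Java):

   ```rust
   /// Julian day number of 1970-01-01 (the Unix epoch).
   const UNIX_EPOCH_JULIAN: i64 = 2_440_588;

   /// Convert an INT96 (julian_day, nanos_of_day) pair to microseconds since
   /// the Unix epoch, keeping days and nanos separate so nothing overflows.
   fn int96_to_micros(julian_day: i32, nanos_of_day: i64) -> i64 {
       let days = julian_day as i64 - UNIX_EPOCH_JULIAN;
       days * 86_400 * 1_000_000 + nanos_of_day / 1_000
   }

   fn main() {
       // One day after the epoch, at 00:00:00.000001.
       assert_eq!(int96_to_micros(2_440_589, 1_000), 86_400_000_001);
       // The epoch itself.
       assert_eq!(int96_to_micros(2_440_588, 0), 0);
   }
   ```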
   
   iceberg-rust can't easily replicate this custom column reader approach since 
it delegates to arrow-rs for Parquet reading. The equivalent fix is to pass the 
correct schema hint so arrow-rs decodes INT96 as microseconds.
   
   ### Proposed Fix
   
   When building the Arrow schema to pass to 
`ArrowReaderOptions::with_schema()`, overlay the Iceberg table schema's 
timestamp types onto the Parquet-derived schema. For any column where:
   - The Parquet physical type is INT96
   - The Iceberg type is Timestamp or Timestamptz
   
   Replace `Timestamp(Nanosecond, ...)` with `Timestamp(Microsecond, ...)` in 
the schema hint. This triggers arrow-rs's INT96 conversion logic from [PR 
#7285](https://github.com/apache/arrow-rs/pull/7285).
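
   The overlay itself is a small schema walk. Sketched below with simplified stand-ins for arrow-rs's `TimeUnit`/`Field` types (the real fix would operate on `arrow_schema::Field`s; everything here is illustrative):

   ```rust
   #[derive(Clone, Copy, Debug, PartialEq)]
   enum TimeUnit { Nanosecond, Microsecond }

   /// Simplified model of one column in the schema hint passed to
   /// `ArrowReaderOptions::with_schema()`.
   #[derive(Debug, PartialEq)]
   struct ColumnHint {
       parquet_is_int96: bool,
       unit: TimeUnit,
   }

   /// Overlay the Iceberg schema's timestamp unit onto INT96 columns so that
   /// arrow-rs decodes them as microseconds instead of overflowing nanoseconds.
   fn overlay_int96_as_micros(hint: &mut ColumnHint, iceberg_is_timestamp: bool) {
       if hint.parquet_is_int96 && iceberg_is_timestamp {
           hint.unit = TimeUnit::Microsecond;
       }
   }

   fn main() {
       let mut col = ColumnHint { parquet_is_int96: true, unit: TimeUnit::Nanosecond };
       overlay_int96_as_micros(&mut col, true);
       assert_eq!(col.unit, TimeUnit::Microsecond);

       // Non-INT96 columns are left untouched.
       let mut plain = ColumnHint { parquet_is_int96: false, unit: TimeUnit::Nanosecond };
       overlay_int96_as_micros(&mut plain, true);
       assert_eq!(plain.unit, TimeUnit::Nanosecond);
   }
   ```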
   
   This is the same approach DataFusion uses via its 
`coerce_int96_to_resolution()` function ([datafusion PR 
#15537](https://github.com/apache/datafusion/pull/15537)), except the source of 
truth for the target TimeUnit is the Iceberg schema rather than a user config.
   
   #### Files to modify
   
   1. `crates/iceberg/src/arrow/reader.rs`
      - After building the Arrow schema from Parquet metadata, walk INT96 
timestamp columns and replace their types with the Iceberg schema's timestamp 
type
      - This applies to all three branches of the schema resolution strategy 
(with/without field IDs, with/without name mapping)
   
   ### Related
   
   - [arrow-rs #7285](https://github.com/apache/arrow-rs/pull/7285): Support 
different TimeUnits and timezones when reading Timestamps from INT96
   - [datafusion #15537](https://github.com/apache/datafusion/pull/15537): 
INT96 handling in DataFusion
   - [datafusion-comet 
#3856](https://github.com/apache/datafusion-comet/issues/3856): Downstream 
issue in Comet

