wombatu-kun opened a new pull request, #16619: URL: https://github.com/apache/iceberg/pull/16619
## Problem

Filtering a `timestamp_ns` or `timestamptz_ns` column through the parquet-mr `ReadSupport` read path (`Parquet.read(...).callInit().filter(...)`, the path used when no Iceberg-native `createReaderFunc` is set) silently returned the wrong rows. A nanosecond predicate matched every row regardless of the filter boundary, so sub-microsecond filtering was effectively ignored. No exception was thrown.

## Root cause

Two complementary gaps on the read path:

- `MessageTypeToType` ignored the timestamp unit and always mapped a Parquet `INT64 TIMESTAMP(NANOS)` column back to Iceberg's microsecond `TimestampType`. The filter path in `Parquet` builds its file schema via `ParquetSchemaUtil.convert(messageType)` and binds the user filter against it, so a nanosecond literal was bound as microseconds (~1.7e15) while the column data is raw nanoseconds (~1.7e18). Every row trivially satisfied the predicate.
- `ParquetFilters` had no `TIMESTAMP_NANO` case, so once the schema reported the correct nano type the conversion would instead throw `UnsupportedOperationException`.

Both fixes are required: the first restores the correct bound type, and the second pushes the predicate down as an `INT64` (long) column predicate compared directly against the raw nanoseconds, which is exact and needs no unit conversion.

This is the Parquet analog of the ORC fix in #16609.

## Changes

- `MessageTypeToType`: honor `TimestampLogicalTypeAnnotation.getUnit()`. `NANOS` maps to `TimestampNanoType` (with/without zone); other units keep `TimestampType`.
- `ParquetFilters`: add `case TIMESTAMP_NANO` to the long-column predicate group.

## Tests

- `TestParquetSchemaUtil.testTimestampNanoConversionPreservesUnit` - a Parquet `TIMESTAMP(NANOS)` / `TIMESTAMPTZ(NANOS)` schema converts back to `timestamp_ns` / `timestamptz_ns`, with micros left unchanged.
- `TestParquet.timestampNanoFilterRespectsNanoseconds` and `timestamptzNanoFilterAcrossTimezones` - end-to-end through the `ReadSupport` read path: five rows differing only by sub-microsecond nanoseconds filter to exactly the expected ids, including a multi-timezone `timestamptz_ns` variant where a row sits exactly on the filter boundary in a different offset.
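
The unit mismatch described under "Root cause" can be illustrated with plain `java.time` arithmetic. The sketch below is only an illustration: it does not call any Iceberg or parquet-mr API, the class and helper names are hypothetical, and the `>=` comparison stands in for whatever predicate is pushed down. The five rows and the boundary are made-up values, chosen to mirror the shape of the tests (rows that differ only by sub-microsecond nanoseconds).

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

// Illustration only: plain java.time arithmetic, no Iceberg or parquet-mr calls.
// All names in this class are hypothetical.
class NanoFilterMismatch {

  // Nanoseconds since the Unix epoch, the raw value an INT64 TIMESTAMP(NANOS) column stores.
  static long epochNanos(Instant ts) {
    return ChronoUnit.NANOS.between(Instant.EPOCH, ts);
  }

  // Microseconds since the Unix epoch, i.e. the literal as bound against a micros type.
  static long epochMicros(Instant ts) {
    return ChronoUnit.MICROS.between(Instant.EPOCH, ts);
  }

  // Five rows that differ only in their sub-microsecond part: base + 0, 100, ..., 400 ns.
  static long[] sampleRows(Instant base) {
    long[] rows = new long[5];
    for (int i = 0; i < rows.length; i++) {
      rows[i] = epochNanos(base.plusNanos(i * 100L));
    }
    return rows;
  }

  // Stand-in for a pushed-down "column >= literal" predicate on raw long values.
  static List<Integer> matchingIds(long[] rows, long literal) {
    List<Integer> ids = new ArrayList<>();
    for (int id = 0; id < rows.length; id++) {
      if (rows[id] >= literal) {
        ids.add(id);
      }
    }
    return ids;
  }

  public static void main(String[] args) {
    Instant base = Instant.parse("2024-01-01T00:00:00Z");
    Instant boundary = base.plusNanos(300);
    long[] rows = sampleRows(base);

    // Literal bound in the wrong unit: about 1.7e15 compared against values near 1.7e18.
    System.out.println("micros literal: " + matchingIds(rows, epochMicros(boundary)));
    // Literal bound in the column's own unit: only rows at or after the boundary match.
    System.out.println("nanos literal:  " + matchingIds(rows, epochNanos(boundary)));
  }
}
```

With the micros-scale literal every id (0 through 4) matches, because each stored value is roughly a thousand times larger than the literal. With the nanos-scale literal only ids 3 and 4 match.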
