This is actually quite difficult.

The main problem is that Parquet stats are unreliable for INT96 timestamps.
That means Iceberg won't have stats for INT96 columns either, because
Iceberg derives its column stats from Parquet. With no metadata to use for
filter pushdown, I'm not sure what you could do. You might be able to build
something that reads all of the timestamps from a data file, produces a
min/max range, and then writes that into the Iceberg metadata directly. But
that's a pretty low-level operation and I'm not sure it would be worth the
investment.
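If you did go that route, the first step would be decoding the raw INT96
values. As a rough sketch (not an Iceberg API): an INT96 timestamp packs
nanoseconds-of-day into the first 8 bytes and a Julian day number into the
last 4, both little-endian, so you can convert each 12-byte value to the
microseconds-since-epoch representation Iceberg uses and take the min/max.
The function names here are just illustrative.

```python
import struct

# Julian day number of the Unix epoch, 1970-01-01
JULIAN_EPOCH_DAY = 2440588

def int96_to_micros(raw: bytes) -> int:
    """Decode a 12-byte Parquet INT96 timestamp to microseconds since epoch."""
    if len(raw) != 12:
        raise ValueError("INT96 values are exactly 12 bytes")
    # first 8 bytes: signed nanos within the day; last 4: unsigned Julian day
    nanos_of_day, julian_day = struct.unpack("<qI", raw)
    days = julian_day - JULIAN_EPOCH_DAY
    return days * 86_400_000_000 + nanos_of_day // 1000

def min_max_micros(values: list[bytes]) -> tuple[int, int]:
    """Produce a (min, max) micros range from raw INT96 values."""
    micros = [int96_to_micros(v) for v in values]
    return min(micros), max(micros)
```

Actually pushing that range back into the table metadata would mean working
with Iceberg's internals directly, which is the part I doubt is worth it.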

I think it would be easier just to rewrite the dataset to use a timestamp
format that is better supported. You'll get much more reliable results that
way, although it takes a lot of up-front work.

Ryan

On Wed, May 5, 2021 at 3:45 PM Scott Kruger <[email protected]>
wrote:

> I just submitted a github issue (
> https://github.com/apache/iceberg/issues/2553) related to iceberg’s
> inability to filter on int96 timestamp columns. I’d like to take a crack at
> fixing this, but it feels like perhaps this is uncharted territory due to
> the backing type in iceberg (Long) not matching the backing type from the
> data (ByteArray).
>
>
>
> Is it appropriate to modify ParquetConversions.converterFromParquet to add
> special handling for icebergType = timestamp and parquetType = int96, or is
> a more fundamental change required?
>


-- 
Ryan Blue
Tabular
