Re: Filtering on int96 timestamps

Scott Kruger Thu, 06 May 2021 07:15:22 -0700

Is there any way to prevent the pushdown for int96 columns? At least that would 
prevent Spark from crashing.

(Aside: this is kind of a bummer, as I was able to get the unit test passing 
last night; I guess it wouldn’t work for all cases though).

From: Ryan Blue <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, May 5, 2021 at 7:40 PM
To: "[email protected]" <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: Filtering on int96 timestamps

This is actually quite difficult.

The main problem is that Parquet stats are unreliable for INT96 timestamps. 
That means that Iceberg won't have stats information for INT96 either because 
we get them from Parquet. There is no metadata to use for filter pushdown, so 
I'm not sure what you could do. You might be able to build something to read 
all of the timestamps from a data file and produce a min/max range and then 
write that into Iceberg metadata directly. But that's a pretty low-level 
operation and I'm not sure it would be worth the investment.
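
As a rough illustration of the "read all of the timestamps and produce a min/max range" idea, here is a minimal Python sketch. It assumes the INT96 layout commonly written by Impala, Hive, and Spark (8 bytes of little-endian nanoseconds within the day, followed by a 4-byte little-endian Julian day number). The function names are hypothetical, and the sketch neither reads a Parquet file nor writes Iceberg metadata; it only shows the decoding and range computation, and why a byte-wise min/max on the raw values would not be usable.

```python
import struct

JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588  # Julian day number of 1970-01-01
MICROS_PER_DAY = 86_400_000_000


def int96_to_micros(raw: bytes) -> int:
    """Decode one 12-byte INT96 value to microseconds since the Unix epoch.

    Assumed layout: 8 bytes little-endian nanoseconds of day,
    then 4 bytes little-endian Julian day number.
    """
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return days_since_epoch * MICROS_PER_DAY + nanos_of_day // 1000


def timestamp_range(raw_values):
    """Return (min, max) in microseconds, skipping nulls; None if no values."""
    decoded = [int96_to_micros(v) for v in raw_values if v is not None]
    if not decoded:
        return None
    return min(decoded), max(decoded)


# Two hand-built values: 5 microseconds after the epoch, and exactly one day after it.
early = struct.pack("<qi", 5_000, 2_440_588)
late = struct.pack("<qi", 0, 2_440_589)

print(timestamp_range([late, None, early]))  # range over the decoded values
print(min(early, late) == late)  # byte-wise order puts the later timestamp first
```

Because the nanoseconds field comes first in the byte layout, comparing the raw bytes does not follow chronological order, so any range has to be computed on decoded values as above.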

I think it would be easier just to rewrite the dataset to use the timestamp 
formats that are better supported. You'll get a lot more reliable results that 
way, although it takes a lot of up-front work.

Ryan

On Wed, May 5, 2021 at 3:45 PM Scott Kruger <[email protected]> wrote:
I just submitted a GitHub issue 
(https://github.com/apache/iceberg/issues/2553) related to Iceberg's inability 
to filter on int96 timestamp columns. I'd like to take a crack at fixing this, 
but it feels like perhaps this is uncharted territory due to the backing type 
in Iceberg (Long) not matching the backing type from the data (ByteArray).

Is it appropriate to modify ParquetConversions.converterFromParquet to add 
special handling for icebergType = timestamp and parquetType = int96, or is a 
more fundamental change required?
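
To make the type mismatch concrete, the kind of special case being asked about can be sketched as follows. This is a hypothetical Python analogue, not Iceberg's Java code: `converter_from_parquet` here only mirrors the shape of the idea (pick a converter from the Iceberg type and the Parquet physical type), and the type names are plain strings chosen for illustration. The INT96 layout is assumed to be 8 bytes of little-endian nanoseconds of day followed by a 4-byte little-endian Julian day.

```python
JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588  # Julian day number of 1970-01-01
MICROS_PER_DAY = 86_400_000_000


def int96_bytes_to_long(raw: bytes) -> int:
    """Turn a 12-byte INT96 value into a long: microseconds since the epoch."""
    if len(raw) != 12:
        raise ValueError(f"expected 12 bytes, got {len(raw)}")
    nanos_of_day = int.from_bytes(raw[:8], "little")
    julian_day = int.from_bytes(raw[8:], "little", signed=True)
    return (julian_day - JULIAN_DAY_OF_UNIX_EPOCH) * MICROS_PER_DAY + nanos_of_day // 1000


def converter_from_parquet(iceberg_type: str, parquet_type: str):
    """Hypothetical dispatch: return a function mapping a Parquet value to the Iceberg value."""
    if iceberg_type == "timestamp" and parquet_type == "int96":
        # Special case: the data is a byte array, but the Iceberg side expects a long.
        return int96_bytes_to_long
    # Everything else passes through unchanged in this sketch.
    return lambda value: value
```

A converter like this only covers individual values. It does not make the file-level statistics for INT96 columns trustworthy.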

--
Ryan Blue
Tabular
