The GitHub issue has a stack trace (as well as a diff to a unit test that illustrates the problem). I can add some more context if needed. The gist is that a ClassCastException is thrown because, during the filter step, when it compares the value against the upper/lower bounds, it tries to compare the literal value (a long from the timestamp filter) to a binary/byte-array value (because the converter for int96 returns a byte array).
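Roughly, the failure boils down to something like this (a minimal standalone sketch of the type mismatch, not Iceberg's actual filter code):

    // Sketch only: the filter literal for a timestamp predicate is a Long
    // (microseconds since epoch), but the INT96 converter hands the column
    // min/max back as raw bytes, so the raw compareTo call fails at the cast.
    public class Int96FilterCrash {
      @SuppressWarnings({"rawtypes", "unchecked"})
      public static void main(String[] args) {
        Comparable literal = Long.valueOf(1_620_000_000_000_000L); // filter value in micros
        Object statMin = new byte[12]; // what the INT96 converter yields for a bound
        literal.compareTo(statMin);    // ClassCastException: [B cannot be cast to java.lang.Long
      }
    }
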
From: Ryan Blue <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, May 6, 2021 at 12:12 PM
To: "[email protected]" <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: Filtering on int96 timestamps

Scott, can you tell me more about where this is failing?

I think we should definitely update Iceberg to ignore INT96 columns when filtering. I'm surprised that it isn't already happening since we don't have metadata for those columns. Sounds like it may be somewhere in our Parquet row group or dictionary filters. A stack trace would help us track it down if you can share one.

On Thu, May 6, 2021 at 7:15 AM Scott Kruger <[email protected]> wrote:

Is there any way to prevent the pushdown for int96 columns? At least that would prevent Spark from crashing. (Aside: this is kind of a bummer, as I was able to get the unit test passing last night; I guess it wouldn't work for all cases though.)

From: Ryan Blue <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, May 5, 2021 at 7:40 PM
To: "[email protected]" <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: Filtering on int96 timestamps

This is actually quite difficult. The main problem is that Parquet stats are unreliable for INT96 timestamps. That means that Iceberg won't have stats information for INT96 either because we get them from Parquet. There is no metadata to use for filter pushdown, so I'm not sure what you could do.

You might be able to build something to read all of the timestamps from a data file and produce a min/max range and then write that into Iceberg metadata directly. But that's a pretty low-level operation and I'm not sure it would be worth the investment.

I think it would be easier just to rewrite the dataset to use the timestamp formats that are better supported. You'll get a lot more reliable results that way, although it takes a lot of up-front work.

Ryan

On Wed, May 5, 2021 at 3:45 PM Scott Kruger <[email protected]> wrote:

I just submitted a GitHub issue (https://github.com/apache/iceberg/issues/2553) related to Iceberg's inability to filter on int96 timestamp columns. I'd like to take a crack at fixing this, but it feels like perhaps this is uncharted territory due to the backing type in Iceberg (Long) not matching the backing type from the data (ByteArray). Is it appropriate to modify ParquetConversions.converterFromParquet to add special handling for icebergType = timestamp and parquetType = int96, or is a more fundamental change required?

--
Ryan Blue
Tabular
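For what it's worth, the special handling asked about at the bottom of the thread would hinge on decoding INT96 into Iceberg's long-backed timestamp representation. A minimal sketch, assuming the usual Impala/Spark INT96 layout (8 little-endian bytes of nanos-of-day followed by 4 little-endian bytes of Julian day); int96ToMicros is a hypothetical helper, not part of Iceberg's API:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class Int96Conversions {
      // Julian day number of the Unix epoch, 1970-01-01.
      private static final long UNIX_EPOCH_JULIAN_DAY = 2_440_588L;
      private static final long MICROS_PER_DAY = 86_400L * 1_000_000L;

      // Hypothetical helper: decode a 12-byte INT96 timestamp into
      // microseconds since the Unix epoch (Iceberg's backing long).
      static long int96ToMicros(byte[] int96) {
        ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong();              // first 8 bytes
        long julianDay = buf.getInt() & 0xFFFFFFFFL;  // last 4 bytes, unsigned
        return (julianDay - UNIX_EPOCH_JULIAN_DAY) * MICROS_PER_DAY + nanosOfDay / 1_000L;
      }
    }

Even with such a conversion in converterFromParquet, the underlying Parquet stats would still be unreliable for INT96, which is why simply skipping those columns in the row group and dictionary filters may be the safer fix.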
