Sorry, my bad for not checking out the github issue! Looks like we need to update `ParquetMetricsRowGroupFilter` to ignore INT96 stats. I think we can probably do that when getting the stats from file metadata, here: https://github.com/apache/iceberg/blob/master/parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java#L97
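To make the idea concrete, here's a minimal sketch of that guard. This is not Iceberg's actual `ParquetMetricsRowGroupFilter` code; the class, method, and parameter names below are hypothetical and only illustrate the approach: treat INT96 stats as missing, so the filter answers "rows might match" instead of attempting a typed comparison that would throw.

```java
// Hypothetical sketch, not Iceberg's real ParquetMetricsRowGroupFilter.
// Illustrates skipping INT96 column stats during row-group evaluation.
class Int96StatsGuard {
  static final boolean ROWS_MIGHT_MATCH = true;

  // Stand-in for evaluating an equality predicate against a column
  // chunk's min/max stats. primitiveTypeName mimics the Parquet
  // physical type of the column.
  static boolean rowsMightMatch(String primitiveTypeName, Long lowerBound,
                                Long upperBound, long literal) {
    // Parquet min/max stats for INT96 are unreliable, and Iceberg keeps
    // no metadata for those columns, so skip the comparison entirely
    // instead of casting the Binary stats value to a long and throwing
    // a ClassCastException.
    if ("INT96".equals(primitiveTypeName) || lowerBound == null || upperBound == null) {
      return ROWS_MIGHT_MATCH; // cannot prune this row group
    }
    // Normal path: prune only when the literal falls outside the bounds.
    return literal >= lowerBound && literal <= upperBound;
  }
}
```

In the real filter, the check would sit where the bounds are read from the Parquet column chunk metadata, before any comparison against the literal.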
On Thu, May 6, 2021 at 10:30 AM Scott Kruger <[email protected]> wrote:

> The github issue has a stack trace (as well as a diff to a unit test that
> illustrates the problem). I can add some more context if needed. The gist
> is that a ClassCastException is thrown because, during the filter step,
> when it’s comparing the value against the upper/lower bounds, it tries to
> compare the literal value (a long from the timestamp filter) to a
> binary/bytearray value (because the converter for int96 returns a byte
> array).
>
> *From:* Ryan Blue <[email protected]>
> *Date:* Thursday, May 6, 2021 at 12:12 PM
> *To:* "[email protected]" <[email protected]>
> *Subject:* Re: Filtering on int96 timestamps
>
> Scott, can you tell me more about where this is failing?
>
> I think we should definitely update Iceberg to ignore INT96 columns when
> filtering. I'm surprised that it isn't already happening since we don't
> have metadata for those columns. Sounds like it may be somewhere in our
> Parquet row group or dictionary filters. A stack trace would help us track
> it down if you can share one.
>
> On Thu, May 6, 2021 at 7:15 AM Scott Kruger <[email protected]> wrote:
>
> Is there any way to prevent the pushdown for int96 columns? At least that
> would prevent Spark from crashing.
>
> (Aside: this is kind of a bummer, as I was able to get the unit test
> passing last night; I guess it wouldn’t work for all cases though.)
> *From:* Ryan Blue <[email protected]>
> *Date:* Wednesday, May 5, 2021 at 7:40 PM
> *To:* "[email protected]" <[email protected]>
> *Subject:* Re: Filtering on int96 timestamps
>
> This is actually quite difficult.
>
> The main problem is that Parquet stats are unreliable for INT96
> timestamps. That means that Iceberg won't have stats information for
> INT96 either, because we get them from Parquet. There is no metadata to
> use for filter pushdown, so I'm not sure what you could do. You might be
> able to build something to read all of the timestamps from a data file
> and produce a min/max range and then write that into Iceberg metadata
> directly. But that's a pretty low-level operation and I'm not sure it
> would be worth the investment.
>
> I think it would be easier just to rewrite the dataset to use the
> timestamp formats that are better supported. You'll get a lot more
> reliable results that way, although it takes a lot of up-front work.
>
> Ryan
>
> On Wed, May 5, 2021 at 3:45 PM Scott Kruger <[email protected]> wrote:
>
> I just submitted a github issue
> (https://github.com/apache/iceberg/issues/2553) related to iceberg’s
> inability to filter on int96 timestamp columns.
> I’d like to take a crack at fixing this, but it feels like perhaps this
> is uncharted territory, due to the backing type in iceberg (Long) not
> matching the backing type from the data (ByteArray).
>
> Is it appropriate to modify ParquetConversions.converterFromParquet to
> add special handling for icebergType = timestamp and parquetType = int96,
> or is a more fundamental change required?

--
Ryan Blue
Tabular
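For anyone following along: whatever shape the fix takes, the INT96 decoding itself is well defined. A Parquet INT96 timestamp is 12 little-endian bytes: 8 bytes of nanoseconds within the day followed by a 4-byte Julian day number. Below is a self-contained sketch of the conversion Scott asks about; it is not Iceberg's actual `ParquetConversions` code, and the class and method names are illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative sketch, not Iceberg's real ParquetConversions code.
class Int96Conversions {
  // Julian day number of the Unix epoch, 1970-01-01.
  private static final long JULIAN_EPOCH_DAY = 2_440_588L;
  private static final long MICROS_PER_DAY = 86_400L * 1_000_000L;

  // Convert a 12-byte Parquet INT96 value (8 bytes nanos-of-day +
  // 4 bytes Julian day, both little-endian) to microseconds since the
  // Unix epoch, Iceberg's backing representation for timestamps.
  static long int96ToEpochMicros(byte[] int96) {
    ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
    long nanosOfDay = buf.getLong();
    long julianDay = Integer.toUnsignedLong(buf.getInt());
    return (julianDay - JULIAN_EPOCH_DAY) * MICROS_PER_DAY + nanosOfDay / 1_000L;
  }
}
```

A converter like this only addresses reading values; as discussed above, the row-group stats for INT96 columns would still have to be ignored rather than converted.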
