Re: Filtering on int96 timestamps

Ryan Blue Thu, 06 May 2021 10:12:10 -0700

Scott, can you tell me more about where this is failing?

I think we should definitely update Iceberg to ignore INT96 columns when
filtering. I'm surprised that it isn't already happening since we don't
have metadata for those columns. Sounds like it may be somewhere in our
Parquet row group or dictionary filters. A stack trace would help us track
it down if you can share one.


On Thu, May 6, 2021 at 7:15 AM Scott Kruger <[email protected]>
wrote:

> Is there any way to prevent the pushdown for int96 columns? At least that
> would prevent spark from crashing.
>
>
>
> (Aside: this is kind of a bummer, as I was able to get the unit test
> passing last night; I guess it wouldn’t work for all cases though).
>
>
>
> *From: *Ryan Blue <[email protected]>
> *Reply-To: *"[email protected]" <[email protected]>
> *Date: *Wednesday, May 5, 2021 at 7:40 PM
> *To: *"[email protected]" <[email protected]>
> *Cc: *"[email protected]" <[email protected]>
> *Subject: *Re: Filtering on int96 timestamps
>
>
>
> This message contains hyperlinks, take precaution before opening these
> links.
>
> This is actually quite difficult.
>
>
>
> The main problem is that Parquet stats are unreliable for INT96
> timestamps. That means that Iceberg won't have stats information for INT96
> either because we get them from Parquet. There is no metadata to use for
> filter pushdown, so I'm not sure what you could do. You might be able to
> build something to read all of the timestamps from a data file and produce
> a min/max range and then write that into Iceberg metadata directly. But
> that's a pretty low-level operation and I'm not sure it would be worth the
> investment.
>
>
>
> I think it would be easier just to rewrite the dataset to use the
> timestamp formats that are better supported. You'll get a lot more reliable
> results that way, although it takes a lot of up-front work.
>
>
>
> Ryan
>
>
>
> On Wed, May 5, 2021 at 3:45 PM Scott Kruger <[email protected]>
> wrote:
>
> I just submitted a github issue (
> https://github.com/apache/iceberg/issues/2553
> <https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Ficeberg%2Fissues%2F2553&data=04%7C01%7Csckruger%40paypal.com%7Ce96c8d125111408f5f8a08d9102778dc%7Cfb00791460204374977e21bac5f3f4c8%7C0%7C0%7C637558584422640190%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=ptjOqUq%2BMBny%2FYpdhnFw5VxKxKOE%2BBI3O3rRc1B7Yic%3D&reserved=0>)
> related to iceberg’s inability to filter on int96 timestamp columns. I’d
> like to take a crack at fixing this, but it feels like perhaps this is
> uncharted territory due to the backing type in iceberg (Long) not matching
> the backing type from the data (ByteArray).
>
>
>
> Is it appropriate to modify ParquetConversions.converterFromParquet to add
> special handling for icebergType = timestamp and parquetType = int96, or is
> a more fundamental change required?
>
>
>
>
> --
>
> Ryan Blue
>
> Tabular
>


-- 
Ryan Blue

Re: Filtering on int96 timestamps

Reply via email to