Is there any way to prevent the pushdown for int96 columns? At least that would prevent spark from crashing.
(Aside: this is kind of a bummer, as I was able to get the unit test passing last night; I guess it wouldn’t work for all cases though).

From: Ryan Blue <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, May 5, 2021 at 7:40 PM
To: "[email protected]" <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: Filtering on int96 timestamps

This is actually quite difficult. The main problem is that Parquet stats are unreliable for INT96 timestamps. That means that Iceberg won't have stats information for INT96 either because we get them from Parquet. There is no metadata to use for filter pushdown, so I'm not sure what you could do.

You might be able to build something to read all of the timestamps from a data file and produce a min/max range and then write that into Iceberg metadata directly. But that's a pretty low-level operation and I'm not sure it would be worth the investment.

I think it would be easier just to rewrite the dataset to use the timestamp formats that are better supported. You'll get a lot more reliable results that way, although it takes a lot of up-front work.

Ryan

On Wed, May 5, 2021 at 3:45 PM Scott Kruger <[email protected]> wrote:

I just submitted a github issue (https://github.com/apache/iceberg/issues/2553) related to iceberg’s inability to filter on int96 timestamp columns.
I’d like to take a crack at fixing this, but it feels like perhaps this is uncharted territory due to the backing type in iceberg (Long) not matching the backing type from the data (ByteArray). Is it appropriate to modify ParquetConversions.converterFromParquet to add special handling for icebergType = timestamp and parquetType = int96, or is a more fundamental change required?

--
Ryan Blue
Tabular
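
For anyone following the thread, here is a rough sketch of the "read all of the timestamps and produce a min/max range" idea, and of the Long vs. ByteArray mismatch raised in the original question. It is written in Python for brevity and is not Iceberg or Parquet library code. It assumes the commonly used INT96 layout (8 bytes of nanoseconds within the day followed by a 4-byte Julian day number, both little-endian) and that the target is microseconds since the Unix epoch as a long. The function names int96_to_micros, micros_to_int96, and min_max_micros are made up for illustration.

```python
import struct

JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588   # Julian day number of 1970-01-01
MICROS_PER_DAY = 86_400_000_000


def int96_to_micros(raw: bytes) -> int:
    """Decode one 12-byte INT96 timestamp to microseconds since the Unix epoch.

    Assumed layout: little-endian int64 nanoseconds-of-day, then
    little-endian int32 Julian day number.
    """
    if len(raw) != 12:
        raise ValueError(f"expected 12 bytes, got {len(raw)}")
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return days_since_epoch * MICROS_PER_DAY + nanos_of_day // 1000


def micros_to_int96(micros: int) -> bytes:
    """Inverse of int96_to_micros; hypothetical helper used only to build sample values."""
    days, micros_of_day = divmod(micros, MICROS_PER_DAY)
    return struct.pack("<qi", micros_of_day * 1000, days + JULIAN_DAY_OF_UNIX_EPOCH)


def min_max_micros(values):
    """Return (min, max) in epoch microseconds for an iterable of INT96 values, or None if empty."""
    decoded = [int96_to_micros(v) for v in values]
    if not decoded:
        return None
    return min(decoded), max(decoded)


if __name__ == "__main__":
    early = micros_to_int96(1_000_000)        # 1970-01-01 00:00:01
    late = micros_to_int96(MICROS_PER_DAY)    # 1970-01-02 00:00:00

    # Comparing the raw bytes puts these two values in the wrong order,
    # because the nanoseconds-of-day field comes first in the layout.
    print("raw bytes say early > late:", early > late)

    # Decoding first gives a chronologically correct range.
    print("decoded min/max:", min_max_micros([late, early]))
```

The byte-order comparison in the demo is one plausible reason min/max stats computed over the raw values cannot be trusted; the thread itself only says the stats are unreliable. Writing the resulting range into Iceberg metadata is the low-level step Ryan mentions and is not shown here.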
