The GitHub issue has a stack trace (as well as a diff to a unit test that illustrates the problem). I can add some more context if needed. The gist is that a ClassCastException is thrown because, during the filter step, when it compares the value against the upper/lower bounds, it tries to compare the literal value (a long from the timestamp filter) to a binary/byte-array value (because the converter for int96 returns a byte array).
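Roughly, the failure boils down to something like this (a minimal standalone sketch of the type mismatch, not Iceberg's actual filter code):

    // Sketch only: the filter literal for a timestamp predicate is a Long
    // (microseconds since epoch), but the INT96 converter hands the column
    // min/max back as raw bytes, so the raw compareTo call fails at the cast.
    public class Int96FilterCrash {
      @SuppressWarnings({"rawtypes", "unchecked"})
      public static void main(String[] args) {
        Comparable literal = Long.valueOf(1_620_000_000_000_000L); // filter value in micros
        Object statMin = new byte[12]; // what the INT96 converter yields for a bound
        literal.compareTo(statMin);    // ClassCastException: [B cannot be cast to java.lang.Long
      }
    }
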
From: Ryan Blue <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, May 6, 2021 at 12:12 PM
To: "[email protected]" <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: Filtering on int96 timestamps

Scott, can you tell me more about where this is failing?

I think we should definitely update Iceberg to ignore INT96 columns when filtering. I'm surprised that it isn't already happening since we don't have metadata for those columns. Sounds like it may be somewhere in our Parquet row group or dictionary filters. A stack trace would help us track it down if you can share one.

On Thu, May 6, 2021 at 7:15 AM Scott Kruger <[email protected]> wrote:

Is there any way to prevent the pushdown for int96 columns? At least that would prevent Spark from crashing. (Aside: this is kind of a bummer, as I was able to get the unit test passing last night; I guess it wouldn't work for all cases though.)

From: Ryan Blue <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, May 5, 2021 at 7:40 PM
To: "[email protected]" <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: Filtering on int96 timestamps

This is actually quite difficult. The main problem is that Parquet stats are unreliable for INT96 timestamps. That means that Iceberg won't have stats information for INT96 either because we get them from Parquet. There is no metadata to use for filter pushdown, so I'm not sure what you could do.

You might be able to build something to read all of the timestamps from a data file and produce a min/max range and then write that into Iceberg metadata directly. But that's a pretty low-level operation and I'm not sure it would be worth the investment.

I think it would be easier just to rewrite the dataset to use the timestamp formats that are better supported. You'll get a lot more reliable results that way, although it takes a lot of up-front work.

Ryan

On Wed, May 5, 2021 at 3:45 PM Scott Kruger <[email protected]> wrote:

I just submitted a GitHub issue (https://github.com/apache/iceberg/issues/2553) related to Iceberg's inability to filter on int96 timestamp columns. I'd like to take a crack at fixing this, but it feels like perhaps this is uncharted territory due to the backing type in Iceberg (Long) not matching the backing type from the data (ByteArray). Is it appropriate to modify ParquetConversions.converterFromParquet to add special handling for icebergType = timestamp and parquetType = int96, or is a more fundamental change required?

--
Ryan Blue
Tabular
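For what it's worth, the special handling asked about at the bottom of the thread would hinge on decoding INT96 into Iceberg's long-backed timestamp representation. A minimal sketch, assuming the usual Impala/Spark INT96 layout (8 little-endian bytes of nanos-of-day followed by 4 little-endian bytes of Julian day); int96ToMicros is a hypothetical helper, not part of Iceberg's API:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class Int96Conversions {
      // Julian day number of the Unix epoch, 1970-01-01.
      private static final long UNIX_EPOCH_JULIAN_DAY = 2_440_588L;
      private static final long MICROS_PER_DAY = 86_400L * 1_000_000L;

      // Hypothetical helper: decode a 12-byte INT96 timestamp into
      // microseconds since the Unix epoch (Iceberg's backing long).
      static long int96ToMicros(byte[] int96) {
        ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong();              // first 8 bytes
        long julianDay = buf.getInt() & 0xFFFFFFFFL;  // last 4 bytes, unsigned
        return (julianDay - UNIX_EPOCH_JULIAN_DAY) * MICROS_PER_DAY + nanosOfDay / 1_000L;
      }
    }

Even with such a conversion in converterFromParquet, the underlying Parquet stats would still be unreliable for INT96, which is why simply skipping those columns in the row group and dictionary filters may be the safer fix.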
