Sorry, my bad for not checking out the github issue! Looks like we need to update `ParquetMetricsRowGroupFilter` to ignore INT96 stats. I think we can probably do that when getting the stats from file metadata, here: https://github.com/apache/iceberg/blob/master/parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java#L97
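To make the idea concrete, here's a minimal sketch of that guard. This is not Iceberg's actual `ParquetMetricsRowGroupFilter` code; the class, method, and parameter names below are hypothetical and only illustrate the approach: treat INT96 stats as missing, so the filter answers "rows might match" instead of attempting a typed comparison that would throw.

```java
// Hypothetical sketch, not Iceberg's real ParquetMetricsRowGroupFilter.
// Illustrates skipping INT96 column stats during row-group evaluation.
class Int96StatsGuard {
  static final boolean ROWS_MIGHT_MATCH = true;

  // Stand-in for evaluating an equality predicate against a column
  // chunk's min/max stats. primitiveTypeName mimics the Parquet
  // physical type of the column.
  static boolean rowsMightMatch(String primitiveTypeName, Long lowerBound,
                                Long upperBound, long literal) {
    // Parquet min/max stats for INT96 are unreliable, and Iceberg keeps
    // no metadata for those columns, so skip the comparison entirely
    // instead of casting the Binary stats value to a long and throwing
    // a ClassCastException.
    if ("INT96".equals(primitiveTypeName) || lowerBound == null || upperBound == null) {
      return ROWS_MIGHT_MATCH; // cannot prune this row group
    }
    // Normal path: prune only when the literal falls outside the bounds.
    return literal >= lowerBound && literal <= upperBound;
  }
}
```

In the real filter, the check would sit where the bounds are read from the Parquet column chunk metadata, before any comparison against the literal.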
On Thu, May 6, 2021 at 10:30 AM Scott Kruger <[email protected]> wrote:

> The github issue has a stack trace (as well as a diff to a unit test that
> illustrates the problem). I can add some more context if needed. The gist
> is that a ClassCastException is thrown because, during the filter step,
> when it’s comparing the value against the upper/lower bounds, it tries to
> compare the literal value (a long from the timestamp filter) to a
> binary/bytearray value (because the converter for int96 returns a byte
> array).
>
> *From:* Ryan Blue <[email protected]>
> *Date:* Thursday, May 6, 2021 at 12:12 PM
> *To:* "[email protected]" <[email protected]>
> *Subject:* Re: Filtering on int96 timestamps
>
> Scott, can you tell me more about where this is failing?
>
> I think we should definitely update Iceberg to ignore INT96 columns when
> filtering. I'm surprised that it isn't already happening since we don't
> have metadata for those columns. Sounds like it may be somewhere in our
> Parquet row group or dictionary filters. A stack trace would help us track
> it down if you can share one.
>
> On Thu, May 6, 2021 at 7:15 AM Scott Kruger <[email protected]> wrote:
>
> Is there any way to prevent the pushdown for int96 columns? At least that
> would prevent Spark from crashing.
>
> (Aside: this is kind of a bummer, as I was able to get the unit test
> passing last night; I guess it wouldn’t work for all cases though.)
> *From:* Ryan Blue <[email protected]>
> *Date:* Wednesday, May 5, 2021 at 7:40 PM
> *To:* "[email protected]" <[email protected]>
> *Subject:* Re: Filtering on int96 timestamps
>
> This is actually quite difficult.
>
> The main problem is that Parquet stats are unreliable for INT96
> timestamps. That means that Iceberg won't have stats information for
> INT96 either, because we get them from Parquet. There is no metadata to
> use for filter pushdown, so I'm not sure what you could do. You might be
> able to build something to read all of the timestamps from a data file
> and produce a min/max range and then write that into Iceberg metadata
> directly. But that's a pretty low-level operation and I'm not sure it
> would be worth the investment.
>
> I think it would be easier just to rewrite the dataset to use the
> timestamp formats that are better supported. You'll get a lot more
> reliable results that way, although it takes a lot of up-front work.
>
> Ryan
>
> On Wed, May 5, 2021 at 3:45 PM Scott Kruger <[email protected]> wrote:
>
> I just submitted a github issue
> (https://github.com/apache/iceberg/issues/2553) related to iceberg’s
> inability to filter on int96 timestamp columns.
> I’d like to take a crack at fixing this, but it feels like perhaps this
> is uncharted territory, due to the backing type in iceberg (Long) not
> matching the backing type from the data (ByteArray).
>
> Is it appropriate to modify ParquetConversions.converterFromParquet to
> add special handling for icebergType = timestamp and parquetType = int96,
> or is a more fundamental change required?

--
Ryan Blue
Tabular
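For anyone following along: whatever shape the fix takes, the INT96 decoding itself is well defined. A Parquet INT96 timestamp is 12 little-endian bytes: 8 bytes of nanoseconds within the day followed by a 4-byte Julian day number. Below is a self-contained sketch of the conversion Scott asks about; it is not Iceberg's actual `ParquetConversions` code, and the class and method names are illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative sketch, not Iceberg's real ParquetConversions code.
class Int96Conversions {
  // Julian day number of the Unix epoch, 1970-01-01.
  private static final long JULIAN_EPOCH_DAY = 2_440_588L;
  private static final long MICROS_PER_DAY = 86_400L * 1_000_000L;

  // Convert a 12-byte Parquet INT96 value (8 bytes nanos-of-day +
  // 4 bytes Julian day, both little-endian) to microseconds since the
  // Unix epoch, Iceberg's backing representation for timestamps.
  static long int96ToEpochMicros(byte[] int96) {
    ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
    long nanosOfDay = buf.getLong();
    long julianDay = Integer.toUnsignedLong(buf.getInt());
    return (julianDay - JULIAN_EPOCH_DAY) * MICROS_PER_DAY + nanosOfDay / 1_000L;
  }
}
```

A converter like this only addresses reading values; as discussed above, the row-group stats for INT96 columns would still have to be ignored rather than converted.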
