prodeezy commented on issue #99: Iceberg fails to return results when filtered on complex columns
URL: https://github.com/apache/incubator-iceberg/issues/99#issuecomment-465026220

# Vanilla Spark Parquet reader plan

== Physical Plan ==
*(1) Project [age#428, name#429, friends#430, location#431]
+- *(1) Filter (isnotnull(friends#430) && (friends#430[Josh] = 10))
   +- *(1) FileScan parquet [age#428,name#429,friends#430,location#431] Batched: false, Format: Parquet, Location: InMemoryFileIndex[file:/usr/local/spark/test/parquet-people-complex], PartitionFilters: [], PushedFilters: [IsNotNull(friends)], ReadSchema: struct<age:int,name:string,friends:map<string,int>,location:struct<lat:int,lon:int>>

# Iceberg Plan

== Physical Plan ==
*(1) Project [age#33, name#34, friends#35]
+- *(1) Filter ((friends#35[Josh] = 10) && isnotnull(friends#35))
   +- *(1) ScanV2 iceberg[age#33, name#34, friends#35] (Filters: [isnotnull(friends#35)], Options: [path=iceberg-people-complex2,paths=[]])

# A couple of points:

1) The complex predicate is not passed down to the scan level in either plan. DataSourceStrategy.translateFilter() [1] treats the complex predicate as "non-translatable" when it tries to convert the Catalyst expression to a data source filter. Ryan & Xabriel had a discussion earlier on this list about Spark not passing expressions to the data source (in certain cases). This might be related to that. Maybe a path forward is to fix that translation in Spark so that Iceberg's filter conversion has a chance to handle complex types. Currently the Iceberg reader code is unaware of that filter.

2) Although both vanilla Spark and Iceberg handle complex type predicates post scan, this regression is caused by post-scan filtering not returning results in the Iceberg case. I think post-scan filtering is unable to handle the Iceberg format. So if 1) is not the way forward, the alternative is to fix this in the post-scan filtering.

[1] - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L450
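
To make point 1 concrete, here is a toy Python model of the split seen in both plans: the null check reaches the scan, while the map lookup does not. It is only a sketch under stated assumptions. Every class and function name below (Column, MapValue, translate_filter, plan_filters, and so on) is a hypothetical stand-in and not Spark's or Iceberg's API; the real translateFilter() covers far more expression types.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple

# Hypothetical stand-ins for Catalyst expression nodes (not Spark classes).
@dataclass(frozen=True)
class Column:
    name: str

@dataclass(frozen=True)
class MapValue:          # models friends['Josh']
    column: Column
    key: str

@dataclass(frozen=True)
class IsNotNull:
    child: Any

@dataclass(frozen=True)
class EqualTo:
    left: Any
    right: Any

@dataclass(frozen=True)
class And:
    left: Any
    right: Any

def split_conjuncts(expr: Any) -> List[Any]:
    """Flatten nested And nodes into a list of independent predicates."""
    if isinstance(expr, And):
        return split_conjuncts(expr.left) + split_conjuncts(expr.right)
    return [expr]

def translate_filter(expr: Any) -> Optional[str]:
    """Return a source-level filter, or None if the predicate is not translatable.

    Assumption of this toy model: only predicates on a plain top-level
    column are translatable; anything reaching into a map is not.
    """
    if isinstance(expr, IsNotNull) and isinstance(expr.child, Column):
        return f"IsNotNull({expr.child.name})"
    if isinstance(expr, EqualTo) and isinstance(expr.left, Column):
        return f"EqualTo({expr.left.name},{expr.right!r})"
    return None

def plan_filters(condition: Any) -> Tuple[List[str], List[Any]]:
    """Split a condition into (filters the source sees, predicates it never sees)."""
    pushed: List[str] = []
    untranslated: List[Any] = []
    for pred in split_conjuncts(condition):
        translated = translate_filter(pred)
        if translated is None:
            untranslated.append(pred)   # only the post-scan Filter can apply this
        else:
            pushed.append(translated)
    return pushed, untranslated

# friends IS NOT NULL AND friends['Josh'] = 10
condition = And(
    IsNotNull(Column("friends")),
    EqualTo(MapValue(Column("friends"), "Josh"), 10),
)
pushed, untranslated = plan_filters(condition)
print("pushed to source:", pushed)
print("left for post-scan filter:", untranslated)
```

In the plans above the Filter node still evaluates every predicate after the scan; the model only tracks which predicates the data source is ever told about. Under that reading, the reader cannot act on the map lookup unless the translation step is extended, which is the option raised in point 1.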
