prodeezy commented on issue #99: Iceberg fails to return results when filtered on complex columns
URL: https://github.com/apache/incubator-iceberg/issues/99#issuecomment-465026220
 
 
   
   # Vanilla Spark Parquet reader plan
   == Physical Plan ==
   *(1) Project [age#428, name#429, friends#430, location#431]
   +- *(1) Filter (isnotnull(friends#430) && (friends#430[Josh] = 10))
      +- *(1) FileScan parquet [age#428,name#429,friends#430,location#431] Batched: false, Format: Parquet, Location: InMemoryFileIndex[file:/usr/local/spark/test/parquet-people-complex], PartitionFilters: [], PushedFilters: [IsNotNull(friends)], ReadSchema: struct<age:int,name:string,friends:map<string,int>,location:struct<lat:int,lon:int>>
   
   # Iceberg reader plan
   == Physical Plan ==
   *(1) Project [age#33, name#34, friends#35]
   +- *(1) Filter ((friends#35[Josh] = 10) && isnotnull(friends#35))
      +- *(1) ScanV2 iceberg[age#33, name#34, friends#35] (Filters: [isnotnull(friends#35)], Options: [path=iceberg-people-complex2,paths=[]])
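
   For reference, here is a hedged reconstruction of the query behind both
   plans above (the paths come from the plans themselves; the session setup
   and exact column-access syntax are my assumptions):

   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions.col

   val spark = SparkSession.builder().appName("complex-filter-repro").getOrCreate()

   // Vanilla Spark Parquet reader: filter on one entry of the friends map
   spark.read
     .parquet("/usr/local/spark/test/parquet-people-complex")
     .filter(col("friends").getItem("Josh") === 10)
     .explain()

   // Iceberg reader (DataSourceV2): the same filter against the Iceberg table
   spark.read
     .format("iceberg")
     .load("iceberg-people-complex2")
     .filter(col("friends").getItem("Josh") === 10)
     .explain()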
   
   
   # A couple of points:
   1) The complex predicate is not pushed down to the scan in either plan.
   It is deemed "non-translatable" by DataSourceStrategy.translateFilter() [1]
   when Spark tries to convert the Catalyst expression into a data source
   filter (a simplified sketch follows these points). Ryan & Xabriel had a
   discussion earlier on this list about Spark not passing expressions to
   data sources in certain cases; this might be related. One path forward is
   to fix that translation in Spark so that Iceberg's filter conversion has
   a chance to handle complex types. Currently, the Iceberg reader code
   never sees that filter.
   
   2) Although both vanilla Spark and Iceberg evaluate complex-type
   predicates after the scan, this regression comes from post-scan filtering
   returning no results in the Iceberg case. I think the post-scan filter is
   unable to handle the rows as the Iceberg reader produces them. So if 1)
   is not the way forward, the alternative is to fix this in the post-scan
   filtering.
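
   For context on 1), here is a simplified sketch of the shape of
   DataSourceStrategy.translateFilter() [1]; it is not the real Spark code,
   just a condensed illustration of why the complex predicate is dropped:

   import org.apache.spark.sql.catalyst.expressions
   import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, Literal}
   import org.apache.spark.sql.sources

   def translateFilter(predicate: Expression): Option[sources.Filter] = predicate match {
     // Only comparisons between a top-level attribute and a literal match
     // (the real code converts the literal to a Scala value and covers
     // more operators, but the pattern is the same).
     case expressions.EqualTo(a: Attribute, Literal(v, _)) =>
       Some(sources.EqualTo(a.name, v))
     // friends['Josh'] = 10 is EqualTo(GetMapValue(friends, 'Josh'), Literal(10)),
     // so it falls through here and is never offered to the data source.
     case _ => None
   }

   Because this returns None for the map lookup, only IsNotNull(friends)
   appears in PushedFilters above, and the complex predicate stays in the
   post-scan Filter node.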
   
   
   [1] - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L450
