psvri commented on PR #15070:
URL: https://github.com/apache/iceberg/pull/15070#issuecomment-3986957475

   Hello,
   
   Really sorry for the late reply, things got in the way. The issue happens 
when a data file has a non-zero NaN count together with a lower bound and an 
upper bound. The PySpark snippet below reproduces it.
   
   ```
   from pyspark.sql import SparkSession

   spark = SparkSession.builder \
       .appName("HadoopCatalogExample") \
       .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
       .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkCatalog") \
       .config("spark.sql.catalog.spark_catalog.type", "hadoop") \
       .config("spark.sql.catalog.spark_catalog.warehouse", "/tmp/iceberg_warehouse") \
       .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1") \
       .getOrCreate()

   table_name = 'temp'
   spark.sql(f"CREATE TABLE {table_name} (id int, data float) USING iceberg PARTITIONED BY (id)").show()

   spark.sql(f"INSERT INTO {table_name} VALUES (2, 2), (2, float('nan'))")

   spark.sql(f"select readable_metrics.data.* from default.{table_name}.files").show(truncate=False, vertical=True)
   
   -RECORD 0---------------
    column_size      | 49  
    value_count      | 2   
    null_value_count | 0   
    nan_value_count  | 1   
    lower_bound      | 2.0 
    upper_bound      | 2.0 
    
   spark.conf.set("spark.sql.iceberg.aggregate-push-down.enabled", "true")
   spark.sql(f"select max(data) from {table_name}").show()
   
   +---------+
   |max(data)|
   +---------+
   |      2.0|
   +---------+
   
   spark.conf.set("spark.sql.iceberg.aggregate-push-down.enabled", "false")
   spark.sql(f"select max(data) from {table_name}").show()
   
   +---------+
   |max(data)|
   +---------+
   |      NaN|
   +---------+
   ```
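
   To make the failure mode concrete, here is a minimal pure-Python model (hypothetical names, not Iceberg code) of why the pushed-down result is 2.0: the column bounds exclude NaN, so a max computed only from the upper bound misses the NaN, while Spark's float ordering treats NaN as larger than any other value.

   ```python
   import math

   # Hypothetical model of one data file's column stats, mirroring the
   # readable_metrics output above: one NaN value, bounds exclude NaN.
   file_stats = {"value_count": 2, "nan_value_count": 1,
                 "lower_bound": 2.0, "upper_bound": 2.0}

   def max_from_bounds(stats):
       # What a bounds-only pushdown reports: just the stored upper bound.
       return stats["upper_bound"]

   def spark_max(values):
       # Spark orders NaN above every other float, so a full scan
       # returns NaN whenever any value is NaN.
       return float("nan") if any(math.isnan(v) for v in values) else max(values)

   print(max_from_bounds(file_stats))         # bounds-based answer: 2.0
   print(spark_max([2.0, float("nan")]))      # full-scan answer: nan
   ```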
   
   Now, on why the bug was not triggered by the existing unit test case: if we 
inspect the readable metrics of the data files created in that test, we see the 
following.
   
   ```
   -RECORD 0----------------
    column_size      | 78   
    value_count      | 2    
    null_value_count | 0    
    nan_value_count  | 2    
    lower_bound      | NULL 
    upper_bound      | NULL 
   -RECORD 1----------------
    column_size      | 49   
    value_count      | 2    
    null_value_count | 0    
    nan_value_count  | 1    
    lower_bound      | 2.0  
    upper_bound      | 2.0  
   -RECORD 2----------------
    column_size      | 49   
    value_count      | 2    
    null_value_count | 0    
    nan_value_count  | 1    
    lower_bound      | 1.0  
    upper_bound      | 1.0 
   ```
   
   So basically, we have one file whose lower bound and upper bound are null. 
This means 
[MaxAggregate.hasValue](https://github.com/apache/iceberg/blob/cf25bed6f41627e6d8d343ef63845d88209b77fd/api/src/main/java/org/apache/iceberg/expressions/MaxAggregate.java#L42)
 returns false for that file, which in turn sets `isValid` to false 
[here](https://github.com/apache/iceberg/blob/cf25bed6f41627e6d8d343ef63845d88209b77fd/api/src/main/java/org/apache/iceberg/expressions/BoundAggregate.java#L151). 
Next, in 
[SparkScanBuilder](https://github.com/apache/iceberg/blob/cf25bed6f41627e6d8d343ef63845d88209b77fd/spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java#L149)
 we [check whether each aggregate is 
valid](https://github.com/apache/iceberg/blob/cf25bed6f41627e6d8d343ef63845d88209b77fd/api/src/main/java/org/apache/iceberg/expressions/AggregateEvaluator.java#L90). 
Because `isValid` is false, [this 
return](https://github.com/apache/iceberg/blob/cf25bed6f41627e6d8d343ef63845d88209b77fd/api/src/main/java/org/apache/iceberg/expressions/BoundAggregate.java#L172)
 makes the check fail, and hence aggregate pushdown doesn't happen.
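
   The gating described above can be sketched as a small Python model (hypothetical names; the real logic lives in the linked Java classes): `hasValue` is false for any file whose bounds are missing, and once that happens `isValid` goes false, so the scan builder refuses to push the aggregate down.

   ```python
   # Simplified model of the MaxAggregate / BoundAggregate validity tracking
   # (hypothetical Python names; the real code is the Java linked above).
   class MaxAggregateModel:
       def __init__(self):
           self.is_valid = True

       def has_value(self, file_stats):
           # MaxAggregate.hasValue: false when the file has no upper bound.
           return file_stats["upper_bound"] is not None

       def update(self, file_stats):
           # One file without bounds invalidates the whole aggregate.
           if not self.has_value(file_stats):
               self.is_valid = False

   # Files from the original unit test: the first has null bounds.
   ut_files = [{"upper_bound": None}, {"upper_bound": 2.0}, {"upper_bound": 1.0}]
   agg = MaxAggregateModel()
   for f in ut_files:
       agg.update(f)
   print(agg.is_valid)  # False -> pushdown is skipped, so the bug stays hidden
   ```

   Once every file carries bounds, `is_valid` stays true and pushdown proceeds, which is exactly the situation in the modified test below.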
   
   
   Now, the reason my one-line change in the UT triggers the bug is that, after 
the change, we no longer encounter a file that contains only NaN values and 
therefore has no lower or upper bound. Because all files have lower and upper 
bounds, as shown below, `isValid` is never set to false, so the check in 
`SparkScanBuilder` passes and pushdown happens.
   
   ```
   -RECORD 0----------------
    column_size      | 83   
    value_count      | 3    
    null_value_count | 0    
    nan_value_count  | 2    
    lower_bound      | 10.0 
    upper_bound      | 10.0 
   -RECORD 1----------------
    column_size      | 49   
    value_count      | 2    
    null_value_count | 0    
    nan_value_count  | 1    
    lower_bound      | 2.0  
    upper_bound      | 2.0  
   -RECORD 2----------------
    column_size      | 49   
    value_count      | 2    
    null_value_count | 0    
    nan_value_count  | 1    
    lower_bound      | 1.0  
    upper_bound      | 1.0 
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

