Artem Kupchinskiy created SPARK-52868:
-----------------------------------------

             Summary: CBO: OOM-risky iceberg table stats underestimation for 
some filters
                 Key: SPARK-52868
                 URL: https://issues.apache.org/jira/browse/SPARK-52868
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 4.0.0, 3.5.6
            Reporter: Artem Kupchinskiy


Some sources, iceberg for instance, provide limited set of column stats. In 
particular, there is still [no min and max stats 
implementation][[https://github.com/apache/iceberg/issues/11083]] 

And, current filter estimation logic for EqualTo and InSet predicate have a bug 
estimating row count based on column stats where min/max is empty but there is 
distinctCount defined (exactly iceberg case) as zero. That might cause OOM due 
to broadcasting poorly estimated tables as well as suboptimal order in join 
chains after reorder. 

 

Some data sources like Iceberg provide only partial column statistics. 
Specifically, Iceberg currently [lacks min/max stats 
implementation][[https://github.com/apache/iceberg/issues/11083]] , though it 
does provide distinctCount values.

The current filter estimation logic for EqualTo and InSet predicates contains a 
bug: when min/max stats are unavailable but distinctCount is present (the 
typical Iceberg scenario), it incorrectly estimates the filtered row count as 
zero. This causes two major issues:
 # *Broadcast OOM failures* - Large tables get misclassified as small and 
broadcast to all executors
 # *Suboptimal join ordering* - The join reordering logic makes poor decisions 
based on these incorrect estimates

Instead of assuming zero selectivity when min/max bounds are missing, the 
estimation should remain undefined, allowing the optimizer to fall back to 
safer strategies.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to