Artem Kupchinskiy created SPARK-52868: -----------------------------------------
Summary: CBO: OOM-risky iceberg table stats underestimation for some filters Key: SPARK-52868 URL: https://issues.apache.org/jira/browse/SPARK-52868 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0, 3.5.6 Reporter: Artem Kupchinskiy Some sources, iceberg for instance, provide limited set of column stats. In particular, there is still [no min and max stats implementation][[https://github.com/apache/iceberg/issues/11083]] And, current filter estimation logic for EqualTo and InSet predicate have a bug estimating row count based on column stats where min/max is empty but there is distinctCount defined (exactly iceberg case) as zero. That might cause OOM due to broadcasting poorly estimated tables as well as suboptimal order in join chains after reorder. Some data sources like Iceberg provide only partial column statistics. Specifically, Iceberg currently [lacks min/max stats implementation][[https://github.com/apache/iceberg/issues/11083]] , though it does provide distinctCount values. The current filter estimation logic for EqualTo and InSet predicates contains a bug: when min/max stats are unavailable but distinctCount is present (the typical Iceberg scenario), it incorrectly estimates the filtered row count as zero. This causes two major issues: # *Broadcast OOM failures* - Large tables get misclassified as small and broadcast to all executors # *Suboptimal join ordering* - The join reordering logic makes poor decisions based on these incorrect estimates Instead of assuming zero selectivity when min/max bounds are missing, the estimation should remain undefined, allowing the optimizer to fall back to safer strategies. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org