GitHub user shadowmmu created a discussion: [Performance] Velox Bloom Filter Inefficiency vs. Photon at 1TB Scale
While benchmarking TPC-H 1TB on Gluten+Velox, we identified a major performance bottleneck compared to Databricks Photon in shuffle-heavy queries (Q8, Q9, Q17, Q18, Q21).

**The Evidence**

* **Photon:** Effectively prunes **95-99%** of `lineitem` records using Bloom Filters, keeping disk I/O below 10%.
* **Velox:** Shows negligible filtering, leading to **30-50% disk I/O**.
* **Total Runtime Impact:** Photon achieved a **1.83x speedup** overall, but when I/O-heavy queries are excluded, the gap narrows to **1.13x**.

**The Root Cause**

Spark's default Bloom Filter limit is 10MB. When we raised that limit to 1GB to make the configuration comparable with Databricks Photon, a Bloom Filter was created after the config change, but its filtering effect remained negligible. The Velox backend appears limited by a hardcoded Bloom Filter size of **4,194,304 bits**. This limit is too low to maintain a low false-positive rate at 1TB-scale cardinality, rendering the filter ineffective.

**Questions for Maintainers**

1. Are there plans to allow `maxNumBits` for Bloom Filters to scale beyond the current hardcoded limit?
2. Why does Velox fail to generate/utilize the filter as effectively as Photon at this scale?

Spark Plan for Q17

Databricks

<img width="2110" height="5875" alt="Historical Spark UI for cluster 5202-183210-uyaoq4aw, driver 7894865519245697362 - Details for Query 19" src="https://github.com/user-attachments/assets/82cfd588-ffd5-4394-8cbe-4acff09bcb1f" />

Velox

<img width="3366" height="15080" alt="localhost_18080_history_app-20260202185100-0001_SQL_execution__id=266" src="https://github.com/user-attachments/assets/2317597d-4b89-4a4e-99ce-ed1e0fa78052" />

GitHub link: https://github.com/apache/incubator-gluten/discussions/11554
----

This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]
