GitHub user shadowmmu created a discussion: [Performance] Velox Bloom Filter 
Inefficiency vs. Photon at 1TB Scale

While benchmarking TPC-H 1TB on Gluten+Velox, we identified a major performance 
bottleneck compared to Databricks Photon in shuffle-heavy queries (Q8, Q9, Q17, 
Q18, Q21).

**The Evidence**

* **Photon:** Effectively prunes **95-99%** of `lineitem` records using Bloom 
Filters, keeping Disk I/O below 10%.
* **Velox:** Shows negligible filtering, leading to **30-50% Disk I/O**.
* **Total Runtime Impact:** Photon achieved a **1.83x speedup** overall, but 
when I/O-heavy queries are excluded, the gap narrows to **1.13x**.

**The Root Cause**
Spark's default limit for Bloom Filters is 10MB. We raised it to 1GB to make 
the setup comparable with Databricks Photon. After the config change the Bloom 
Filter is created, but the filtering it achieves is negligible.

The Velox backend seems limited by a hardcoded Bloom Filter size of **4,194,304 
bits**. This limit appears too low to maintain a low false-positive rate at 1TB 
cardinality, rendering the filter ineffective.
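
To make the sizing argument concrete, here is a minimal sketch using the textbook Bloom Filter approximation `p = (1 - e^(-k*n/m))^k`. It is not the Velox or Spark implementation, which may use a different layout and hashing scheme. The key counts in the loop are hypothetical placeholders, not measurements from our runs, and the function names are ours.

```python
import math

# Bit limit quoted above for the Velox backend.
VELOX_MAX_BITS = 4_194_304


def optimal_num_hashes(num_bits: int, num_items: int) -> int:
    """Hash count that minimizes the false-positive rate, k = (m/n) * ln 2."""
    return max(1, round(num_bits / num_items * math.log(2)))


def false_positive_rate(num_bits: int, num_items: int) -> float:
    """Classic approximation p = (1 - exp(-k*n/m))^k with the optimal k."""
    k = optimal_num_hashes(num_bits, num_items)
    return (1.0 - math.exp(-k * num_items / num_bits)) ** k


def required_bits(num_items: int, target_fpp: float) -> int:
    """Bits needed for a target rate, m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-num_items * math.log(target_fpp) / math.log(2) ** 2)


# Hypothetical build-side key counts, chosen only to show the trend.
for n in (100_000, 1_000_000, 10_000_000, 100_000_000):
    fpp = false_positive_rate(VELOX_MAX_BITS, n)
    need_mib = required_bits(n, 0.01) / 8 / 1024 / 1024
    print(f"n={n:>11,}  fpp at 4,194,304 bits={fpp:.4f}  "
          f"size for 1% fpp={need_mib:.1f} MiB")
```

Under this model, a 4,194,304-bit filter gives a false-positive rate of roughly 13% at one million distinct keys and above 90% at ten million. If the build side at 1TB carries that many keys, a filter this size would pass almost every probe row, which would be consistent with the negligible pruning we observe.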

**Questions for Maintainers**

1. Are there plans to allow the `maxNumBits` for Bloom Filters to scale beyond 
the current hardcoded limit?
2. Why is Velox failing to generate/utilize the filter as effectively as Photon 
at this scale?

Spark Plan for Q17 

Databricks 
<img width="2110" height="5875" alt="Historical Spark UI for cluster 
5202-183210-uyaoq4aw, driver 7894865519245697362 - Details for Query 19" 
src="https://github.com/user-attachments/assets/82cfd588-ffd5-4394-8cbe-4acff09bcb1f"
 />


Velox
<img width="3366" height="15080" 
alt="localhost_18080_history_app-20260202185100-0001_SQL_execution__id=266" 
src="https://github.com/user-attachments/assets/2317597d-4b89-4a4e-99ce-ed1e0fa78052"
 />


GitHub link: https://github.com/apache/incubator-gluten/discussions/11554
