This is an automated email from the ASF dual-hosted git repository. github-bot pushed a commit to branch asf-staging in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
The following commit(s) were added to refs/heads/asf-staging by this push: new 7d39bf6 Commit build products 7d39bf6 is described below commit 7d39bf61eab0f221b7e2ecb004cba056028339c2 Author: Build Pelican (action) <priv...@infra.apache.org> AuthorDate: Tue Sep 9 11:28:55 2025 +0000 Commit build products --- blog/2025/09/10/dynamic-filters/index.html | 40 ++++++++++++++-------- ...aracco-pydantic-andrew-lamb-influxdata.atom.xml | 40 ++++++++++++++-------- blog/feeds/all-en.atom.xml | 40 ++++++++++++++-------- blog/feeds/blog.atom.xml | 40 ++++++++++++++-------- 4 files changed, 100 insertions(+), 60 deletions(-) diff --git a/blog/2025/09/10/dynamic-filters/index.html b/blog/2025/09/10/dynamic-filters/index.html index c6e3b6d..9734783 100644 --- a/blog/2025/09/10/dynamic-filters/index.html +++ b/blog/2025/09/10/dynamic-filters/index.html @@ -292,31 +292,41 @@ improve hash joins by implementing a technique called <a href="https://15721.cou passing</a>, which is similar to <a href="https://issues.apache.org/jira/browse/SPARK-32268">Bloom filter joins</a> in Apache Spark. See <a href="https://github.com/apache/datafusion/issues/7955">issue #7955</a> for more details.</p> <p>In a Hash Join, the query engine picks one input of the join to be the "build" -input and the other input to be the "probe" side. A hash join first <em>builds</em> a -hash table by reading the build input into memory, and then it reads the probe -input, using the hash table to find matching rows from the probe side. Many hash -joins are very selective (only a small number of rows are matched) and act as -filters, so it is natural to use the same dynamic filter technique to create -filters on the probe side scan with the values seen on the build side.</p> -<p>DataFusion 50.0.0 adds dynamic filters to the probe input using min/max join -key values from the build side. 
This simple approach is fast to evaluate and the -filter improves performance significantly when combined with statistics pruning, -late materialization, and other optimizations as shown in Figure 7.</p> +input and the other input to be the "probe" side.</p> +<ul> +<li> +<p>First, the <strong>build side</strong> is loaded into memory, and turned into a hash table.</p> +</li> +<li> +<p>Then, the <strong>probe side</strong> is scanned, and matching rows are found by looking + in the hash table. Non-matching rows are discarded and thus joins often act as + filters.</p> +</li> +</ul> +<p>Many hash joins are very selective (only a small number of rows are matched), so +it is natural to use the same dynamic filter technique. DataFusion 50.0.0 pushes +down knowledge of what keys exist on the build side into the scan of the probe +side with a dynamic filter based on min/max join key values. For example, if the +build side only has keys in the range <code>[100, 200]</code>, then DataFusion will filter +all probe rows with keys outside that range during the scan.</p> +<p>This simple approach is fast to evaluate and the filter improves performance +significantly when combined with statistics pruning, late materialization, and +other optimizations as shown in Figure 7.</p> <div class="text-center"> <img alt="Join Performance Improvements with Dynamic Filters" class="img-responsive" src="/blog/images/dynamic-filters/join-performance.svg" width="80%"/> </div> <p><strong>Figure 7</strong>: Join performance with and without dynamic filters. In DataFusion -49.0.2 the join takes 2.5s, even with late materialization enabled. In +49.0.2 the join takes 2.5s, even with late materialization (LM) enabled. In DataFusion 50.0.0 with dynamic filters enabled (the default), the join takes only 0.7s, a 5x improvement. With both dynamic filters and late materialization, DataFusion 50.0.0 takes 0.1s, a 25x improvement. 
See this <a href="https://github.com/apache/datafusion-site/pull/103#issuecomment-3262612288">discussion</a> for more details.</p> <p>You can see dynamic join filters in action with the following example. </p> <pre><code class="language-sql">-- create two tables: small_table with 1K rows and large_table with 100K rows -copy (select i as k, i as v from generate_series(1, 1000) t(i)) to 'small_table.parquet'; -create external table small_table stored as parquet location 'small_table.parquet'; -copy (select i as k from generate_series(1, 100000) t(i)) to 'large_table.parquet'; -create external table large_table stored as parquet location 'large_table.parquet'; +COPY (SELECT i as k, i as v FROM generate_series(1, 1000) t(i)) TO 'small_table.parquet'; +CREATE EXTERNAL TABLE small_table STORED AS PARQUET LOCATION 'small_table.parquet'; +COPY (SELECT i as k FROM generate_series(1, 100000) t(i)) TO 'large_table.parquet'; +CREATE EXTERNAL TABLE large_table STORED AS PARQUET LOCATION 'large_table.parquet'; -- Join the two tables, with a filter on small_table EXPLAIN diff --git a/blog/feeds/adrian-garcia-badaracco-pydantic-andrew-lamb-influxdata.atom.xml b/blog/feeds/adrian-garcia-badaracco-pydantic-andrew-lamb-influxdata.atom.xml index 64cdffa..5c7cfa9 100644 --- a/blog/feeds/adrian-garcia-badaracco-pydantic-andrew-lamb-influxdata.atom.xml +++ b/blog/feeds/adrian-garcia-badaracco-pydantic-andrew-lamb-influxdata.atom.xml @@ -276,31 +276,41 @@ improve hash joins by implementing a technique called <a href="https://15721. passing</a>, which is similar to <a href="https://issues.apache.org/jira/browse/SPARK-32268">Bloom filter joins</a> in Apache Spark. See <a href="https://github.com/apache/datafusion/issues/7955">issue #7955</a> for more details.</p> <p>In a Hash Join, the query engine picks one input of the join to be the "build" -input and the other input to be the "probe" side. 
A hash join first <em>builds</em> a -hash table by reading the build input into memory, and then it reads the probe -input, using the hash table to find matching rows from the probe side. Many hash -joins are very selective (only a small number of rows are matched) and act as -filters, so it is natural to use the same dynamic filter technique to create -filters on the probe side scan with the values seen on the build side.</p> -<p>DataFusion 50.0.0 adds dynamic filters to the probe input using min/max join -key values from the build side. This simple approach is fast to evaluate and the -filter improves performance significantly when combined with statistics pruning, -late materialization, and other optimizations as shown in Figure 7.</p> +input and the other input to be the "probe" side.</p> +<ul> +<li> +<p>First, the <strong>build side</strong> is loaded into memory, and turned into a hash table.</p> +</li> +<li> +<p>Then, the <strong>probe side</strong> is scanned, and matching rows are found by looking + in the hash table. Non-matching rows are discarded and thus joins often act as + filters.</p> +</li> +</ul> +<p>Many hash joins are very selective (only a small number of rows are matched), so +it is natural to use the same dynamic filter technique. DataFusion 50.0.0 pushes +down knowledge of what keys exist on the build side into the scan of the probe +side with a dynamic filter based on min/max join key values. 
For example, if the +build side only has keys in the range <code>[100, 200]</code>, then DataFusion will filter +all probe rows with keys outside that range during the scan.</p> +<p>This simple approach is fast to evaluate and the filter improves performance +significantly when combined with statistics pruning, late materialization, and +other optimizations as shown in Figure 7.</p> <div class="text-center"> <img alt="Join Performance Improvements with Dynamic Filters" class="img-responsive" src="/blog/images/dynamic-filters/join-performance.svg" width="80%"/> </div> <p><strong>Figure 7</strong>: Join performance with and without dynamic filters. In DataFusion -49.0.2 the join takes 2.5s, even with late materialization enabled. In +49.0.2 the join takes 2.5s, even with late materialization (LM) enabled. In DataFusion 50.0.0 with dynamic filters enabled (the default), the join takes only 0.7s, a 5x improvement. With both dynamic filters and late materialization, DataFusion 50.0.0 takes 0.1s, a 25x improvement. See this <a href="https://github.com/apache/datafusion-site/pull/103#issuecomment-3262612288">discussion</a> for more details.</p> <p>You can see dynamic join filters in action with the following example. 
</p> <pre><code class="language-sql">-- create two tables: small_table with 1K rows and large_table with 100K rows -copy (select i as k, i as v from generate_series(1, 1000) t(i)) to 'small_table.parquet'; -create external table small_table stored as parquet location 'small_table.parquet'; -copy (select i as k from generate_series(1, 100000) t(i)) to 'large_table.parquet'; -create external table large_table stored as parquet location 'large_table.parquet'; +COPY (SELECT i as k, i as v FROM generate_series(1, 1000) t(i)) TO 'small_table.parquet'; +CREATE EXTERNAL TABLE small_table STORED AS PARQUET LOCATION 'small_table.parquet'; +COPY (SELECT i as k FROM generate_series(1, 100000) t(i)) TO 'large_table.parquet'; +CREATE EXTERNAL TABLE large_table STORED AS PARQUET LOCATION 'large_table.parquet'; -- Join the two tables, with a filter on small_table EXPLAIN diff --git a/blog/feeds/all-en.atom.xml b/blog/feeds/all-en.atom.xml index 5cb00a5..3a0f672 100644 --- a/blog/feeds/all-en.atom.xml +++ b/blog/feeds/all-en.atom.xml @@ -276,31 +276,41 @@ improve hash joins by implementing a technique called <a href="https://15721. passing</a>, which is similar to <a href="https://issues.apache.org/jira/browse/SPARK-32268">Bloom filter joins</a> in Apache Spark. See <a href="https://github.com/apache/datafusion/issues/7955">issue #7955</a> for more details.</p> <p>In a Hash Join, the query engine picks one input of the join to be the "build" -input and the other input to be the "probe" side. A hash join first <em>builds</em> a -hash table by reading the build input into memory, and then it reads the probe -input, using the hash table to find matching rows from the probe side. 
Many hash -joins are very selective (only a small number of rows are matched) and act as -filters, so it is natural to use the same dynamic filter technique to create -filters on the probe side scan with the values seen on the build side.</p> -<p>DataFusion 50.0.0 adds dynamic filters to the probe input using min/max join -key values from the build side. This simple approach is fast to evaluate and the -filter improves performance significantly when combined with statistics pruning, -late materialization, and other optimizations as shown in Figure 7.</p> +input and the other input to be the "probe" side.</p> +<ul> +<li> +<p>First, the <strong>build side</strong> is loaded into memory, and turned into a hash table.</p> +</li> +<li> +<p>Then, the <strong>probe side</strong> is scanned, and matching rows are found by looking + in the hash table. Non-matching rows are discarded and thus joins often act as + filters.</p> +</li> +</ul> +<p>Many hash joins are very selective (only a small number of rows are matched), so +it is natural to use the same dynamic filter technique. DataFusion 50.0.0 pushes +down knowledge of what keys exist on the build side into the scan of the probe +side with a dynamic filter based on min/max join key values. For example, if the +build side only has keys in the range <code>[100, 200]</code>, then DataFusion will filter +all probe rows with keys outside that range during the scan.</p> +<p>This simple approach is fast to evaluate and the filter improves performance +significantly when combined with statistics pruning, late materialization, and +other optimizations as shown in Figure 7.</p> <div class="text-center"> <img alt="Join Performance Improvements with Dynamic Filters" class="img-responsive" src="/blog/images/dynamic-filters/join-performance.svg" width="80%"/> </div> <p><strong>Figure 7</strong>: Join performance with and without dynamic filters. In DataFusion -49.0.2 the join takes 2.5s, even with late materialization enabled. 
In +49.0.2 the join takes 2.5s, even with late materialization (LM) enabled. In DataFusion 50.0.0 with dynamic filters enabled (the default), the join takes only 0.7s, a 5x improvement. With both dynamic filters and late materialization, DataFusion 50.0.0 takes 0.1s, a 25x improvement. See this <a href="https://github.com/apache/datafusion-site/pull/103#issuecomment-3262612288">discussion</a> for more details.</p> <p>You can see dynamic join filters in action with the following example. </p> <pre><code class="language-sql">-- create two tables: small_table with 1K rows and large_table with 100K rows -copy (select i as k, i as v from generate_series(1, 1000) t(i)) to 'small_table.parquet'; -create external table small_table stored as parquet location 'small_table.parquet'; -copy (select i as k from generate_series(1, 100000) t(i)) to 'large_table.parquet'; -create external table large_table stored as parquet location 'large_table.parquet'; +COPY (SELECT i as k, i as v FROM generate_series(1, 1000) t(i)) TO 'small_table.parquet'; +CREATE EXTERNAL TABLE small_table STORED AS PARQUET LOCATION 'small_table.parquet'; +COPY (SELECT i as k FROM generate_series(1, 100000) t(i)) TO 'large_table.parquet'; +CREATE EXTERNAL TABLE large_table STORED AS PARQUET LOCATION 'large_table.parquet'; -- Join the two tables, with a filter on small_table EXPLAIN diff --git a/blog/feeds/blog.atom.xml b/blog/feeds/blog.atom.xml index 200124d..ebab74f 100644 --- a/blog/feeds/blog.atom.xml +++ b/blog/feeds/blog.atom.xml @@ -276,31 +276,41 @@ improve hash joins by implementing a technique called <a href="https://15721. passing</a>, which is similar to <a href="https://issues.apache.org/jira/browse/SPARK-32268">Bloom filter joins</a> in Apache Spark. See <a href="https://github.com/apache/datafusion/issues/7955">issue #7955</a> for more details.</p> <p>In a Hash Join, the query engine picks one input of the join to be the "build" -input and the other input to be the "probe" side. 
A hash join first <em>builds</em> a -hash table by reading the build input into memory, and then it reads the probe -input, using the hash table to find matching rows from the probe side. Many hash -joins are very selective (only a small number of rows are matched) and act as -filters, so it is natural to use the same dynamic filter technique to create -filters on the probe side scan with the values seen on the build side.</p> -<p>DataFusion 50.0.0 adds dynamic filters to the probe input using min/max join -key values from the build side. This simple approach is fast to evaluate and the -filter improves performance significantly when combined with statistics pruning, -late materialization, and other optimizations as shown in Figure 7.</p> +input and the other input to be the "probe" side.</p> +<ul> +<li> +<p>First, the <strong>build side</strong> is loaded into memory, and turned into a hash table.</p> +</li> +<li> +<p>Then, the <strong>probe side</strong> is scanned, and matching rows are found by looking + in the hash table. Non-matching rows are discarded and thus joins often act as + filters.</p> +</li> +</ul> +<p>Many hash joins are very selective (only a small number of rows are matched), so +it is natural to use the same dynamic filter technique. DataFusion 50.0.0 pushes +down knowledge of what keys exist on the build side into the scan of the probe +side with a dynamic filter based on min/max join key values. 
For example, if the +build side only has keys in the range <code>[100, 200]</code>, then DataFusion will filter +all probe rows with keys outside that range during the scan.</p> +<p>This simple approach is fast to evaluate and the filter improves performance +significantly when combined with statistics pruning, late materialization, and +other optimizations as shown in Figure 7.</p> <div class="text-center"> <img alt="Join Performance Improvements with Dynamic Filters" class="img-responsive" src="/blog/images/dynamic-filters/join-performance.svg" width="80%"/> </div> <p><strong>Figure 7</strong>: Join performance with and without dynamic filters. In DataFusion -49.0.2 the join takes 2.5s, even with late materialization enabled. In +49.0.2 the join takes 2.5s, even with late materialization (LM) enabled. In DataFusion 50.0.0 with dynamic filters enabled (the default), the join takes only 0.7s, a 5x improvement. With both dynamic filters and late materialization, DataFusion 50.0.0 takes 0.1s, a 25x improvement. See this <a href="https://github.com/apache/datafusion-site/pull/103#issuecomment-3262612288">discussion</a> for more details.</p> <p>You can see dynamic join filters in action with the following example. 
</p> <pre><code class="language-sql">-- create two tables: small_table with 1K rows and large_table with 100K rows -copy (select i as k, i as v from generate_series(1, 1000) t(i)) to 'small_table.parquet'; -create external table small_table stored as parquet location 'small_table.parquet'; -copy (select i as k from generate_series(1, 100000) t(i)) to 'large_table.parquet'; -create external table large_table stored as parquet location 'large_table.parquet'; +COPY (SELECT i as k, i as v FROM generate_series(1, 1000) t(i)) TO 'small_table.parquet'; +CREATE EXTERNAL TABLE small_table STORED AS PARQUET LOCATION 'small_table.parquet'; +COPY (SELECT i as k FROM generate_series(1, 100000) t(i)) TO 'large_table.parquet'; +CREATE EXTERNAL TABLE large_table STORED AS PARQUET LOCATION 'large_table.parquet'; -- Join the two tables, with a filter on small_table EXPLAIN