(datafusion-site) branch asf-staging updated: Commit build products

github-bot Tue, 09 Sep 2025 04:29:10 -0700

This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a commit to branch asf-staging
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git



The following commit(s) were added to refs/heads/asf-staging by this push:
     new 7d39bf6  Commit build products
7d39bf6 is described below

commit 7d39bf61eab0f221b7e2ecb004cba056028339c2
Author: Build Pelican (action) <priv...@infra.apache.org>
AuthorDate: Tue Sep 9 11:28:55 2025 +0000

    Commit build products
---
 blog/2025/09/10/dynamic-filters/index.html         | 40 ++++++++++++++--------
 ...aracco-pydantic-andrew-lamb-influxdata.atom.xml | 40 ++++++++++++++--------
 blog/feeds/all-en.atom.xml                         | 40 ++++++++++++++--------
 blog/feeds/blog.atom.xml                           | 40 ++++++++++++++--------
 4 files changed, 100 insertions(+), 60 deletions(-)

diff --git a/blog/2025/09/10/dynamic-filters/index.html b/blog/2025/09/10/dynamic-filters/index.html
index c6e3b6d..9734783 100644
--- a/blog/2025/09/10/dynamic-filters/index.html
+++ b/blog/2025/09/10/dynamic-filters/index.html
@@ -292,31 +292,41 @@ improve hash joins by implementing a technique called <a href="https://15721.cou
 passing</a>, which is similar to <a href="https://issues.apache.org/jira/browse/SPARK-32268">Bloom filter joins</a> in Apache Spark. See 
 <a href="https://github.com/apache/datafusion/issues/7955">issue #7955</a> for more details.</p>
 <p>In a Hash Join, the query engine picks one input of the join to be the "build"
-input and the other input to be the "probe" side. A hash join first <em>builds</em> a
-hash table by reading the build input into memory, and then it reads the probe
-input, using the hash table to find matching rows from the probe side. Many hash
-joins are very selective (only a small number of rows are matched) and act as
-filters, so it is natural to use the same dynamic filter technique to create
-filters on the probe side scan with the values seen on the build side.</p>
-<p>DataFusion 50.0.0 adds dynamic filters to the probe input using min/max join
-key values from the build side. This simple approach is fast to evaluate and the
-filter improves performance significantly when combined with statistics pruning,
-late materialization, and other optimizations as shown in Figure 7.</p>
+input and the other input to be the "probe" side.</p>
+<ul>
+<li>
+<p>First, the <strong>build side</strong> is loaded into memory, and turned into a hash table.</p>
+</li>
+<li>
+<p>Then, the <strong>probe side</strong> is scanned, and matching rows are found by looking 
+  in the hash table. Non-matching rows are discarded and thus joins often act as
+  filters.</p>
+</li>
+</ul>
+<p>Many hash joins are very selective (only a small number of rows are matched), so
+it is natural to use the same dynamic filter technique. DataFusion 50.0.0 pushes
+down knowledge of what keys exist on the build side into the scan of the probe
+side with a dynamic filter based on min/max join key values. For example, if the
+build side only has keys in the range <code>[100, 200]</code>, then DataFusion will filter
+all probe rows with keys outside that range during the scan.</p>
+<p>This simple approach is fast to evaluate and the filter improves performance
+significantly when combined with statistics pruning, late materialization, and
+other optimizations as shown in Figure 7.</p>
 <div class="text-center">
 <img alt="Join Performance Improvements with Dynamic Filters" class="img-responsive" src="/blog/images/dynamic-filters/join-performance.svg" width="80%"/>
 </div>
 <p><strong>Figure 7</strong>: Join performance with and without dynamic filters. In DataFusion
-49.0.2 the join takes 2.5s, even with late materialization enabled. In
+49.0.2 the join takes 2.5s, even with late materialization (LM) enabled. In
 DataFusion 50.0.0 with dynamic filters enabled (the default), the join takes
 only 0.7s, a 5x improvement. With both dynamic filters and late materialization,
 DataFusion 50.0.0 takes 0.1s, a 25x improvement. See this <a href="https://github.com/apache/datafusion-site/pull/103#issuecomment-3262612288">discussion</a> for more
 details.</p>
 <p>You can see dynamic join filters in action with the following example. </p>
 <pre><code class="language-sql">-- create two tables: small_table with 1K rows and large_table with 100K rows
-copy (select i as k, i as v from generate_series(1, 1000) t(i)) to 'small_table.parquet';
-create external table small_table stored as parquet location 'small_table.parquet';
-copy (select i as k from generate_series(1, 100000) t(i)) to 'large_table.parquet';
-create external table large_table stored as parquet location 'large_table.parquet';
+COPY (SELECT i as k, i as v FROM generate_series(1, 1000) t(i)) TO 'small_table.parquet';
+CREATE EXTERNAL TABLE small_table STORED AS PARQUET LOCATION 'small_table.parquet';
+COPY (SELECT i as k FROM generate_series(1, 100000) t(i)) TO 'large_table.parquet';
+CREATE EXTERNAL TABLE large_table STORED AS PARQUET LOCATION 'large_table.parquet';
 
 -- Join the two tables, with a filter on small_table
 EXPLAIN 
diff --git a/blog/feeds/adrian-garcia-badaracco-pydantic-andrew-lamb-influxdata.atom.xml b/blog/feeds/adrian-garcia-badaracco-pydantic-andrew-lamb-influxdata.atom.xml
index 64cdffa..5c7cfa9 100644
--- a/blog/feeds/adrian-garcia-badaracco-pydantic-andrew-lamb-influxdata.atom.xml
+++ b/blog/feeds/adrian-garcia-badaracco-pydantic-andrew-lamb-influxdata.atom.xml
@@ -276,31 +276,41 @@ improve hash joins by implementing a technique called &lt;a href="https://15721.
 passing&lt;/a&gt;, which is similar to &lt;a href="https://issues.apache.org/jira/browse/SPARK-32268"&gt;Bloom filter joins&lt;/a&gt; in Apache Spark. See 
 &lt;a href="https://github.com/apache/datafusion/issues/7955"&gt;issue #7955&lt;/a&gt; for more details.&lt;/p&gt;
 &lt;p&gt;In a Hash Join, the query engine picks one input of the join to be the "build"
-input and the other input to be the "probe" side. A hash join first &lt;em&gt;builds&lt;/em&gt; a
-hash table by reading the build input into memory, and then it reads the probe
-input, using the hash table to find matching rows from the probe side. Many hash
-joins are very selective (only a small number of rows are matched) and act as
-filters, so it is natural to use the same dynamic filter technique to create
-filters on the probe side scan with the values seen on the build side.&lt;/p&gt;
-&lt;p&gt;DataFusion 50.0.0 adds dynamic filters to the probe input using min/max join
-key values from the build side. This simple approach is fast to evaluate and the
-filter improves performance significantly when combined with statistics pruning,
-late materialization, and other optimizations as shown in Figure 7.&lt;/p&gt;
+input and the other input to be the "probe" side.&lt;/p&gt;
+&lt;ul&gt;
+&lt;li&gt;
+&lt;p&gt;First, the &lt;strong&gt;build side&lt;/strong&gt; is loaded into memory, and turned into a hash table.&lt;/p&gt;
+&lt;/li&gt;
+&lt;li&gt;
+&lt;p&gt;Then, the &lt;strong&gt;probe side&lt;/strong&gt; is scanned, and matching rows are found by looking 
+  in the hash table. Non-matching rows are discarded and thus joins often act as
+  filters.&lt;/p&gt;
+&lt;/li&gt;
+&lt;/ul&gt;
+&lt;p&gt;Many hash joins are very selective (only a small number of rows are matched), so
+it is natural to use the same dynamic filter technique. DataFusion 50.0.0 pushes
+down knowledge of what keys exist on the build side into the scan of the probe
+side with a dynamic filter based on min/max join key values. For example, if the
+build side only has keys in the range &lt;code&gt;[100, 200]&lt;/code&gt;, then DataFusion will filter
+all probe rows with keys outside that range during the scan.&lt;/p&gt;
+&lt;p&gt;This simple approach is fast to evaluate and the filter improves performance
+significantly when combined with statistics pruning, late materialization, and
+other optimizations as shown in Figure 7.&lt;/p&gt;
 &lt;div class="text-center"&gt;
 &lt;img alt="Join Performance Improvements with Dynamic Filters" class="img-responsive" src="/blog/images/dynamic-filters/join-performance.svg" width="80%"/&gt;
 &lt;/div&gt;
 &lt;p&gt;&lt;strong&gt;Figure 7&lt;/strong&gt;: Join performance with and without dynamic filters. In DataFusion
-49.0.2 the join takes 2.5s, even with late materialization enabled. In
+49.0.2 the join takes 2.5s, even with late materialization (LM) enabled. In
 DataFusion 50.0.0 with dynamic filters enabled (the default), the join takes
 only 0.7s, a 5x improvement. With both dynamic filters and late materialization,
 DataFusion 50.0.0 takes 0.1s, a 25x improvement. See this &lt;a href="https://github.com/apache/datafusion-site/pull/103#issuecomment-3262612288"&gt;discussion&lt;/a&gt; for more
 details.&lt;/p&gt;
 &lt;p&gt;You can see dynamic join filters in action with the following example. &lt;/p&gt;
 &lt;pre&gt;&lt;code class="language-sql"&gt;-- create two tables: small_table with 1K rows and large_table with 100K rows
-copy (select i as k, i as v from generate_series(1, 1000) t(i)) to 'small_table.parquet';
-create external table small_table stored as parquet location 'small_table.parquet';
-copy (select i as k from generate_series(1, 100000) t(i)) to 'large_table.parquet';
-create external table large_table stored as parquet location 'large_table.parquet';
+COPY (SELECT i as k, i as v FROM generate_series(1, 1000) t(i)) TO 'small_table.parquet';
+CREATE EXTERNAL TABLE small_table STORED AS PARQUET LOCATION 'small_table.parquet';
+COPY (SELECT i as k FROM generate_series(1, 100000) t(i)) TO 'large_table.parquet';
+CREATE EXTERNAL TABLE large_table STORED AS PARQUET LOCATION 'large_table.parquet';
 
 -- Join the two tables, with a filter on small_table
 EXPLAIN 
diff --git a/blog/feeds/all-en.atom.xml b/blog/feeds/all-en.atom.xml
index 5cb00a5..3a0f672 100644
--- a/blog/feeds/all-en.atom.xml
+++ b/blog/feeds/all-en.atom.xml
@@ -276,31 +276,41 @@ improve hash joins by implementing a technique called &lt;a href="https://15721.
 passing&lt;/a&gt;, which is similar to &lt;a href="https://issues.apache.org/jira/browse/SPARK-32268"&gt;Bloom filter joins&lt;/a&gt; in Apache Spark. See 
 &lt;a href="https://github.com/apache/datafusion/issues/7955"&gt;issue #7955&lt;/a&gt; for more details.&lt;/p&gt;
 &lt;p&gt;In a Hash Join, the query engine picks one input of the join to be the "build"
-input and the other input to be the "probe" side. A hash join first &lt;em&gt;builds&lt;/em&gt; a
-hash table by reading the build input into memory, and then it reads the probe
-input, using the hash table to find matching rows from the probe side. Many hash
-joins are very selective (only a small number of rows are matched) and act as
-filters, so it is natural to use the same dynamic filter technique to create
-filters on the probe side scan with the values seen on the build side.&lt;/p&gt;
-&lt;p&gt;DataFusion 50.0.0 adds dynamic filters to the probe input using min/max join
-key values from the build side. This simple approach is fast to evaluate and the
-filter improves performance significantly when combined with statistics pruning,
-late materialization, and other optimizations as shown in Figure 7.&lt;/p&gt;
+input and the other input to be the "probe" side.&lt;/p&gt;
+&lt;ul&gt;
+&lt;li&gt;
+&lt;p&gt;First, the &lt;strong&gt;build side&lt;/strong&gt; is loaded into memory, and turned into a hash table.&lt;/p&gt;
+&lt;/li&gt;
+&lt;li&gt;
+&lt;p&gt;Then, the &lt;strong&gt;probe side&lt;/strong&gt; is scanned, and matching rows are found by looking 
+  in the hash table. Non-matching rows are discarded and thus joins often act as
+  filters.&lt;/p&gt;
+&lt;/li&gt;
+&lt;/ul&gt;
+&lt;p&gt;Many hash joins are very selective (only a small number of rows are matched), so
+it is natural to use the same dynamic filter technique. DataFusion 50.0.0 pushes
+down knowledge of what keys exist on the build side into the scan of the probe
+side with a dynamic filter based on min/max join key values. For example, if the
+build side only has keys in the range &lt;code&gt;[100, 200]&lt;/code&gt;, then DataFusion will filter
+all probe rows with keys outside that range during the scan.&lt;/p&gt;
+&lt;p&gt;This simple approach is fast to evaluate and the filter improves performance
+significantly when combined with statistics pruning, late materialization, and
+other optimizations as shown in Figure 7.&lt;/p&gt;
 &lt;div class="text-center"&gt;
 &lt;img alt="Join Performance Improvements with Dynamic Filters" class="img-responsive" src="/blog/images/dynamic-filters/join-performance.svg" width="80%"/&gt;
 &lt;/div&gt;
 &lt;p&gt;&lt;strong&gt;Figure 7&lt;/strong&gt;: Join performance with and without dynamic filters. In DataFusion
-49.0.2 the join takes 2.5s, even with late materialization enabled. In
+49.0.2 the join takes 2.5s, even with late materialization (LM) enabled. In
 DataFusion 50.0.0 with dynamic filters enabled (the default), the join takes
 only 0.7s, a 5x improvement. With both dynamic filters and late materialization,
 DataFusion 50.0.0 takes 0.1s, a 25x improvement. See this &lt;a href="https://github.com/apache/datafusion-site/pull/103#issuecomment-3262612288"&gt;discussion&lt;/a&gt; for more
 details.&lt;/p&gt;
 &lt;p&gt;You can see dynamic join filters in action with the following example. &lt;/p&gt;
 &lt;pre&gt;&lt;code class="language-sql"&gt;-- create two tables: small_table with 1K rows and large_table with 100K rows
-copy (select i as k, i as v from generate_series(1, 1000) t(i)) to 'small_table.parquet';
-create external table small_table stored as parquet location 'small_table.parquet';
-copy (select i as k from generate_series(1, 100000) t(i)) to 'large_table.parquet';
-create external table large_table stored as parquet location 'large_table.parquet';
+COPY (SELECT i as k, i as v FROM generate_series(1, 1000) t(i)) TO 'small_table.parquet';
+CREATE EXTERNAL TABLE small_table STORED AS PARQUET LOCATION 'small_table.parquet';
+COPY (SELECT i as k FROM generate_series(1, 100000) t(i)) TO 'large_table.parquet';
+CREATE EXTERNAL TABLE large_table STORED AS PARQUET LOCATION 'large_table.parquet';
 
 -- Join the two tables, with a filter on small_table
 EXPLAIN 
diff --git a/blog/feeds/blog.atom.xml b/blog/feeds/blog.atom.xml
index 200124d..ebab74f 100644
--- a/blog/feeds/blog.atom.xml
+++ b/blog/feeds/blog.atom.xml
@@ -276,31 +276,41 @@ improve hash joins by implementing a technique called &lt;a href="https://15721.
 passing&lt;/a&gt;, which is similar to &lt;a href="https://issues.apache.org/jira/browse/SPARK-32268"&gt;Bloom filter joins&lt;/a&gt; in Apache Spark. See 
 &lt;a href="https://github.com/apache/datafusion/issues/7955"&gt;issue #7955&lt;/a&gt; for more details.&lt;/p&gt;
 &lt;p&gt;In a Hash Join, the query engine picks one input of the join to be the "build"
-input and the other input to be the "probe" side. A hash join first &lt;em&gt;builds&lt;/em&gt; a
-hash table by reading the build input into memory, and then it reads the probe
-input, using the hash table to find matching rows from the probe side. Many hash
-joins are very selective (only a small number of rows are matched) and act as
-filters, so it is natural to use the same dynamic filter technique to create
-filters on the probe side scan with the values seen on the build side.&lt;/p&gt;
-&lt;p&gt;DataFusion 50.0.0 adds dynamic filters to the probe input using min/max join
-key values from the build side. This simple approach is fast to evaluate and the
-filter improves performance significantly when combined with statistics pruning,
-late materialization, and other optimizations as shown in Figure 7.&lt;/p&gt;
+input and the other input to be the "probe" side.&lt;/p&gt;
+&lt;ul&gt;
+&lt;li&gt;
+&lt;p&gt;First, the &lt;strong&gt;build side&lt;/strong&gt; is loaded into memory, and turned into a hash table.&lt;/p&gt;
+&lt;/li&gt;
+&lt;li&gt;
+&lt;p&gt;Then, the &lt;strong&gt;probe side&lt;/strong&gt; is scanned, and matching rows are found by looking 
+  in the hash table. Non-matching rows are discarded and thus joins often act as
+  filters.&lt;/p&gt;
+&lt;/li&gt;
+&lt;/ul&gt;
+&lt;p&gt;Many hash joins are very selective (only a small number of rows are matched), so
+it is natural to use the same dynamic filter technique. DataFusion 50.0.0 pushes
+down knowledge of what keys exist on the build side into the scan of the probe
+side with a dynamic filter based on min/max join key values. For example, if the
+build side only has keys in the range &lt;code&gt;[100, 200]&lt;/code&gt;, then DataFusion will filter
+all probe rows with keys outside that range during the scan.&lt;/p&gt;
+&lt;p&gt;This simple approach is fast to evaluate and the filter improves performance
+significantly when combined with statistics pruning, late materialization, and
+other optimizations as shown in Figure 7.&lt;/p&gt;
 &lt;div class="text-center"&gt;
 &lt;img alt="Join Performance Improvements with Dynamic Filters" class="img-responsive" src="/blog/images/dynamic-filters/join-performance.svg" width="80%"/&gt;
 &lt;/div&gt;
 &lt;p&gt;&lt;strong&gt;Figure 7&lt;/strong&gt;: Join performance with and without dynamic filters. In DataFusion
-49.0.2 the join takes 2.5s, even with late materialization enabled. In
+49.0.2 the join takes 2.5s, even with late materialization (LM) enabled. In
 DataFusion 50.0.0 with dynamic filters enabled (the default), the join takes
 only 0.7s, a 5x improvement. With both dynamic filters and late materialization,
 DataFusion 50.0.0 takes 0.1s, a 25x improvement. See this &lt;a href="https://github.com/apache/datafusion-site/pull/103#issuecomment-3262612288"&gt;discussion&lt;/a&gt; for more
 details.&lt;/p&gt;
 &lt;p&gt;You can see dynamic join filters in action with the following example. &lt;/p&gt;
 &lt;pre&gt;&lt;code class="language-sql"&gt;-- create two tables: small_table with 1K rows and large_table with 100K rows
-copy (select i as k, i as v from generate_series(1, 1000) t(i)) to 'small_table.parquet';
-create external table small_table stored as parquet location 'small_table.parquet';
-copy (select i as k from generate_series(1, 100000) t(i)) to 'large_table.parquet';
-create external table large_table stored as parquet location 'large_table.parquet';
+COPY (SELECT i as k, i as v FROM generate_series(1, 1000) t(i)) TO 'small_table.parquet';
+CREATE EXTERNAL TABLE small_table STORED AS PARQUET LOCATION 'small_table.parquet';
+COPY (SELECT i as k FROM generate_series(1, 100000) t(i)) TO 'large_table.parquet';
+CREATE EXTERNAL TABLE large_table STORED AS PARQUET LOCATION 'large_table.parquet';
 
 -- Join the two tables, with a filter on small_table
 EXPLAIN 
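
Note on the change above: the rewritten paragraph describes pushing a min/max filter from the build side of a hash join into the probe-side scan. The following is a minimal Python sketch of that idea, not DataFusion's implementation. The names MinMaxFilter and hash_join are hypothetical, the keys are assumed to be integers, and the filter is applied row by row here, whereas the post describes applying it during the scan together with statistics pruning and late materialization.

```python
from dataclasses import dataclass


@dataclass
class MinMaxFilter:
    """Inclusive key range observed on the build side (hypothetical helper)."""
    lo: int
    hi: int

    def may_match(self, key: int) -> bool:
        # A key outside [lo, hi] cannot be present in the build-side hash table.
        return self.lo <= key <= self.hi


def hash_join(build_rows, probe_keys):
    """Join (key, value) build rows with probe keys, pruning with a min/max filter.

    Returns the joined (key, value) pairs and the number of probe keys that
    were skipped by the filter before any hash table lookup.
    """
    # Build phase: load the build side into an in-memory hash table.
    table = {}
    for key, value in build_rows:
        table.setdefault(key, []).append(value)
    if not table:
        return [], 0

    # The filter can only be created once the build side has been fully read.
    key_filter = MinMaxFilter(min(table), max(table))

    # Probe phase: discard out-of-range keys first, then look up the rest.
    joined, skipped = [], 0
    for key in probe_keys:
        if not key_filter.may_match(key):
            skipped += 1
            continue
        for value in table.get(key, ()):
            joined.append((key, value))
    return joined, skipped


# Build keys fall in [100, 200]; the probe side has keys 1..1000.
build = [(k, f"v{k}") for k in range(100, 201)]
joined, skipped = hash_join(build, range(1, 1001))
print(len(joined), skipped)
```

With these inputs the sketch prints "101 899": 101 probe keys fall inside [100, 200] and match, and the remaining 899 are rejected by the range check without a hash table lookup.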

