This is an automated email from the ASF dual-hosted git repository.
github-bot pushed a commit to branch asf-staging
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
The following commit(s) were added to refs/heads/asf-staging by this push:
new c2f82bc Commit build products
c2f82bc is described below
commit c2f82bcf91b4ddc9f30441ae704268cc5c980999
Author: Build Pelican (action) <[email protected]>
AuthorDate: Fri Mar 20 22:21:02 2026 +0000
Commit build products
---
blog/2026/03/20/writing-table-providers/index.html | 8 +-
blog/author/tim-saucer-rerunio.html | 64 +++
blog/author/timsaucer.html | 32 --
blog/category/blog.html | 2 +-
blog/feed.xml | 2 +-
blog/feeds/all-en.atom.xml | 6 +-
blog/feeds/blog.atom.xml | 6 +-
blog/feeds/tim-saucer-rerunio.atom.xml | 601 +++++++++++++++++++++
blog/feeds/tim-saucer-rerunio.rss.xml | 24 +
blog/feeds/timsaucer.atom.xml | 597 +-------------------
blog/feeds/timsaucer.rss.xml | 24 +-
blog/index.html | 2 +-
12 files changed, 711 insertions(+), 657 deletions(-)
diff --git a/blog/2026/03/20/writing-table-providers/index.html
b/blog/2026/03/20/writing-table-providers/index.html
index 8636a6f..9a7c9fb 100644
--- a/blog/2026/03/20/writing-table-providers/index.html
+++ b/blog/2026/03/20/writing-table-providers/index.html
@@ -48,7 +48,7 @@
<h1>
Writing Custom Table Providers in Apache DataFusion
</h1>
- <p>Posted on: Fri 20 March 2026 by timsaucer</p>
+ <p>Posted on: Fri 20 March 2026 by Tim Saucer (rerun.io)</p>
<aside class="toc-container d-md-none mb-2">
<div class="toc"><span class="toctitle">Contents</span><ul>
@@ -81,6 +81,7 @@
</li>
<li><a href="#putting-it-all-together">Putting It All Together</a></li>
<li><a href="#choosing-the-right-starting-point">Choosing the Right Starting
Point</a></li>
+<li><a href="#acknowledgements">Acknowledgements</a></li>
<li><a href="#further-reading">Further Reading</a></li>
</ul>
</div>
@@ -648,6 +649,10 @@ level makes sense:</p>
[<code>RecordBatchStreamAdapter</code>] provides a good balance of simplicity
and
flexibility. You provide a closure that returns a stream, and DataFusion
handles
the rest.</p>
+<h2 id="acknowledgements">Acknowledgements<a class="headerlink"
href="#acknowledgements" title="Permanent link">¶</a></h2>
+<p>I would like to thank <a href="https://rerun.io">Rerun.io</a> for
sponsoring the development of this work. <a href="https://rerun.io">Rerun.io</a>
+is building a data visualization system for Physical AI and makes heavy use of
DataFusion
+table providers in its data analytics tooling.</p>
<h2 id="further-reading">Further Reading<a class="headerlink"
href="#further-reading" title="Permanent link">¶</a></h2>
<hr/>
<ul>
@@ -722,6 +727,7 @@ the rest.</p>
</li>
<li><a href="#putting-it-all-together">Putting It All Together</a></li>
<li><a href="#choosing-the-right-starting-point">Choosing the Right Starting
Point</a></li>
+<li><a href="#acknowledgements">Acknowledgements</a></li>
<li><a href="#further-reading">Further Reading</a></li>
</ul>
</div>
diff --git a/blog/author/tim-saucer-rerunio.html
b/blog/author/tim-saucer-rerunio.html
new file mode 100644
index 0000000..d88cf14
--- /dev/null
+++ b/blog/author/tim-saucer-rerunio.html
@@ -0,0 +1,64 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+ <title>Apache DataFusion Blog - Articles by Tim Saucer
(rerun.io)</title>
+ <meta charset="utf-8" />
+ <meta name="generator" content="Pelican" />
+ <link href="https://datafusion.apache.org/blog/feed.xml"
type="application/rss+xml" rel="alternate" title="Apache DataFusion Blog RSS
Feed" />
+</head>
+
+<body id="index" class="home">
+ <header id="banner" class="body">
+ <h1><a href="https://datafusion.apache.org/blog/">Apache
DataFusion Blog</a></h1>
+ </header><!-- /#banner -->
+ <nav id="menu"><ul>
+ <li><a
href="https://datafusion.apache.org/blog/pages/about.html">About</a></li>
+ <li><a
href="https://datafusion.apache.org/blog/pages/index.html">index</a></li>
+ <li><a
href="https://datafusion.apache.org/blog/category/blog.html">blog</a></li>
+ </ul></nav><!-- /#menu -->
+<section id="content">
+<h2>Articles by Tim Saucer (rerun.io)</h2>
+
+<ol id="post-list">
+ <li><article class="hentry">
+ <header> <h2 class="entry-title"><a
href="https://datafusion.apache.org/blog/2026/03/20/writing-table-providers"
rel="bookmark" title="Permalink to Writing Custom Table Providers in Apache
DataFusion">Writing Custom Table Providers in Apache DataFusion</a></h2>
</header>
+ <footer class="post-info">
+ <time class="published"
datetime="2026-03-20T00:00:00+00:00"> Fri 20 March 2026 </time>
+ <address class="vcard author">By
+ <a class="url fn"
href="https://datafusion.apache.org/blog/author/tim-saucer-rerunio.html">Tim
Saucer (rerun.io)</a>
+ </address>
+ </footer><!-- /.post-info -->
+ <div class="entry-content"> <!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<p>One of DataFusion's greatest strengths is its extensibility. If your data
lives
+in a custom format, behind an API, or in a system that DataFusion does not
+natively support, you can teach DataFusion to read it by implementing a
+<strong>custom table provider</strong>. This post walks through the three
layers you …</p> </div><!-- /.entry-content -->
+ </article></li>
+</ol><!-- /#posts-list -->
+</section><!-- /#content -->
+ <footer id="contentinfo" class="body">
+ <address id="about" class="vcard body">
+ Proudly powered by <a
href="https://getpelican.com/">Pelican</a>,
+ which takes great advantage of <a
href="https://www.python.org/">Python</a>.
+ </address><!-- /#about -->
+ </footer><!-- /#contentinfo -->
+</body>
+</html>
\ No newline at end of file
diff --git a/blog/author/timsaucer.html b/blog/author/timsaucer.html
index aab5e7d..2711c4a 100644
--- a/blog/author/timsaucer.html
+++ b/blog/author/timsaucer.html
@@ -20,38 +20,6 @@
<h2>Articles by timsaucer</h2>
<ol id="post-list">
- <li><article class="hentry">
- <header> <h2 class="entry-title"><a
href="https://datafusion.apache.org/blog/2026/03/20/writing-table-providers"
rel="bookmark" title="Permalink to Writing Custom Table Providers in Apache
DataFusion">Writing Custom Table Providers in Apache DataFusion</a></h2>
</header>
- <footer class="post-info">
- <time class="published"
datetime="2026-03-20T00:00:00+00:00"> Fri 20 March 2026 </time>
- <address class="vcard author">By
- <a class="url fn"
href="https://datafusion.apache.org/blog/author/timsaucer.html">timsaucer</a>
- </address>
- </footer><!-- /.post-info -->
- <div class="entry-content"> <!--
-{% comment %}
-Licensed to the Apache Software Foundation (ASF) under one or more
-contributor license agreements. See the NOTICE file distributed with
-this work for additional information regarding copyright ownership.
-The ASF licenses this file to you under the Apache License, Version 2.0
-(the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-{% endcomment %}
--->
-
-<p>One of DataFusion's greatest strengths is its extensibility. If your data
lives
-in a custom format, behind an API, or in a system that DataFusion does not
-natively support, you can teach DataFusion to read it by implementing a
-<strong>custom table provider</strong>. This post walks through the three
layers you …</p> </div><!-- /.entry-content -->
- </article></li>
<li><article class="hentry">
<header> <h2 class="entry-title"><a
href="https://datafusion.apache.org/blog/2025/03/30/datafusion-python-46.0.0"
rel="bookmark" title="Permalink to Apache DataFusion Python 46.0.0
Released">Apache DataFusion Python 46.0.0 Released</a></h2> </header>
<footer class="post-info">
diff --git a/blog/category/blog.html b/blog/category/blog.html
index f8add02..46e8826 100644
--- a/blog/category/blog.html
+++ b/blog/category/blog.html
@@ -76,7 +76,7 @@ figcaption {
<footer class="post-info">
<time class="published"
datetime="2026-03-20T00:00:00+00:00"> Fri 20 March 2026 </time>
<address class="vcard author">By
- <a class="url fn"
href="https://datafusion.apache.org/blog/author/timsaucer.html">timsaucer</a>
+ <a class="url fn"
href="https://datafusion.apache.org/blog/author/tim-saucer-rerunio.html">Tim
Saucer (rerun.io)</a>
</address>
</footer><!-- /.post-info -->
<div class="entry-content"> <!--
diff --git a/blog/feed.xml b/blog/feed.xml
index 6b86d5b..ede9222 100644
--- a/blog/feed.xml
+++ b/blog/feed.xml
@@ -61,7 +61,7 @@ limitations under the License.
<p>One of DataFusion's greatest strengths is its extensibility. If your
data lives
in a custom format, behind an API, or in a system that DataFusion does not
natively support, you can teach DataFusion to read it by implementing a
-<strong>custom table provider</strong>. This post walks through
the three layers you …</p></description><dc:creator
xmlns:dc="http://purl.org/dc/elements/1.1/">timsaucer</dc:creator><pubDate>Fri,
20 Mar 2026 00:00:00 +0000</pubDate><guid
isPermaLink="false">tag:datafusion.apache.org,2026-03-20:/blog/2026/03/20/writing-table-providers</guid><category>blog</category></item><item><title>Optimizing
SQL CASE Expression Evaluation</title><link>https://datafusion.apache.org/bl
[...]
+<strong>custom table provider</strong>. This post walks through
the three layers you …</p></description><dc:creator
xmlns:dc="http://purl.org/dc/elements/1.1/">Tim Saucer
(rerun.io)</dc:creator><pubDate>Fri, 20 Mar 2026 00:00:00 +0000</pubDate><guid
isPermaLink="false">tag:datafusion.apache.org,2026-03-20:/blog/2026/03/20/writing-table-providers</guid><category>blog</category></item><item><title>Optimizing
SQL CASE Expression Evaluation</title><link>https://datafusion.a [...]
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
diff --git a/blog/feeds/all-en.atom.xml b/blog/feeds/all-en.atom.xml
index bd45c8b..b58e3fa 100644
--- a/blog/feeds/all-en.atom.xml
+++ b/blog/feeds/all-en.atom.xml
@@ -310,7 +310,7 @@ limit_pruned_row_groups=3 total → 1 matched
<p><a href="https://datafusion.apache.org/">Apache
DataFusion</a> is an extensible query engine, written in <a
href="https://www.rust-lang.org/">Rust</a>, that uses <a
href="https://arrow.apache.org">Apache Arrow</a> as its in-memory
format. DataFusion is used by developers to create new, fast, data-centric
systems such as databases, dataframe libraries, and machine learning and
streaming applications.</p>
<p>DataFusion's core thesis is that, as a community, together we can
build much more advanced technology than any of us as individuals or companies
could build alone.</p>
<h2 id="how-to-get-involved">How to Get Involved<a class="headerlink"
href="#how-to-get-involved" title="Permanent link">¶</a></h2>
-<p>If you are interested in contributing, we would love to have you. You
can try out DataFusion on some of your own data and projects and let us know
how it goes, contribute suggestions, documentation, bug reports, or a PR with
documentation, tests, or code. A list of open issues suitable for beginners is
<a
href="https://github.com/apache/datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22">here</a>,
and you can find out how to reach us on the <a [...]
+<p>If you are interested in contributing, we would love to have you. You
can try out DataFusion on some of your own data and projects and let us know
how it goes, contribute suggestions, documentation, bug reports, or a PR with
documentation, tests, or code. A list of open issues suitable for beginners is
<a
href="https://github.com/apache/datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22">here</a>,
and you can find out how to reach us on the <a [...]
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
@@ -894,6 +894,10 @@ level makes sense:</p>
[<code>RecordBatchStreamAdapter</code>] provides a good balance of
simplicity and
flexibility. You provide a closure that returns a stream, and DataFusion
handles
the rest.</p>
+<h2 id="acknowledgements">Acknowledgements<a class="headerlink"
href="#acknowledgements" title="Permanent link">¶</a></h2>
+<p>I would like to thank <a
href="https://rerun.io">Rerun.io</a> for sponsoring the development of
this work. <a href="https://rerun.io">Rerun.io</a>
+is building a data visualization system for Physical AI and makes heavy use of
DataFusion
+table providers in its data analytics tooling.</p>
<h2 id="further-reading">Further Reading<a class="headerlink"
href="#further-reading" title="Permanent link">¶</a></h2>
<hr/>
<ul>
diff --git a/blog/feeds/blog.atom.xml b/blog/feeds/blog.atom.xml
index 9ab9a85..2cdd2c4 100644
--- a/blog/feeds/blog.atom.xml
+++ b/blog/feeds/blog.atom.xml
@@ -310,7 +310,7 @@ limit_pruned_row_groups=3 total → 1 matched
<p><a href="https://datafusion.apache.org/">Apache
DataFusion</a> is an extensible query engine, written in <a
href="https://www.rust-lang.org/">Rust</a>, that uses <a
href="https://arrow.apache.org">Apache Arrow</a> as its in-memory
format. DataFusion is used by developers to create new, fast, data-centric
systems such as databases, dataframe libraries, and machine learning and
streaming applications.</p>
<p>DataFusion's core thesis is that, as a community, together we can
build much more advanced technology than any of us as individuals or companies
could build alone.</p>
<h2 id="how-to-get-involved">How to Get Involved<a class="headerlink"
href="#how-to-get-involved" title="Permanent link">¶</a></h2>
-<p>If you are interested in contributing, we would love to have you. You
can try out DataFusion on some of your own data and projects and let us know
how it goes, contribute suggestions, documentation, bug reports, or a PR with
documentation, tests, or code. A list of open issues suitable for beginners is
<a
href="https://github.com/apache/datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22">here</a>,
and you can find out how to reach us on the <a [...]
+<p>If you are interested in contributing, we would love to have you. You
can try out DataFusion on some of your own data and projects and let us know
how it goes, contribute suggestions, documentation, bug reports, or a PR with
documentation, tests, or code. A list of open issues suitable for beginners is
<a
href="https://github.com/apache/datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22">here</a>,
and you can find out how to reach us on the <a [...]
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
@@ -894,6 +894,10 @@ level makes sense:</p>
[<code>RecordBatchStreamAdapter</code>] provides a good balance of
simplicity and
flexibility. You provide a closure that returns a stream, and DataFusion
handles
the rest.</p>
+<h2 id="acknowledgements">Acknowledgements<a class="headerlink"
href="#acknowledgements" title="Permanent link">¶</a></h2>
+<p>I would like to thank <a
href="https://rerun.io">Rerun.io</a> for sponsoring the development of
this work. <a href="https://rerun.io">Rerun.io</a>
+is building a data visualization system for Physical AI and makes heavy use of
DataFusion
+table providers in its data analytics tooling.</p>
<h2 id="further-reading">Further Reading<a class="headerlink"
href="#further-reading" title="Permanent link">¶</a></h2>
<hr/>
<ul>
diff --git a/blog/feeds/tim-saucer-rerunio.atom.xml
b/blog/feeds/tim-saucer-rerunio.atom.xml
new file mode 100644
index 0000000..47af897
--- /dev/null
+++ b/blog/feeds/tim-saucer-rerunio.atom.xml
@@ -0,0 +1,601 @@
+<?xml version="1.0" encoding="utf-8"?>
+<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog - Tim
Saucer (rerun.io)</title><link href="https://datafusion.apache.org/blog/"
rel="alternate"></link><link
href="https://datafusion.apache.org/blog/feeds/tim-saucer-rerunio.atom.xml"
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2026-03-20T00:00:00+00:00</updated><subtitle></subtitle><entry><title>Writing
Custom Table Providers in Apache DataFusion</title><link
href="https://datafusion.apac [...]
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<p>One of DataFusion's greatest strengths is its extensibility. If your
data lives
+in a custom format, behind an API, or in a system that DataFusion does not
+natively support, you can teach DataFusion to read it by implementing a
+<strong>custom table provider</strong>. This post walks through
the three layers you …</p></summary><content type="html"><!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<p>One of DataFusion's greatest strengths is its extensibility. If your
data lives
+in a custom format, behind an API, or in a system that DataFusion does not
+natively support, you can teach DataFusion to read it by implementing a
+<strong>custom table provider</strong>. This post walks through
the three layers you need to
+understand and explains where your work should actually happen.</p>
+<h2 id="the-three-layers">The Three Layers<a class="headerlink"
href="#the-three-layers" title="Permanent link">¶</a></h2>
+<hr/>
+<p>When DataFusion executes a query against a table, three abstractions
collaborate
+to produce results:</p>
+<ol>
+<li><strong>[<code>TableProvider</code>]</strong>
-- Describes the table (schema, capabilities) and
+ produces an execution plan when queried.</li>
+<li><strong>[<code>ExecutionPlan</code>]</strong>
-- Describes <em>how</em> to compute the result: partitioning,
+ ordering, and child plan relationships.</li>
+<li><strong>[<code>SendableRecordBatchStream</code>]</strong>
-- The async stream that <em>actually does the
+ work</em>, yielding <code>RecordBatch</code>es one at a
time.</li>
+</ol>
+<p>Think of these as a funnel:
<code>TableProvider::scan()</code> is called once during
+planning to create an <code>ExecutionPlan</code>, then
<code>ExecutionPlan::execute()</code> is called
+once per partition to create a stream, and those streams are where rows are
+actually produced during execution.</p>
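+<p>That call order can be sketched in plain Rust with no DataFusion types at
all (every name here is hypothetical, and plain iterators stand in for async
streams):</p>

```rust
// Hypothetical mini versions of the three layers: the provider plans once,
// the plan executes once per partition, and the iterators stand in for
// streams that actually produce rows.
struct Table { partitions: usize }
struct Plan { partitions: usize }

impl Table {
    // Planning: called once, cheap, returns only a description.
    fn scan(&self) -> Plan { Plan { partitions: self.partitions } }
}

impl Plan {
    // Execution setup: called once per partition.
    fn execute(&self, partition: usize) -> impl Iterator<Item = u64> {
        (0..3).map(move |i| (partition as u64) * 100 + i)
    }
}

fn main() {
    let plan = Table { partitions: 2 }.scan(); // planning happens once
    let rows: Vec<u64> = (0..plan.partitions)
        .flat_map(|p| plan.execute(p)) // one "stream" per partition
        .collect();
    assert_eq!(rows, vec![0, 1, 2, 100, 101, 102]);
}
```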
+<h2 id="layer-1-tableprovider">Layer 1: TableProvider<a
class="headerlink" href="#layer-1-tableprovider" title="Permanent
link">¶</a></h2>
+<hr/>
+<p>A [<code>TableProvider</code>] represents a queryable
data source. For a minimal read-only
+table, you need four methods:</p>
+<pre><code class="language-rust">impl TableProvider for MyTable {
+ fn as_any(&amp;self) -&gt; &amp;dyn Any { self }
+
+ fn schema(&amp;self) -&gt; SchemaRef {
+ Arc::clone(&amp;self.schema)
+ }
+
+ fn table_type(&amp;self) -&gt; TableType {
+ TableType::Base
+ }
+
+ async fn scan(
+ &amp;self,
+ state: &amp;dyn Session,
+ projection: Option&lt;&amp;Vec&lt;usize&gt;&gt;,
+ filters: &amp;[Expr],
+ limit: Option&lt;usize&gt;,
+ ) -&gt; Result&lt;Arc&lt;dyn ExecutionPlan&gt;&gt; {
+ // Build and return an ExecutionPlan -- keep this lightweight!
+ Ok(Arc::new(MyExecPlan::new(
+ Arc::clone(&amp;self.schema),
+ projection,
+ limit,
+ )))
+ }
+}
+</code></pre>
+<p>The <code>scan</code> method is the heart of
<code>TableProvider</code>. It receives three pushdown
+hints from the optimizer, each reducing the amount of data your source needs
+to produce:</p>
+<ul>
+<li><strong><code>projection</code></strong> --
Which columns are needed. This reduces the <strong>width</strong> of
+ the output. If your source supports it, read only these columns rather than
+ the full schema.</li>
+<li><strong><code>filters</code></strong> --
Predicates the engine would like you to apply during the
+ scan. This reduces the <strong>number of rows</strong> by
skipping data that does not
+ match. Implement <code>supports_filters_pushdown</code> to
advertise which filters you
+ can handle.</li>
+<li><strong><code>limit</code></strong> -- A row
count cap. This also reduces the <strong>number of rows</strong> --
+ if you can stop reading early once you have produced enough rows, this avoids
+ unnecessary work.</li>
+</ul>
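+<p>To make the three hints concrete, here is a stdlib-only sketch of what a
source might do with them (the tuple row layout and <code>min_amount</code>
filter are made up for illustration; this is not the DataFusion API):</p>

```rust
// Rows are (customer_id, amount, region). The source applies a pushed-down
// filter, stops at the limit, and keeps only the projected columns.
fn scan_rows(
    rows: &[(u64, i64, &str)],
    projection: Option<&[usize]>, // column indices to keep (narrower rows)
    min_amount: i64,              // stand-in for a pushed-down filter
    limit: Option<usize>,         // stop reading early
) -> Vec<Vec<String>> {
    rows.iter()
        .filter(|(_, amount, _)| *amount >= min_amount) // fewer rows
        .take(limit.unwrap_or(usize::MAX))              // early termination
        .map(|(id, amount, region)| {
            let all = vec![id.to_string(), amount.to_string(), region.to_string()];
            match projection {
                Some(cols) => cols.iter().map(|&c| all[c].clone()).collect(),
                None => all,
            }
        })
        .collect()
}

fn main() {
    let rows = [(1, 50, "eu"), (2, 500, "us"), (3, 700, "us"), (4, 900, "eu")];
    // Columns 0 and 1 only, amount >= 100, at most 2 rows.
    let out = scan_rows(&rows, Some(&[0, 1]), 100, Some(2));
    assert_eq!(out, vec![vec!["2".to_string(), "500".to_string()],
                         vec!["3".to_string(), "700".to_string()]]);
}
```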
+<h3 id="keep-scan-lightweight">Keep <code>scan()</code>
Lightweight<a class="headerlink" href="#keep-scan-lightweight"
title="Permanent link">¶</a></h3>
+<p>This is a critical point:
<strong><code>scan()</code> runs during planning, not
execution.</strong> It
+should return quickly. Avoid performing I/O, network
+calls, or heavy computation here. The <code>scan</code> method's
job is to <em>describe</em> how
+the data will be produced, not to produce it. All the real work belongs in the
+stream (Layer 3).</p>
+<p>A common pitfall is to fetch data or open connections in
<code>scan()</code>. This blocks
+the planning thread and can cause timeouts or deadlocks, especially if the
query
+involves multiple tables or subqueries that all need to be planned before
+execution begins.</p>
+<h3 id="existing-implementations-to-learn-from">Existing Implementations
to Learn From<a class="headerlink"
href="#existing-implementations-to-learn-from" title="Permanent
link">¶</a></h3>
+<p>DataFusion ships several <code>TableProvider</code>
implementations that are excellent
+references:</p>
+<ul>
+<li><strong>[<code>MemTable</code>]</strong> --
Holds data in memory as
<code>Vec&lt;RecordBatch&gt;</code>. The simplest
+ possible provider; great for tests and small datasets.</li>
+<li><strong>[<code>StreamTable</code>]</strong>
-- Wraps a user-provided stream factory. Useful when your
+ data arrives as a continuous stream (e.g., from Kafka or a
socket).</li>
+<li><strong>[<code>SortedTableProvider</code>]</strong>
-- Wraps another <code>TableProvider</code> and advertises a
+ known sort order, enabling the optimizer to skip redundant sorts.</li>
+</ul>
+<h2 id="layer-2-executionplan">Layer 2: ExecutionPlan<a
class="headerlink" href="#layer-2-executionplan" title="Permanent
link">¶</a></h2>
+<hr/>
+<p>An [<code>ExecutionPlan</code>] is a node in the physical
query plan tree. Your table
+provider's <code>scan()</code> method returns one. The required
methods are:</p>
+<pre><code class="language-rust">impl ExecutionPlan for MyExecPlan
{
+ fn name(&amp;self) -&gt; &amp;str { "MyExecPlan" }
+
+ fn as_any(&amp;self) -&gt; &amp;dyn Any { self }
+
+ fn properties(&amp;self) -&gt; &amp;PlanProperties {
+ &amp;self.properties
+ }
+
+ fn children(&amp;self) -&gt; Vec&lt;&amp;Arc&lt;dyn
ExecutionPlan&gt;&gt; {
+ vec![] // Leaf node -- no children
+ }
+
+ fn with_new_children(
+ self: Arc&lt;Self&gt;,
+ children: Vec&lt;Arc&lt;dyn ExecutionPlan&gt;&gt;,
+ ) -&gt; Result&lt;Arc&lt;dyn ExecutionPlan&gt;&gt; {
+ assert!(children.is_empty());
+ Ok(self)
+ }
+
+ fn execute(
+ &amp;self,
+ partition: usize,
+ context: Arc&lt;TaskContext&gt;,
+ ) -&gt; Result&lt;SendableRecordBatchStream&gt; {
+ // This is where you build and return your stream
+ // ...
+ }
+}
+</code></pre>
+<p>The key properties to set correctly in
[<code>PlanProperties</code>] are <strong>output
+partitioning</strong> and <strong>output
ordering</strong>.</p>
+<p><strong>Output partitioning</strong> tells the engine how
many partitions your data has,
+which determines parallelism. If your source naturally partitions data (e.g.,
+by file or by shard), expose that here.</p>
+<p><strong>Output ordering</strong> declares whether your
data is naturally sorted. This
+enables the optimizer to avoid inserting a <code>SortExec</code>
when a query requires
+ordered data. Getting this right can be a significant performance
win.</p>
+<h3 id="partitioning-strategies">Partitioning Strategies<a
class="headerlink" href="#partitioning-strategies" title="Permanent
link">¶</a></h3>
+<p>Since <code>execute()</code> is called once per
partition, partitioning directly controls
+the parallelism of your table scan. Each partition runs on its own task, so
+more partitions means more concurrent work -- up to the number of available
+cores.</p>
+<p>Consider how your data source naturally divides its data:</p>
+<ul>
+<li><strong>By file or object:</strong> If you are reading
from S3, each file can be a
+ partition. DataFusion will read them in parallel.</li>
+<li><strong>By shard or region:</strong> If your source is a
sharded database, each shard
+ maps naturally to a partition.</li>
+<li><strong>By key range:</strong> If your data is keyed
(e.g., by timestamp or customer ID),
+ you can split it into ranges.</li>
+</ul>
+<p>Getting partitioning right matters because it affects everything
downstream in
+the plan. When DataFusion needs to perform an aggregation or join, it
+repartitions data by hashing the relevant columns. If your source already
+produces data partitioned by the join or group-by key, DataFusion can skip the
+repartition step entirely -- avoiding a potentially expensive
shuffle.</p>
+<p>For example, if you are building a table provider for a system that
stores
+data partitioned by <code>customer_id</code>, and a common query
groups by <code>customer_id</code>:</p>
+<pre><code class="language-sql">SELECT customer_id, SUM(amount)
+FROM my_table
+GROUP BY customer_id;
+</code></pre>
+<p>If you declare your output partitioning as
<code>Hash([customer_id], N)</code>, the
+optimizer recognizes that the data is already distributed correctly for the
+aggregation and eliminates the <code>RepartitionExec</code> that
would otherwise appear
+in the plan. You can verify this with <code>EXPLAIN</code> (more
on this below).</p>
+<p>Conversely, if you report
<code>UnknownPartitioning</code>, DataFusion must assume the
+worst case and will always insert repartitioning operators as needed.</p>
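+<p>The property that makes this work can be shown with only the standard
library (<code>DefaultHasher</code> here is just an illustration; DataFusion
uses its own hashing internally):</p>

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Assign a row to one of `n` partitions by hashing its key -- the same idea
// a provider declares with Hash([customer_id], n) output partitioning.
fn partition_for(customer_id: u64, n: u64) -> u64 {
    let mut h = DefaultHasher::new();
    customer_id.hash(&mut h);
    h.finish() % n
}

fn main() {
    let n = 4u64;
    let rows = [(7u64, 10i64), (42, 5), (7, 20), (42, 1), (7, 30)];
    let mut buckets: Vec<Vec<(u64, i64)>> = vec![Vec::new(); n as usize];
    for row in rows {
        buckets[partition_for(row.0, n) as usize].push(row);
    }
    // Every row for customer 7 lands in exactly one bucket, so
    // SUM(amount) GROUP BY customer_id needs no cross-partition shuffle.
    let buckets_with_7 = buckets.iter().filter(|b| b.iter().any(|r| r.0 == 7)).count();
    assert_eq!(buckets_with_7, 1);
}
```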
+<h3 id="keep-execute-lightweight-too">Keep
<code>execute()</code> Lightweight Too<a class="headerlink"
href="#keep-execute-lightweight-too" title="Permanent
link">¶</a></h3>
+<p>Like <code>scan()</code>, the
<code>execute()</code> method should construct and return a stream
+without doing heavy work. The actual data production happens when the stream
+is polled. Do not block on async operations here -- build the stream and let
+the runtime drive it.</p>
+<h3 id="existing-implementations-to-learn-from_1">Existing
Implementations to Learn From<a class="headerlink"
href="#existing-implementations-to-learn-from_1" title="Permanent
link">¶</a></h3>
+<ul>
+<li><strong>[<code>StreamingTableExec</code>]</strong>
-- Executes a streaming table scan. It takes a
+ stream factory (a closure that produces streams) and handles partitioning.
+ Good reference for wrapping external streams.</li>
+<li><strong>[<code>DataSourceExec</code>]</strong>
-- The execution plan behind DataFusion's built-in file
+ scanning (Parquet, CSV, JSON). It demonstrates sophisticated partitioning,
+ filter pushdown, and projection pushdown.</li>
+</ul>
+<h2 id="layer-3-sendablerecordbatchstream">Layer 3:
SendableRecordBatchStream<a class="headerlink"
href="#layer-3-sendablerecordbatchstream" title="Permanent
link">¶</a></h2>
+<hr/>
+<p>[<code>SendableRecordBatchStream</code>] is where the
real work happens. It is defined as:</p>
+<pre><code class="language-rust">// RecordBatchStream is a Stream of Result&lt;RecordBatch&gt; that also reports its schema.
+type SendableRecordBatchStream =
+    Pin&lt;Box&lt;dyn RecordBatchStream + Send&gt;&gt;;
+</code></pre>
+<p>This is an async stream of <code>RecordBatch</code>es
that can be sent across threads. When
+the DataFusion runtime polls this stream, your code runs: reading files,
calling
+APIs, transforming data, etc.</p>
+<h3 id="using-recordbatchstreamadapter">Using
RecordBatchStreamAdapter<a class="headerlink"
href="#using-recordbatchstreamadapter" title="Permanent
link">¶</a></h3>
+<p>The easiest way to create a
<code>SendableRecordBatchStream</code> is with
+[<code>RecordBatchStreamAdapter</code>]. It bridges any
<code>futures::Stream&lt;Item =
+Result&lt;RecordBatch&gt;&gt;</code> into the
<code>SendableRecordBatchStream</code> type:</p>
+<pre><code class="language-rust">use
datafusion::physical_plan::stream::RecordBatchStreamAdapter;
+
+fn execute(
+ &amp;self,
+ partition: usize,
+ context: Arc&lt;TaskContext&gt;,
+) -&gt; Result&lt;SendableRecordBatchStream&gt; {
+ let schema = self.schema();
+ let config = self.config.clone();
+
+ let stream = futures::stream::once(async move {
+ // ALL the heavy work happens here, inside the stream:
+ // - Open connections
+ // - Read data from external sources
+ // - Transform and batch the results
+ let batches = fetch_data_from_source(&amp;config).await?;
+ Ok(batches)
+ })
+ .flat_map(|result| match result {
+ // fetch_data_from_source returns Vec&lt;RecordBatch&gt;; emit one stream
+ // item per batch rather than a single item holding the whole Vec.
+ Ok(batches) =&gt; futures::stream::iter(batches.into_iter().map(Ok).collect::&lt;Vec&lt;_&gt;&gt;()),
+ Err(e) =&gt; futures::stream::iter(vec![Err(e)]),
+ });
+
+ Ok(Box::pin(RecordBatchStreamAdapter::new(schema, stream)))
+}
+</code></pre>
+<h3 id="cpu-intensive-work-use-a-separate-thread-pool">CPU-Intensive
Work: Use a Separate Thread Pool<a class="headerlink"
href="#cpu-intensive-work-use-a-separate-thread-pool" title="Permanent
link">¶</a></h3>
+<p>If your stream performs CPU-intensive work (parsing, decompression,
complex
+transformations), avoid blocking the tokio runtime. Instead, offload to a
+dedicated thread pool and send results back through a channel:</p>
+<pre><code class="language-rust">fn execute(
+ &amp;self,
+ partition: usize,
+ context: Arc&lt;TaskContext&gt;,
+) -&gt; Result&lt;SendableRecordBatchStream&gt; {
+ let schema = self.schema();
+ let config = self.config.clone();
+
+ let (tx, rx) = tokio::sync::mpsc::channel(2);
+
+ // Spawn CPU-heavy work on a blocking thread pool
+ tokio::task::spawn_blocking(move || {
+ let batches = generate_data(&amp;config);
+ for batch in batches {
+ if tx.blocking_send(Ok(batch)).is_err() {
+ break; // Receiver dropped, query was cancelled
+ }
+ }
+ });
+
+ let stream = tokio_stream::wrappers::ReceiverStream::new(rx);
+ Ok(Box::pin(RecordBatchStreamAdapter::new(schema, stream)))
+}
+</code></pre>
+<p>This pattern keeps the async runtime responsive while your data
generation
+runs on its own threads.</p>
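The same bounded-channel pattern can be sketched with only the Rust standard library, independent of tokio. This is an illustration, not DataFusion code: the `Batch` alias, `produce_batches`, and the batch sizes are stand-ins, and `sync_channel` plays the role of the bounded tokio mpsc channel (the producer blocks once the queue is full, and a dropped receiver ends it early).

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Stand-in for a RecordBatch: just a vector of values.
type Batch = Vec<i64>;

fn produce_batches() -> Vec<Batch> {
    // Simulate CPU-heavy generation of three 10-row batches.
    (0..3).map(|b| (b * 10..b * 10 + 10).collect()).collect()
}

// Runs the producer on its own thread and drains the channel,
// returning the total number of rows consumed.
fn run_pipeline() -> usize {
    // Bounded channel of capacity 2: the producer blocks once two
    // batches are queued, mirroring the backpressure the async
    // version gets from the bounded tokio channel.
    let (tx, rx) = sync_channel::<Batch>(2);

    let worker = thread::spawn(move || {
        for batch in produce_batches() {
            // A dropped receiver (cancelled query) ends the producer.
            if tx.send(batch).is_err() {
                break;
            }
        }
    });

    let total_rows = rx.iter().map(|batch| batch.len()).sum();
    worker.join().unwrap();
    total_rows
}

fn main() {
    println!("consumed {} rows", run_pipeline());
}
```

The capacity of the channel is the knob: a small bound keeps memory flat when the consumer is slower than the producer.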
+<h2 id="where-should-the-work-happen">Where Should the Work Happen?<a
class="headerlink" href="#where-should-the-work-happen" title="Permanent
link">¶</a></h2>
+<hr/>
+<p>This table summarizes what belongs at each layer:</p>
+<table class="table">
+<thead>
+<tr>
+<th>Layer</th>
+<th>Runs During</th>
+<th>Should Do</th>
+<th>Should NOT Do</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td><code>TableProvider::scan()</code></td>
+<td>Planning</td>
+<td>Build an <code>ExecutionPlan</code> with
metadata</td>
+<td>I/O, network calls, heavy computation</td>
+</tr>
+<tr>
+<td><code>ExecutionPlan::execute()</code></td>
+<td>Execution (once per partition)</td>
+<td>Construct a stream, set up channels</td>
+<td>Block on async work, read data</td>
+</tr>
+<tr>
+<td><code>RecordBatchStream</code> (polling)</td>
+<td>Execution</td>
+<td>All I/O, computation, data production</td>
+<td>--</td>
+</tr>
+</tbody>
+</table>
+<p>The guiding principle: <strong>push work as late as
possible.</strong> Planning should be
+fast so the optimizer can do its job. Execution setup should be fast so all
+partitions can start promptly. The stream is where you spend time producing
+data.</p>
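The "push work as late as possible" principle can be illustrated with a toy, standard-library-only sketch (none of these names are DataFusion APIs): "planning" cheaply returns a lazy iterator that describes the scan, and rows are only produced when a downstream consumer pulls them, so a LIMIT-style consumer never triggers the full workload.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts how many rows have actually been produced.
static ROWS_PRODUCED: AtomicUsize = AtomicUsize::new(0);

// "Planning": cheap -- it only describes how to produce rows,
// returning a lazy iterator instead of materialized data.
fn plan_scan(rows: usize) -> impl Iterator<Item = i64> {
    (0..rows as i64).map(|v| {
        // The "heavy work" happens here, per row, during polling.
        ROWS_PRODUCED.fetch_add(1, Ordering::SeqCst);
        v * 2
    })
}

fn main() {
    let stream = plan_scan(1_000);
    // Planning itself did no work:
    assert_eq!(ROWS_PRODUCED.load(Ordering::SeqCst), 0);

    // A downstream LIMIT-style consumer takes only 10 rows...
    let first_ten: Vec<i64> = stream.take(10).collect();
    assert_eq!(first_ten.len(), 10);

    // ...so only 10 rows were ever produced.
    assert_eq!(ROWS_PRODUCED.load(Ordering::SeqCst), 10);
    println!("produced {} rows", ROWS_PRODUCED.load(Ordering::SeqCst));
}
```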
+<h3 id="why-this-matters">Why This Matters<a class="headerlink"
href="#why-this-matters" title="Permanent link">¶</a></h3>
+<p>When <code>scan()</code> does heavy work, several
problems arise:</p>
+<ol>
+<li><strong>Planning becomes slow.</strong> If a query
touches 10 tables and each <code>scan()</code>
+ takes 500ms, planning alone takes 5 seconds before any data
flows.</li>
+<li><strong>The optimizer cannot help.</strong> The
optimizer runs between planning and
+ execution. If you have already fetched data during planning, optimizations
+ like predicate pushdown or partition pruning cannot reduce the
work.</li>
+<li><strong>Resource management breaks down.</strong>
DataFusion manages concurrency and
+ memory during execution. Work done during planning bypasses these
controls.</li>
+</ol>
+<h2 id="filter-pushdown-doing-less-work">Filter Pushdown: Doing Less
Work<a class="headerlink" href="#filter-pushdown-doing-less-work"
title="Permanent link">¶</a></h2>
+<hr/>
+<p>One of the most impactful optimizations you can add to a custom table
provider
+is <strong>filter pushdown</strong> -- letting the source skip
data that the query does not
+need, rather than reading everything and filtering it afterward.</p>
+<h3 id="how-filter-pushdown-works">How Filter Pushdown Works<a
class="headerlink" href="#how-filter-pushdown-works" title="Permanent
link">¶</a></h3>
+<p>When DataFusion plans a query with a <code>WHERE</code>
clause, it passes the filter
+predicates to your <code>scan()</code> method as the
<code>filters</code> parameter. By default,
+DataFusion assumes your provider cannot handle any filters and inserts a
+<code>FilterExec</code> node above your scan to apply them. But if
your source <em>can</em>
+evaluate some predicates during scanning -- for example, by skipping files,
+partitions, or row groups that cannot match -- you can eliminate a huge amount
+of unnecessary I/O.</p>
+<p>To opt in, implement
<code>supports_filters_pushdown</code>:</p>
+<pre><code class="language-rust">fn supports_filters_pushdown(
+ &amp;self,
+ filters: &amp;[&amp;Expr],
+) -&gt;
Result&lt;Vec&lt;TableProviderFilterPushDown&gt;&gt; {
+ Ok(filters.iter().map(|f| {
+ match f {
+ // We can fully evaluate equality filters on
+ // the partition column at the source
+ Expr::BinaryExpr(BinaryExpr {
+ left, op: Operator::Eq, right
+ }) if is_partition_column(left) || is_partition_column(right)
=&gt; {
+ TableProviderFilterPushDown::Exact
+ }
+ // All other filters: let DataFusion handle them
+ _ =&gt; TableProviderFilterPushDown::Unsupported,
+ }
+ }).collect())
+}
+</code></pre>
+<p>The three possible responses for each filter are:</p>
+<ul>
+<li><strong><code>Exact</code></strong> -- Your
source guarantees that no output rows will have a false
+ value for this predicate. Because the filter is fully evaluated at the
source,
+ DataFusion will <strong>not</strong> add a
<code>FilterExec</code> for it.</li>
+<li><strong><code>Inexact</code></strong> --
Your source has the ability to reduce the data produced, but
+ the output may still include rows that do not satisfy the predicate. For
+ example, you might skip entire files based on metadata statistics but not
+ filter individual rows within a file. DataFusion will still add a
<code>FilterExec</code>
+ above your scan to remove any remaining rows that slipped through.</li>
+<li><strong><code>Unsupported</code></strong> --
Your source ignores this filter entirely. DataFusion
+ handles it.</li>
+</ul>
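The `Inexact` case can be made concrete with a small standard-library sketch (the `FileStats` type and min/max values are hypothetical, not a DataFusion API): file-level statistics prune files that cannot possibly match, but the surviving files may still contain non-matching rows, which is exactly why DataFusion keeps a `FilterExec` above an `Inexact` source.

```rust
// Hypothetical per-file min/max statistics for one column, plus rows.
struct FileStats {
    min: i64,
    max: i64,
    rows: Vec<i64>,
}

// Inexact pushdown: skip files whose statistics prove the predicate
// `value == target` cannot match. Surviving files may still contain
// non-matching rows, so the engine must filter again afterward.
fn prune_files(files: &[FileStats], target: i64) -> Vec<&FileStats> {
    files
        .iter()
        .filter(|f| f.min <= target && target <= f.max)
        .collect()
}

fn main() {
    let files = vec![
        FileStats { min: 0, max: 9, rows: (0..10).collect() },
        FileStats { min: 10, max: 19, rows: (10..20).collect() },
        FileStats { min: 20, max: 29, rows: (20..30).collect() },
    ];
    let target = 12;

    // Source-side pruning reads only 1 of 3 files...
    let survivors = prune_files(&files, target);
    assert_eq!(survivors.len(), 1);

    // ...but that file still holds rows != 12, so a FilterExec-style
    // pass is required to produce the exact answer.
    let matches: Vec<i64> = survivors
        .iter()
        .flat_map(|f| f.rows.iter().copied())
        .filter(|&v| v == target)
        .collect();
    assert_eq!(matches, vec![12]);
    println!("pruned to {} file(s)", survivors.len());
}
```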
+<h3 id="why-filter-pushdown-matters">Why Filter Pushdown Matters<a
class="headerlink" href="#why-filter-pushdown-matters" title="Permanent
link">¶</a></h3>
+<p>Consider a table with 1 billion rows partitioned by
<code>region</code>, and a query:</p>
+<pre><code class="language-sql">SELECT * FROM events WHERE region
= 'us-east-1' AND event_type = 'click';
+</code></pre>
+<p><strong>Without filter pushdown:</strong> Your table
provider reads all 1 billion rows
+across all regions. DataFusion then applies both filters, discarding the vast
+majority of the data.</p>
+<p><strong>With filter pushdown on
<code>region</code>:</strong> Your
<code>scan()</code> method sees the
+<code>region = 'us-east-1'</code> filter and constructs an
execution plan that only reads
+the <code>us-east-1</code> partition. If that partition holds 100
million rows, you have
+just eliminated 90% of the I/O. DataFusion still applies the
<code>event_type</code>
+filter via <code>FilterExec</code> if you reported it as
<code>Unsupported</code>.</p>
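The arithmetic behind that claim is simple enough to check directly. A minimal sketch, using the hypothetical row counts from the example above:

```rust
// Rows scanned with and without partition pruning on `region`.
fn rows_scanned(total_rows: u64, matching_partition_rows: u64, pushdown: bool) -> u64 {
    if pushdown { matching_partition_rows } else { total_rows }
}

fn main() {
    let total = 1_000_000_000_u64;  // all regions
    let us_east = 100_000_000_u64;  // the us-east-1 partition

    let without = rows_scanned(total, us_east, false);
    let with = rows_scanned(total, us_east, true);

    // Pushdown on `region` eliminates 90% of the I/O.
    let saved_pct = 100 * (without - with) / without;
    assert_eq!(saved_pct, 90);
    println!("scanned {with} rows instead of {without} ({saved_pct}% less I/O)");
}
```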
+<h3 id="using-explain-to-debug-your-table-provider">Using EXPLAIN to
Debug Your Table Provider<a class="headerlink"
href="#using-explain-to-debug-your-table-provider" title="Permanent
link">¶</a></h3>
+<p>The <code>EXPLAIN</code> statement is your best tool for
understanding what DataFusion is
+actually doing with your table provider. It shows the physical plan that
+DataFusion will execute, including any operators it inserted:</p>
+<pre><code class="language-sql">EXPLAIN SELECT * FROM events WHERE
region = 'us-east-1' AND event_type = 'click';
+</code></pre>
+<p>If you are using DataFrames, call <code>.explain(false, false)</code> to print
+both the logical and physical plans, or <code>.explain(true, false)</code> for the
+verbose versions. The second argument is <code>analyze</code>: passing
+<code>.explain(false, true)</code> executes the query and reports runtime metrics,
+like <code>EXPLAIN ANALYZE</code> in SQL.</p>
+<p><strong>Before filter pushdown</strong>, the plan might
look like:</p>
+<pre><code class="language-text">FilterExec: region@0 = us-east-1
AND event_type@1 = click
+ MyExecPlan: partitions=50
+</code></pre>
+<p>Here DataFusion is reading all 50 partitions and filtering everything
+afterward. The <code>FilterExec</code> above your scan is doing
all the predicate work.</p>
+<p><strong>After implementing pushdown for
<code>region</code></strong> (reported as
<code>Exact</code>):</p>
+<pre><code class="language-text">FilterExec: event_type@1 = click
+ MyExecPlan: partitions=5, filter=[region = us-east-1]
+</code></pre>
+<p>Now your exec reads only the 5 partitions for
<code>us-east-1</code>, and the remaining
+<code>FilterExec</code> only handles the
<code>event_type</code> predicate. The
<code>region</code> filter has
+been fully absorbed by your scan.</p>
+<p><strong>After implementing pushdown for both
filters</strong> (both <code>Exact</code>):</p>
+<pre><code class="language-text">MyExecPlan: partitions=5,
filter=[region = us-east-1 AND event_type = click]
+</code></pre>
+<p>No <code>FilterExec</code> at all -- your source handles
everything.</p>
+<p>Similarly, <code>EXPLAIN</code> will reveal whether
DataFusion is inserting unnecessary
+<code>SortExec</code> or <code>RepartitionExec</code>
nodes that you could eliminate by declaring
+better output properties. Whenever your queries seem slower than expected,
+<code>EXPLAIN</code> is the first place to look.</p>
+<h2 id="putting-it-all-together">Putting It All Together<a
class="headerlink" href="#putting-it-all-together" title="Permanent
link">¶</a></h2>
+<hr/>
+<p>Here is a minimal but complete example of a custom table provider
that generates
+data lazily during streaming:</p>
+<pre><code class="language-rust">use std::any::Any;
+use std::fmt;
+use std::sync::Arc;
+
+use arrow::array::Int64Array;
+use arrow::datatypes::{DataType, Field, Schema, SchemaRef};
+use arrow::record_batch::RecordBatch;
+use datafusion::catalog::{Session, TableProvider};
+use datafusion::common::Result;
+use datafusion::datasource::TableType;
+use datafusion::execution::{SendableRecordBatchStream, TaskContext};
+use datafusion::logical_expr::Expr;
+use datafusion::physical_expr::EquivalenceProperties;
+use datafusion::physical_plan::execution_plan::{Boundedness, EmissionType};
+use datafusion::physical_plan::stream::RecordBatchStreamAdapter;
+use datafusion::physical_plan::{
+    DisplayAs, DisplayFormatType, ExecutionPlan, Partitioning, PlanProperties,
+};
+use futures::stream;
+
+/// A table provider that generates sequential numbers on demand.
+#[derive(Debug)]
+struct CountingTable {
+ schema: SchemaRef,
+ num_partitions: usize,
+ rows_per_partition: usize,
+}
+
+impl CountingTable {
+ fn new(num_partitions: usize, rows_per_partition: usize) -&gt; Self {
+ let schema = Arc::new(Schema::new(vec![
+ Field::new("partition", DataType::Int64, false),
+ Field::new("value", DataType::Int64, false),
+ ]));
+ Self { schema, num_partitions, rows_per_partition }
+ }
+}
+
+#[async_trait::async_trait]
+impl TableProvider for CountingTable {
+ fn as_any(&amp;self) -&gt; &amp;dyn Any { self }
+ fn schema(&amp;self) -&gt; SchemaRef {
Arc::clone(&amp;self.schema) }
+ fn table_type(&amp;self) -&gt; TableType { TableType::Base }
+
+ async fn scan(
+ &amp;self,
+ _state: &amp;dyn Session,
+        _projection: Option&lt;&amp;Vec&lt;usize&gt;&gt;,
+        _filters: &amp;[Expr],
+        limit: Option&lt;usize&gt;,
+    ) -&gt; Result&lt;Arc&lt;dyn ExecutionPlan&gt;&gt; {
+        // Light work only: build the plan with metadata.
+        // (For brevity this example ignores the projection; a real
+        // provider should narrow its schema and columns here.)
+ Ok(Arc::new(CountingExec {
+ schema: Arc::clone(&amp;self.schema),
+ num_partitions: self.num_partitions,
+ rows_per_partition: limit
+ .unwrap_or(self.rows_per_partition)
+ .min(self.rows_per_partition),
+ properties: PlanProperties::new(
+ EquivalenceProperties::new(Arc::clone(&amp;self.schema)),
+ Partitioning::UnknownPartitioning(self.num_partitions),
+ EmissionType::Incremental,
+ Boundedness::Bounded,
+ ),
+ }))
+ }
+}
+
+#[derive(Debug)]
+struct CountingExec {
+    schema: SchemaRef,
+    num_partitions: usize,
+    rows_per_partition: usize,
+    properties: PlanProperties,
+}
+
+// ExecutionPlan requires Debug and DisplayAs implementations,
+// so describe how this node appears in EXPLAIN output.
+impl DisplayAs for CountingExec {
+    fn fmt_as(
+        &amp;self,
+        _t: DisplayFormatType,
+        f: &amp;mut fmt::Formatter,
+    ) -&gt; fmt::Result {
+        write!(f, "CountingExec")
+    }
+}
+
+impl ExecutionPlan for CountingExec {
+ fn name(&amp;self) -&gt; &amp;str { "CountingExec" }
+ fn as_any(&amp;self) -&gt; &amp;dyn Any { self }
+ fn properties(&amp;self) -&gt; &amp;PlanProperties {
&amp;self.properties }
+ fn children(&amp;self) -&gt; Vec&lt;&amp;Arc&lt;dyn
ExecutionPlan&gt;&gt; { vec![] }
+
+ fn with_new_children(
+ self: Arc&lt;Self&gt;,
+ _children: Vec&lt;Arc&lt;dyn ExecutionPlan&gt;&gt;,
+ ) -&gt; Result&lt;Arc&lt;dyn ExecutionPlan&gt;&gt; {
+ Ok(self)
+ }
+
+ fn execute(
+ &amp;self,
+ partition: usize,
+ _context: Arc&lt;TaskContext&gt;,
+ ) -&gt; Result&lt;SendableRecordBatchStream&gt; {
+ let schema = Arc::clone(&amp;self.schema);
+ let rows = self.rows_per_partition;
+
+ // The heavy work (data generation) happens inside the stream,
+ // not here in execute().
+ let batch_stream = stream::once(async move {
+ let partitions = Int64Array::from(
+ vec![partition as i64; rows],
+ );
+ let values = Int64Array::from(
+ (0..rows as
i64).collect::&lt;Vec&lt;_&gt;&gt;(),
+ );
+ let batch = RecordBatch::try_new(
+ Arc::clone(&amp;schema),
+ vec![Arc::new(partitions), Arc::new(values)],
+ )?;
+ Ok(batch)
+ });
+
+ Ok(Box::pin(RecordBatchStreamAdapter::new(
+ Arc::clone(&amp;self.schema),
+ batch_stream,
+ )))
+ }
+}
+</code></pre>
+<h2 id="choosing-the-right-starting-point">Choosing the Right Starting
Point<a class="headerlink" href="#choosing-the-right-starting-point"
title="Permanent link">¶</a></h2>
+<hr/>
+<p>Not every custom data source requires implementing all three layers
from
+scratch. DataFusion provides building blocks that let you plug in at whatever
+level makes sense:</p>
+<table class="table">
+<thead>
+<tr>
+<th>If your data is...</th>
+<th>Start with</th>
+<th>You implement</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>Already in <code>RecordBatch</code>es in
memory</td>
+<td><code>MemTable</code></td>
+<td>Nothing -- just construct it</td>
+</tr>
+<tr>
+<td>An async stream of batches</td>
+<td><code>StreamTable</code></td>
+<td>A stream factory</td>
+</tr>
+<tr>
+<td>A table with known sort order</td>
+<td><code>SortedTableProvider</code> wrapping another
provider</td>
+<td>The inner provider</td>
+</tr>
+<tr>
+<td>A custom source needing full control</td>
+<td><code>TableProvider</code> +
<code>ExecutionPlan</code> + stream</td>
+<td>All three layers</td>
+</tr>
+</tbody>
+</table>
+<p>For most integrations, <code>StreamTable</code> combined with
+<code>RecordBatchStreamAdapter</code> provides a good balance of
simplicity and
+flexibility. You provide a closure that returns a stream, and DataFusion
handles
+the rest.</p>
+<h2 id="acknowledgements">Acknowledgements<a class="headerlink"
href="#acknowledgements" title="Permanent link">¶</a></h2>
+<p>I would like to thank <a href="https://rerun.io">Rerun.io</a> for sponsoring the
+development of this work. <a href="https://rerun.io">Rerun.io</a> is building a data
+visualization system for Physical AI and makes heavy use of DataFusion table
+providers in its data analytics.</p>
+<h2 id="further-reading">Further Reading<a class="headerlink"
href="#further-reading" title="Permanent link">¶</a></h2>
+<hr/>
+<ul>
+<li><code>TableProvider</code> API docs</li>
+<li><code>ExecutionPlan</code> API docs</li>
+<li><code>SendableRecordBatchStream</code> API docs</li>
+<li><a
href="https://github.com/apache/datafusion/issues/16821">GitHub issue
discussing table provider examples</a></li>
+<li><a
href="https://github.com/apache/datafusion/tree/main/datafusion-examples/examples">DataFusion
examples directory</a> --
+ contains working examples including custom table providers</li>
+</ul>
+<hr/>
+<p><em>Note: Portions of this blog post were written with the
assistance of an AI agent.</em></p></content><category
term="blog"></category></entry></feed>
\ No newline at end of file
diff --git a/blog/feeds/tim-saucer-rerunio.rss.xml
b/blog/feeds/tim-saucer-rerunio.rss.xml
new file mode 100644
index 0000000..1a357d5
--- /dev/null
+++ b/blog/feeds/tim-saucer-rerunio.rss.xml
@@ -0,0 +1,24 @@
+<?xml version="1.0" encoding="utf-8"?>
+<rss version="2.0"><channel><title>Apache DataFusion Blog - Tim Saucer
(rerun.io)</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Fri,
20 Mar 2026 00:00:00 +0000</lastBuildDate><item><title>Writing Custom Table
Providers in Apache
DataFusion</title><link>https://datafusion.apache.org/blog/2026/03/20/writing-table-providers</link><description><!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<p>One of DataFusion's greatest strengths is its extensibility. If your
data lives
+in a custom format, behind an API, or in a system that DataFusion does not
+natively support, you can teach DataFusion to read it by implementing a
+<strong>custom table provider</strong>. This post walks through
the three layers you …</p></description><dc:creator
xmlns:dc="http://purl.org/dc/elements/1.1/">Tim Saucer
(rerun.io)</dc:creator><pubDate>Fri, 20 Mar 2026 00:00:00 +0000</pubDate><guid
isPermaLink="false">tag:datafusion.apache.org,2026-03-20:/blog/2026/03/20/writing-table-providers</guid><category>blog</category></item></channel></rss>
\ No newline at end of file
diff --git a/blog/feeds/timsaucer.atom.xml b/blog/feeds/timsaucer.atom.xml
index 2637a9c..268635c 100644
--- a/blog/feeds/timsaucer.atom.xml
+++ b/blog/feeds/timsaucer.atom.xml
@@ -1,600 +1,5 @@
<?xml version="1.0" encoding="utf-8"?>
-<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog -
timsaucer</title><link href="https://datafusion.apache.org/blog/"
rel="alternate"></link><link
href="https://datafusion.apache.org/blog/feeds/timsaucer.atom.xml"
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2026-03-20T00:00:00+00:00</updated><subtitle></subtitle><entry><title>Writing
Custom Table Providers in Apache DataFusion</title><link
href="https://datafusion.apache.org/blog/2026/03/2 [...]
-{% comment %}
-Licensed to the Apache Software Foundation (ASF) under one or more
-contributor license agreements. See the NOTICE file distributed with
-this work for additional information regarding copyright ownership.
-The ASF licenses this file to you under the Apache License, Version 2.0
-(the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-{% endcomment %}
--->
-
-<p>One of DataFusion's greatest strengths is its extensibility. If your
data lives
-in a custom format, behind an API, or in a system that DataFusion does not
-natively support, you can teach DataFusion to read it by implementing a
-<strong>custom table provider</strong>. This post walks through
the three layers you …</p></summary><content type="html"><!--
-{% comment %}
-Licensed to the Apache Software Foundation (ASF) under one or more
-contributor license agreements. See the NOTICE file distributed with
-this work for additional information regarding copyright ownership.
-The ASF licenses this file to you under the Apache License, Version 2.0
-(the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-{% endcomment %}
--->
-
-<p>One of DataFusion's greatest strengths is its extensibility. If your
data lives
-in a custom format, behind an API, or in a system that DataFusion does not
-natively support, you can teach DataFusion to read it by implementing a
-<strong>custom table provider</strong>. This post walks through
the three layers you need to
-understand and explains where your work should actually happen.</p>
-<h2 id="the-three-layers">The Three Layers<a class="headerlink"
href="#the-three-layers" title="Permanent link">¶</a></h2>
-<hr/>
-<p>When DataFusion executes a query against a table, three abstractions
collaborate
-to produce results:</p>
-<ol>
-<li><strong>[<code>TableProvider</code>]</strong>
-- Describes the table (schema, capabilities) and
- produces an execution plan when queried.</li>
-<li><strong>[<code>ExecutionPlan</code>]</strong>
-- Describes <em>how</em> to compute the result: partitioning,
- ordering, and child plan relationships.</li>
-<li><strong>[<code>SendableRecordBatchStream</code>]</strong>
-- The async stream that <em>actually does the
- work</em>, yielding <code>RecordBatch</code>es one at a
time.</li>
-</ol>
-<p>Think of these as a funnel:
<code>TableProvider::scan()</code> is called once during
-planning to create an <code>ExecutionPlan</code>, then
<code>ExecutionPlan::execute()</code> is called
-once per partition to create a stream, and those streams are where rows are
-actually produced during execution.</p>
-<h2 id="layer-1-tableprovider">Layer 1: TableProvider<a
class="headerlink" href="#layer-1-tableprovider" title="Permanent
link">¶</a></h2>
-<hr/>
-<p>A [<code>TableProvider</code>] represents a queryable
data source. For a minimal read-only
-table, you need four methods:</p>
-<pre><code class="language-rust">impl TableProvider for MyTable {
- fn as_any(&amp;self) -&gt; &amp;dyn Any { self }
-
- fn schema(&amp;self) -&gt; SchemaRef {
- Arc::clone(&amp;self.schema)
- }
-
- fn table_type(&amp;self) -&gt; TableType {
- TableType::Base
- }
-
- async fn scan(
- &amp;self,
- state: &amp;dyn Session,
- projection: Option&lt;&amp;Vec&lt;usize&gt;&gt;,
- filters: &amp;[Expr],
- limit: Option&lt;usize&gt;,
- ) -&gt; Result&lt;Arc&lt;dyn ExecutionPlan&gt;&gt; {
- // Build and return an ExecutionPlan -- keep this lightweight!
- Ok(Arc::new(MyExecPlan::new(
- Arc::clone(&amp;self.schema),
- projection,
- limit,
- )))
- }
-}
-</code></pre>
-<p>The <code>scan</code> method is the heart of
<code>TableProvider</code>. It receives three pushdown
-hints from the optimizer, each reducing the amount of data your source needs
-to produce:</p>
-<ul>
-<li><strong><code>projection</code></strong> --
Which columns are needed. This reduces the <strong>width</strong> of
- the output. If your source supports it, read only these columns rather than
- the full schema.</li>
-<li><strong><code>filters</code></strong> --
Predicates the engine would like you to apply during the
- scan. This reduces the <strong>number of rows</strong> by
skipping data that does not
- match. Implement <code>supports_filters_pushdown</code> to
advertise which filters you
- can handle.</li>
-<li><strong><code>limit</code></strong> -- A row
count cap. This also reduces the <strong>number of rows</strong> --
- if you can stop reading early once you have produced enough rows, this avoids
- unnecessary work.</li>
-</ul>
-<h3 id="keep-scan-lightweight">Keep <code>scan()</code>
Lightweight<a class="headerlink" href="#keep-scan-lightweight"
title="Permanent link">¶</a></h3>
-<p>This is a critical point:
<strong><code>scan()</code> runs during planning, not
execution.</strong> It
-should return quickly. Best practices are to avoid performing I/O, network
-calls, or heavy computation here. The <code>scan</code> method's
job is to <em>describe</em> how
-the data will be produced, not to produce it. All the real work belongs in the
-stream (Layer 3).</p>
-<p>A common pitfall is to fetch data or open connections in
<code>scan()</code>. This blocks
-the planning thread and can cause timeouts or deadlocks, especially if the
query
-involves multiple tables or subqueries that all need to be planned before
-execution begins.</p>
-<h3 id="existing-implementations-to-learn-from">Existing Implementations
to Learn From<a class="headerlink"
href="#existing-implementations-to-learn-from" title="Permanent
link">¶</a></h3>
-<p>DataFusion ships several <code>TableProvider</code>
implementations that are excellent
-references:</p>
-<ul>
-<li><strong>[<code>MemTable</code>]</strong> --
Holds data in memory as
<code>Vec&lt;RecordBatch&gt;</code>. The simplest
- possible provider; great for tests and small datasets.</li>
-<li><strong>[<code>StreamTable</code>]</strong>
-- Wraps a user-provided stream factory. Useful when your
- data arrives as a continuous stream (e.g., from Kafka or a
socket).</li>
-<li><strong>[<code>SortedTableProvider</code>]</strong>
-- Wraps another <code>TableProvider</code> and advertises a
- known sort order, enabling the optimizer to skip redundant sorts.</li>
-</ul>
-<h2 id="layer-2-executionplan">Layer 2: ExecutionPlan<a
class="headerlink" href="#layer-2-executionplan" title="Permanent
link">¶</a></h2>
-<hr/>
-<p>An [<code>ExecutionPlan</code>] is a node in the physical
query plan tree. Your table
-provider's <code>scan()</code> method returns one. The required
methods are:</p>
-<pre><code class="language-rust">impl ExecutionPlan for MyExecPlan
{
- fn name(&amp;self) -&gt; &amp;str { "MyExecPlan" }
-
- fn as_any(&amp;self) -&gt; &amp;dyn Any { self }
-
- fn properties(&amp;self) -&gt; &amp;PlanProperties {
- &amp;self.properties
- }
-
- fn children(&amp;self) -&gt; Vec&lt;&amp;Arc&lt;dyn
ExecutionPlan&gt;&gt; {
- vec![] // Leaf node -- no children
- }
-
- fn with_new_children(
- self: Arc&lt;Self&gt;,
- children: Vec&lt;Arc&lt;dyn ExecutionPlan&gt;&gt;,
- ) -&gt; Result&lt;Arc&lt;dyn ExecutionPlan&gt;&gt; {
- assert!(children.is_empty());
- Ok(self)
- }
-
- fn execute(
- &amp;self,
- partition: usize,
- context: Arc&lt;TaskContext&gt;,
- ) -&gt; Result&lt;SendableRecordBatchStream&gt; {
- // This is where you build and return your stream
- // ...
- }
-}
-</code></pre>
-<p>The key properties to set correctly in
[<code>PlanProperties</code>] are <strong>output
-partitioning</strong> and <strong>output
ordering</strong>.</p>
-<p><strong>Output partitioning</strong> tells the engine how
many partitions your data has,
-which determines parallelism. If your source naturally partitions data (e.g.,
-by file or by shard), expose that here.</p>
-<p><strong>Output ordering</strong> declares whether your
data is naturally sorted. This
-enables the optimizer to avoid inserting a <code>SortExec</code>
when a query requires
-ordered data. Getting this right can be a significant performance
win.</p>
-<h3 id="partitioning-strategies">Partitioning Strategies<a
class="headerlink" href="#partitioning-strategies" title="Permanent
link">¶</a></h3>
-<p>Since <code>execute()</code> is called once per
partition, partitioning directly controls
-the parallelism of your table scan. Each partition runs on its own task, so
-more partitions means more concurrent work -- up to the number of available
-cores.</p>
-<p>Consider how your data source naturally divides its data:</p>
-<ul>
-<li><strong>By file or object:</strong> If you are reading
from S3, each file can be a
- partition. DataFusion will read them in parallel.</li>
-<li><strong>By shard or region:</strong> If your source is a
sharded database, each shard
- maps naturally to a partition.</li>
-<li><strong>By key range:</strong> If your data is keyed
(e.g., by timestamp or customer ID),
- you can split it into ranges.</li>
-</ul>
-<p>Getting partitioning right matters because it affects everything
downstream in
-the plan. When DataFusion needs to perform an aggregation or join, it
-repartitions data by hashing the relevant columns. If your source already
-produces data partitioned by the join or group-by key, DataFusion can skip the
-repartition step entirely -- avoiding a potentially expensive
shuffle.</p>
-<p>For example, if you are building a table provider for a system that
stores
-data partitioned by <code>customer_id</code>, and a common query
groups by <code>customer_id</code>:</p>
-<pre><code class="language-sql">SELECT customer_id, SUM(amount)
-FROM my_table
-GROUP BY customer_id;
-</code></pre>
-<p>If you declare your output partitioning as
<code>Hash([customer_id], N)</code>, the
-optimizer recognizes that the data is already distributed correctly for the
-aggregation and eliminates the <code>RepartitionExec</code> that
would otherwise appear
-in the plan. You can verify this with <code>EXPLAIN</code> (more
on this below).</p>
-<p>Conversely, if you report
<code>UnknownPartitioning</code>, DataFusion must assume the
-worst case and will always insert repartitioning operators as needed.</p>
-<h3 id="keep-execute-lightweight-too">Keep
<code>execute()</code> Lightweight Too<a class="headerlink"
href="#keep-execute-lightweight-too" title="Permanent
link">¶</a></h3>
-<p>Like <code>scan()</code>, the
<code>execute()</code> method should construct and return a stream
-without doing heavy work. The actual data production happens when the stream
-is polled. Do not block on async operations here -- build the stream and let
-the runtime drive it.</p>
-<h3 id="existing-implementations-to-learn-from_1">Existing
Implementations to Learn From<a class="headerlink"
href="#existing-implementations-to-learn-from_1" title="Permanent
link">¶</a></h3>
-<ul>
-<li><strong>[<code>StreamingTableExec</code>]</strong>
-- Executes a streaming table scan. It takes a
- stream factory (a closure that produces streams) and handles partitioning.
- Good reference for wrapping external streams.</li>
-<li><strong>[<code>DataSourceExec</code>]</strong>
-- The execution plan behind DataFusion's built-in file
- scanning (Parquet, CSV, JSON). It demonstrates sophisticated partitioning,
- filter pushdown, and projection pushdown.</li>
-</ul>
-<h2 id="layer-3-sendablerecordbatchstream">Layer 3:
SendableRecordBatchStream<a class="headerlink"
href="#layer-3-sendablerecordbatchstream" title="Permanent
link">¶</a></h2>
-<hr/>
-<p>[<code>SendableRecordBatchStream</code>] is where the
real work happens. It is defined as:</p>
-<pre><code class="language-rust">type SendableRecordBatchStream =
- Pin&lt;Box&lt;dyn RecordBatchStream&lt;Item =
Result&lt;RecordBatch&gt;&gt; + Send&gt;&gt;;
-</code></pre>
-<p>This is an async stream of <code>RecordBatch</code>es
that can be sent across threads. When
-the DataFusion runtime polls this stream, your code runs: reading files,
calling
-APIs, transforming data, etc.</p>
-<h3 id="using-recordbatchstreamadapter">Using
RecordBatchStreamAdapter<a class="headerlink"
href="#using-recordbatchstreamadapter" title="Permanent
link">¶</a></h3>
-<p>The easiest way to create a
<code>SendableRecordBatchStream</code> is with
-[<code>RecordBatchStreamAdapter</code>]. It bridges any
<code>futures::Stream&lt;Item =
-Result&lt;RecordBatch&gt;&gt;</code> into the
<code>SendableRecordBatchStream</code> type:</p>
-<pre><code class="language-rust">use
datafusion::physical_plan::stream::RecordBatchStreamAdapter;
-
-fn execute(
- &amp;self,
- partition: usize,
- context: Arc&lt;TaskContext&gt;,
-) -&gt; Result&lt;SendableRecordBatchStream&gt; {
- let schema = self.schema();
- let config = self.config.clone();
-
- let stream = futures::stream::once(async move {
- // ALL the heavy work happens here, inside the stream:
- // - Open connections
- // - Read data from external sources
- // - Transform and batch the results
- let batches = fetch_data_from_source(&amp;config).await?;
- Ok(batches)
- })
- .flat_map(|result| match result {
- Ok(batch) =&gt; futures::stream::iter(vec![Ok(batch)]),
- Err(e) =&gt; futures::stream::iter(vec![Err(e)]),
- });
-
- Ok(Box::pin(RecordBatchStreamAdapter::new(schema, stream)))
-}
-</code></pre>
-<h3 id="cpu-intensive-work-use-a-separate-thread-pool">CPU-Intensive
Work: Use a Separate Thread Pool<a class="headerlink"
href="#cpu-intensive-work-use-a-separate-thread-pool" title="Permanent
link">¶</a></h3>
-<p>If your stream performs CPU-intensive work (parsing, decompression,
complex
-transformations), avoid blocking the tokio runtime. Instead, offload to a
-dedicated thread pool and send results back through a channel:</p>
-<pre><code class="language-rust">fn execute(
- &amp;self,
- partition: usize,
- context: Arc&lt;TaskContext&gt;,
-) -&gt; Result&lt;SendableRecordBatchStream&gt; {
- let schema = self.schema();
- let config = self.config.clone();
-
- let (tx, rx) = tokio::sync::mpsc::channel(2);
-
- // Spawn CPU-heavy work on a blocking thread pool
- tokio::task::spawn_blocking(move || {
- let batches = generate_data(&amp;config);
- for batch in batches {
- if tx.blocking_send(Ok(batch)).is_err() {
- break; // Receiver dropped, query was cancelled
- }
- }
- });
-
- let stream = tokio_stream::wrappers::ReceiverStream::new(rx);
- Ok(Box::pin(RecordBatchStreamAdapter::new(schema, stream)))
-}
-</code></pre>
-<p>This pattern keeps the async runtime responsive while your data
generation
-runs on its own threads.</p>
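To see the shape of this pattern without pulling in tokio or Arrow, here is a stdlib-only sketch. Plain threads and a bounded `std::sync::mpsc` channel stand in for `spawn_blocking` and tokio's channel, and `Batch` is a toy stand-in for `RecordBatch`; all names are hypothetical:

```rust
use std::sync::mpsc;
use std::thread;

// Toy stand-in for a RecordBatch: one batch of row values.
type Batch = Vec<i64>;

/// Run batch generation on its own thread and hand results to the
/// consumer through a bounded channel. `sync_channel(2)` gives the
/// same backpressure as `tokio::sync::mpsc::channel(2)` above: the
/// producer blocks once two batches are waiting.
fn produce_batches(num_batches: usize, rows_per_batch: usize) -> mpsc::Receiver<Batch> {
    let (tx, rx) = mpsc::sync_channel(2);
    thread::spawn(move || {
        for b in 0..num_batches {
            let start = (b * rows_per_batch) as i64;
            let batch: Batch = (start..start + rows_per_batch as i64).collect();
            if tx.send(batch).is_err() {
                break; // Receiver dropped: the query was cancelled.
            }
        }
    });
    rx
}

fn main() {
    let rx = produce_batches(3, 4);
    let rows: usize = rx.iter().map(|batch| batch.len()).sum();
    println!("consumed {rows} rows"); // 3 batches x 4 rows = 12
}
```

The consumer simply iterates the receiver; the generator thread pauses whenever the channel is full, which is the same flow control the tokio version provides.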
-<h2 id="where-should-the-work-happen">Where Should the Work Happen?<a
class="headerlink" href="#where-should-the-work-happen" title="Permanent
link">¶</a></h2>
-<hr/>
-<p>This table summarizes what belongs at each layer:</p>
-<table class="table">
-<thead>
-<tr>
-<th>Layer</th>
-<th>Runs During</th>
-<th>Should Do</th>
-<th>Should NOT Do</th>
-</tr>
-</thead>
-<tbody>
-<tr>
-<td><code>TableProvider::scan()</code></td>
-<td>Planning</td>
-<td>Build an <code>ExecutionPlan</code> with
metadata</td>
-<td>I/O, network calls, heavy computation</td>
-</tr>
-<tr>
-<td><code>ExecutionPlan::execute()</code></td>
-<td>Execution (once per partition)</td>
-<td>Construct a stream, set up channels</td>
-<td>Block on async work, read data</td>
-</tr>
-<tr>
-<td><code>RecordBatchStream</code> (polling)</td>
-<td>Execution</td>
-<td>All I/O, computation, data production</td>
-<td>--</td>
-</tr>
-</tbody>
-</table>
-<p>The guiding principle: <strong>push work as late as
possible.</strong> Planning should be
-fast so the optimizer can do its job. Execution setup should be fast so all
-partitions can start promptly. The stream is where you spend time producing
-data.</p>
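The layering can be modeled with plain iterators. The following stdlib-only sketch (all names hypothetical, no DataFusion types) shows planning returning only metadata plus deferred work, execution setup constructing the "stream", and data materializing only when that stream is pulled:

```rust
/// Toy model of the three layers: the plan carries metadata plus a
/// closure holding the deferred work.
struct Plan {
    rows: usize, // metadata, known at planning time
    produce: Box<dyn Fn() -> Box<dyn Iterator<Item = i64>>>,
}

/// Planning: cheap, no data generated.
fn plan_scan(rows: usize) -> Plan {
    Plan {
        rows,
        produce: Box::new(move || {
            Box::new(0..rows as i64) as Box<dyn Iterator<Item = i64>>
        }),
    }
}

/// Execution setup: still cheap; just constructs the iterator.
fn execute(plan: &Plan) -> Box<dyn Iterator<Item = i64>> {
    (plan.produce)()
}

fn main() {
    let plan = plan_scan(1_000);
    // Only now, when the iterator is consumed, does work happen --
    // and pulling 10 rows never generates the other 990.
    let first_ten: Vec<i64> = execute(&plan).take(10).collect();
    println!("{} rows planned, {} pulled", plan.rows, first_ten.len());
}
```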
-<h3 id="why-this-matters">Why This Matters<a class="headerlink"
href="#why-this-matters" title="Permanent link">¶</a></h3>
-<p>When <code>scan()</code> does heavy work, several
problems arise:</p>
-<ol>
-<li><strong>Planning becomes slow.</strong> If a query
touches 10 tables and each <code>scan()</code>
- takes 500ms, planning alone takes 5 seconds before any data
flows.</li>
-<li><strong>The optimizer cannot help.</strong> The
optimizer runs between planning and
- execution. If you have already fetched data during planning, optimizations
- like predicate pushdown or partition pruning cannot reduce the
work.</li>
-<li><strong>Resource management breaks down.</strong>
DataFusion manages concurrency and
- memory during execution. Work done during planning bypasses these
controls.</li>
-</ol>
-<h2 id="filter-pushdown-doing-less-work">Filter Pushdown: Doing Less
Work<a class="headerlink" href="#filter-pushdown-doing-less-work"
title="Permanent link">¶</a></h2>
-<hr/>
-<p>One of the most impactful optimizations you can add to a custom table
provider
-is <strong>filter pushdown</strong> -- letting the source skip
data that the query does not
-need, rather than reading everything and filtering it afterward.</p>
-<h3 id="how-filter-pushdown-works">How Filter Pushdown Works<a
class="headerlink" href="#how-filter-pushdown-works" title="Permanent
link">¶</a></h3>
-<p>When DataFusion plans a query with a <code>WHERE</code>
clause, it passes the filter
-predicates to your <code>scan()</code> method as the
<code>filters</code> parameter. By default,
-DataFusion assumes your provider cannot handle any filters and inserts a
-<code>FilterExec</code> node above your scan to apply them. But if
your source <em>can</em>
-evaluate some predicates during scanning -- for example, by skipping files,
-partitions, or row groups that cannot match -- you can eliminate a huge amount
-of unnecessary I/O.</p>
-<p>To opt in, implement
<code>supports_filters_pushdown</code>:</p>
-<pre><code class="language-rust">fn supports_filters_pushdown(
- &amp;self,
- filters: &amp;[&amp;Expr],
-) -&gt;
Result&lt;Vec&lt;TableProviderFilterPushDown&gt;&gt; {
- Ok(filters.iter().map(|f| {
- match f {
- // We can fully evaluate equality filters on
- // the partition column at the source
- Expr::BinaryExpr(BinaryExpr {
- left, op: Operator::Eq, right
- }) if is_partition_column(left) || is_partition_column(right)
=&gt; {
- TableProviderFilterPushDown::Exact
- }
- // All other filters: let DataFusion handle them
- _ =&gt; TableProviderFilterPushDown::Unsupported,
- }
- }).collect())
-}
-</code></pre>
-<p>The three possible responses for each filter are:</p>
-<ul>
-<li><strong><code>Exact</code></strong> -- Your
source guarantees that no output rows will have a false
- value for this predicate. Because the filter is fully evaluated at the
source,
- DataFusion will <strong>not</strong> add a
<code>FilterExec</code> for it.</li>
-<li><strong><code>Inexact</code></strong> --
Your source has the ability to reduce the data produced, but
- the output may still include rows that do not satisfy the predicate. For
- example, you might skip entire files based on metadata statistics but not
- filter individual rows within a file. DataFusion will still add a
<code>FilterExec</code>
- above your scan to remove any remaining rows that slipped through.</li>
-<li><strong><code>Unsupported</code></strong> --
Your source ignores this filter entirely. DataFusion
- handles it.</li>
-</ul>
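As a stdlib-only illustration of how these responses drive pruning (toy `Filter` and partition types, no DataFusion involved), a provider might classify filters and then skip non-matching partitions like this:

```rust
use std::collections::BTreeMap;

/// Simplified filter language: equality on the partition column,
/// or anything else.
enum Filter {
    RegionEq(&'static str),
    Other(&'static str),
}

#[derive(Debug, PartialEq)]
enum PushDown {
    Exact,
    Unsupported,
}

/// The provider's answer for each filter, mirroring the shape of
/// supports_filters_pushdown above.
fn classify(filters: &[Filter]) -> Vec<PushDown> {
    filters
        .iter()
        .map(|f| match f {
            Filter::RegionEq(_) => PushDown::Exact,    // answered from partition metadata
            Filter::Other(_) => PushDown::Unsupported, // left to FilterExec
        })
        .collect()
}

/// Partition pruning during scan: keep only partitions that can
/// satisfy every Exact filter; Unsupported filters are ignored here
/// because DataFusion re-applies them downstream.
fn prune<'a>(
    partitions: &'a BTreeMap<&'static str, Vec<i64>>,
    filters: &[Filter],
) -> Vec<&'a Vec<i64>> {
    partitions
        .iter()
        .filter(|(region, _)| {
            filters.iter().all(|f| match f {
                Filter::RegionEq(r) => *region == r,
                Filter::Other(_) => true,
            })
        })
        .map(|(_, rows)| rows)
        .collect()
}

fn main() {
    let mut partitions = BTreeMap::new();
    partitions.insert("us-east-1", vec![1, 2, 3]);
    partitions.insert("eu-west-1", vec![4, 5]);

    let filters = [
        Filter::RegionEq("us-east-1"),
        Filter::Other("event_type = 'click'"),
    ];

    // Only the us-east-1 partition is read; event_type is left to FilterExec.
    println!("{:?}", classify(&filters));
    println!("partitions read: {}", prune(&partitions, &filters).len()); // 1
}
```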
-<h3 id="why-filter-pushdown-matters">Why Filter Pushdown Matters<a
class="headerlink" href="#why-filter-pushdown-matters" title="Permanent
link">¶</a></h3>
-<p>Consider a table with 1 billion rows partitioned by
<code>region</code>, and a query:</p>
-<pre><code class="language-sql">SELECT * FROM events WHERE region
= 'us-east-1' AND event_type = 'click';
-</code></pre>
-<p><strong>Without filter pushdown:</strong> Your table
provider reads all 1 billion rows
-across all regions. DataFusion then applies both filters, discarding the vast
-majority of the data.</p>
-<p><strong>With filter pushdown on
<code>region</code>:</strong> Your
<code>scan()</code> method sees the
-<code>region = 'us-east-1'</code> filter and constructs an
execution plan that only reads
-the <code>us-east-1</code> partition. If that partition holds 100
million rows, you have
-just eliminated 90% of the I/O. DataFusion still applies the
<code>event_type</code>
-filter via <code>FilterExec</code> if you reported it as
<code>Unsupported</code>.</p>
-<h3 id="using-explain-to-debug-your-table-provider">Using EXPLAIN to
Debug Your Table Provider<a class="headerlink"
href="#using-explain-to-debug-your-table-provider" title="Permanent
link">¶</a></h3>
-<p>The <code>EXPLAIN</code> statement is your best tool for
understanding what DataFusion is
-actually doing with your table provider. It shows the physical plan that
-DataFusion will execute, including any operators it inserted:</p>
-<pre><code class="language-sql">EXPLAIN SELECT * FROM events WHERE
region = 'us-east-1' AND event_type = 'click';
-</code></pre>
-<p>If you are using DataFrames, call <code>.explain(false, false)</code> to print
-the logical and physical plans. The first argument enables verbose output, and the
-second executes the query and reports runtime metrics, equivalent to
-<code>EXPLAIN ANALYZE</code>.</p>
-<p><strong>Before filter pushdown</strong>, the plan might
look like:</p>
-<pre><code class="language-text">FilterExec: region@0 = us-east-1
AND event_type@1 = click
- MyExecPlan: partitions=50
-</code></pre>
-<p>Here DataFusion is reading all 50 partitions and filtering everything
-afterward. The <code>FilterExec</code> above your scan is doing
all the predicate work.</p>
-<p><strong>After implementing pushdown for
<code>region</code></strong> (reported as
<code>Exact</code>):</p>
-<pre><code class="language-text">FilterExec: event_type@1 = click
- MyExecPlan: partitions=5, filter=[region = us-east-1]
-</code></pre>
-<p>Now your exec reads only the 5 partitions for
<code>us-east-1</code>, and the remaining
-<code>FilterExec</code> only handles the
<code>event_type</code> predicate. The
<code>region</code> filter has
-been fully absorbed by your scan.</p>
-<p><strong>After implementing pushdown for both
filters</strong> (both <code>Exact</code>):</p>
-<pre><code class="language-text">MyExecPlan: partitions=5,
filter=[region = us-east-1 AND event_type = click]
-</code></pre>
-<p>No <code>FilterExec</code> at all -- your source handles
everything.</p>
-<p>Similarly, <code>EXPLAIN</code> will reveal whether
DataFusion is inserting unnecessary
-<code>SortExec</code> or <code>RepartitionExec</code>
nodes that you could eliminate by declaring
-better output properties. Whenever your queries seem slower than expected,
-<code>EXPLAIN</code> is the first place to look.</p>
-<h2 id="putting-it-all-together">Putting It All Together<a
class="headerlink" href="#putting-it-all-together" title="Permanent
link">¶</a></h2>
-<hr/>
-<p>Here is a minimal but complete example of a custom table provider
that generates
-data lazily during streaming:</p>
-<pre><code class="language-rust">use std::any::Any;
-use std::sync::Arc;
-
-use arrow::array::{Int64Array, StringArray};
-use arrow::datatypes::{DataType, Field, Schema, SchemaRef};
-use arrow::record_batch::RecordBatch;
-use datafusion::catalog::TableProvider;
-use datafusion::common::Result;
-use datafusion::datasource::TableType;
-use datafusion::catalog::Session;
-use datafusion::execution::{SendableRecordBatchStream, TaskContext};
-use datafusion::logical_expr::Expr;
-use datafusion::physical_expr::EquivalenceProperties;
-use datafusion::physical_plan::execution_plan::{Boundedness, EmissionType};
-use datafusion::physical_plan::stream::RecordBatchStreamAdapter;
-use datafusion::physical_plan::{
-    DisplayAs, DisplayFormatType, ExecutionPlan, Partitioning, PlanProperties,
-};
-use futures::stream;
-
-/// A table provider that generates sequential numbers on demand.
-struct CountingTable {
- schema: SchemaRef,
- num_partitions: usize,
- rows_per_partition: usize,
-}
-
-impl CountingTable {
- fn new(num_partitions: usize, rows_per_partition: usize) -&gt; Self {
- let schema = Arc::new(Schema::new(vec![
- Field::new("partition", DataType::Int64, false),
- Field::new("value", DataType::Int64, false),
- ]));
- Self { schema, num_partitions, rows_per_partition }
- }
-}
-
-#[async_trait::async_trait]
-impl TableProvider for CountingTable {
- fn as_any(&amp;self) -&gt; &amp;dyn Any { self }
- fn schema(&amp;self) -&gt; SchemaRef {
Arc::clone(&amp;self.schema) }
- fn table_type(&amp;self) -&gt; TableType { TableType::Base }
-
- async fn scan(
- &amp;self,
- _state: &amp;dyn Session,
- projection: Option&lt;&amp;Vec&lt;usize&gt;&gt;,
- _filters: &amp;[Expr],
- limit: Option&lt;usize&gt;,
- ) -&gt; Result&lt;Arc&lt;dyn ExecutionPlan&gt;&gt; {
- // Light work only: build the plan with metadata
- Ok(Arc::new(CountingExec {
- schema: Arc::clone(&amp;self.schema),
- num_partitions: self.num_partitions,
- rows_per_partition: limit
- .unwrap_or(self.rows_per_partition)
- .min(self.rows_per_partition),
- properties: PlanProperties::new(
- EquivalenceProperties::new(Arc::clone(&amp;self.schema)),
- Partitioning::UnknownPartitioning(self.num_partitions),
- EmissionType::Incremental,
- Boundedness::Bounded,
- ),
- }))
- }
-}
-
-struct CountingExec {
- schema: SchemaRef,
- num_partitions: usize,
- rows_per_partition: usize,
- properties: PlanProperties,
-}
-
-// ExecutionPlan requires a DisplayAs implementation so EXPLAIN can render the node
-impl DisplayAs for CountingExec {
-    fn fmt_as(
-        &amp;self,
-        _t: DisplayFormatType,
-        f: &amp;mut std::fmt::Formatter,
-    ) -&gt; std::fmt::Result {
-        write!(f, "CountingExec")
-    }
-}
-
-impl ExecutionPlan for CountingExec {
- fn name(&amp;self) -&gt; &amp;str { "CountingExec" }
- fn as_any(&amp;self) -&gt; &amp;dyn Any { self }
- fn properties(&amp;self) -&gt; &amp;PlanProperties {
&amp;self.properties }
- fn children(&amp;self) -&gt; Vec&lt;&amp;Arc&lt;dyn
ExecutionPlan&gt;&gt; { vec![] }
-
- fn with_new_children(
- self: Arc&lt;Self&gt;,
- _children: Vec&lt;Arc&lt;dyn ExecutionPlan&gt;&gt;,
- ) -&gt; Result&lt;Arc&lt;dyn ExecutionPlan&gt;&gt; {
- Ok(self)
- }
-
- fn execute(
- &amp;self,
- partition: usize,
- _context: Arc&lt;TaskContext&gt;,
- ) -&gt; Result&lt;SendableRecordBatchStream&gt; {
- let schema = Arc::clone(&amp;self.schema);
- let rows = self.rows_per_partition;
-
- // The heavy work (data generation) happens inside the stream,
- // not here in execute().
- let batch_stream = stream::once(async move {
- let partitions = Int64Array::from(
- vec![partition as i64; rows],
- );
- let values = Int64Array::from(
- (0..rows as
i64).collect::&lt;Vec&lt;_&gt;&gt;(),
- );
- let batch = RecordBatch::try_new(
- Arc::clone(&amp;schema),
- vec![Arc::new(partitions), Arc::new(values)],
- )?;
- Ok(batch)
- });
-
- Ok(Box::pin(RecordBatchStreamAdapter::new(
- Arc::clone(&amp;self.schema),
- batch_stream,
- )))
- }
-}
-</code></pre>
-<h2 id="choosing-the-right-starting-point">Choosing the Right Starting
Point<a class="headerlink" href="#choosing-the-right-starting-point"
title="Permanent link">¶</a></h2>
-<hr/>
-<p>Not every custom data source requires implementing all three layers
from
-scratch. DataFusion provides building blocks that let you plug in at whatever
-level makes sense:</p>
-<table class="table">
-<thead>
-<tr>
-<th>If your data is...</th>
-<th>Start with</th>
-<th>You implement</th>
-</tr>
-</thead>
-<tbody>
-<tr>
-<td>Already in <code>RecordBatch</code>es in
memory</td>
-<td><code>MemTable</code></td>
-<td>Nothing -- just construct it</td>
-</tr>
-<tr>
-<td>An async stream of batches</td>
-<td><code>StreamTable</code></td>
-<td>A stream factory</td>
-</tr>
-<tr>
-<td>A table with known sort order</td>
-<td><code>SortedTableProvider</code> wrapping another provider</td>
-<td>The inner provider</td>
-</tr>
-<tr>
-<td>A custom source needing full control</td>
-<td><code>TableProvider</code> +
<code>ExecutionPlan</code> + stream</td>
-<td>All three layers</td>
-</tr>
-</tbody>
-</table>
-<p>For most integrations, <code>StreamTable</code> combined with
-<code>RecordBatchStreamAdapter</code> provides a good balance of simplicity and
-flexibility. You provide a closure that returns a stream, and DataFusion handles
-the rest.</p>
-<h2 id="further-reading">Further Reading<a class="headerlink"
href="#further-reading" title="Permanent link">¶</a></h2>
-<hr/>
-<ul>
-<li><code>TableProvider</code> API docs</li>
-<li><code>ExecutionPlan</code> API docs</li>
-<li><code>SendableRecordBatchStream</code> API docs</li>
-<li><a
href="https://github.com/apache/datafusion/issues/16821">GitHub issue
discussing table provider examples</a></li>
-<li><a
href="https://github.com/apache/datafusion/tree/main/datafusion-examples/examples">DataFusion
examples directory</a> --
- contains working examples including custom table providers</li>
-</ul>
-<hr/>
-<p><em>Note: Portions of this blog post were written with the
assistance of an AI agent.</em></p></content><category
term="blog"></category></entry><entry><title>Apache DataFusion Python 46.0.0
Released</title><link
href="https://datafusion.apache.org/blog/2025/03/30/datafusion-python-46.0.0"
rel="alternate"></link><published>2025-03-30T00:00:00+00:00</published><updated>2025-03-30T00:00:00+00:00</updated><author><name>timsaucer</name></author><id>tag:datafusion.a
[...]
+<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog -
timsaucer</title><link href="https://datafusion.apache.org/blog/"
rel="alternate"></link><link
href="https://datafusion.apache.org/blog/feeds/timsaucer.atom.xml"
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-03-30T00:00:00+00:00</updated><subtitle></subtitle><entry><title>Apache
DataFusion Python 46.0.0 Released</title><link
href="https://datafusion.apache.org/blog/2025/03/30/datafusio [...]
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
diff --git a/blog/feeds/timsaucer.rss.xml b/blog/feeds/timsaucer.rss.xml
index 4274e9d..22d32ef 100644
--- a/blog/feeds/timsaucer.rss.xml
+++ b/blog/feeds/timsaucer.rss.xml
@@ -1,27 +1,5 @@
<?xml version="1.0" encoding="utf-8"?>
-<rss version="2.0"><channel><title>Apache DataFusion Blog -
timsaucer</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Fri,
20 Mar 2026 00:00:00 +0000</lastBuildDate><item><title>Writing Custom Table
Providers in Apache
DataFusion</title><link>https://datafusion.apache.org/blog/2026/03/20/writing-table-providers</link><description><!--
-{% comment %}
-Licensed to the Apache Software Foundation (ASF) under one or more
-contributor license agreements. See the NOTICE file distributed with
-this work for additional information regarding copyright ownership.
-The ASF licenses this file to you under the Apache License, Version 2.0
-(the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-{% endcomment %}
--->
-
-<p>One of DataFusion's greatest strengths is its extensibility. If your
data lives
-in a custom format, behind an API, or in a system that DataFusion does not
-natively support, you can teach DataFusion to read it by implementing a
-<strong>custom table provider</strong>. This post walks through
the three layers you …</p></description><dc:creator
xmlns:dc="http://purl.org/dc/elements/1.1/">timsaucer</dc:creator><pubDate>Fri,
20 Mar 2026 00:00:00 +0000</pubDate><guid
isPermaLink="false">tag:datafusion.apache.org,2026-03-20:/blog/2026/03/20/writing-table-providers</guid><category>blog</category></item><item><title>Apache
DataFusion Python 46.0.0
Released</title><link>https://datafusion.apache.org/blo [...]
+<rss version="2.0"><channel><title>Apache DataFusion Blog -
timsaucer</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Sun,
30 Mar 2025 00:00:00 +0000</lastBuildDate><item><title>Apache DataFusion
Python 46.0.0
Released</title><link>https://datafusion.apache.org/blog/2025/03/30/datafusion-python-46.0.0</link><description><!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
diff --git a/blog/index.html b/blog/index.html
index b41526b..27cc6d9 100644
--- a/blog/index.html
+++ b/blog/index.html
@@ -117,7 +117,7 @@ figcaption {
<header>
<div class="title">
<h1><a
href="/blog/2026/03/20/writing-table-providers">Writing Custom Table Providers
in Apache DataFusion</a></h1>
- <p>Posted on: Fri 20 March 2026 by timsaucer</p>
+ <p>Posted on: Fri 20 March 2026 by Tim Saucer
(rerun.io)</p>
<p><!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]