This is an automated email from the ASF dual-hosted git repository. github-bot pushed a commit to branch asf-staging in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
The following commit(s) were added to refs/heads/asf-staging by this push: new 7482f2d Commit build products 7482f2d is described below commit 7482f2d58a8618ce60a865c9d19a4ad607068d3e Author: Build Pelican (action) <priv...@infra.apache.org> AuthorDate: Tue Aug 12 13:53:56 2025 +0000 Commit build products --- blog/2025/08/15/external-parquet-indexes/index.html | 17 ++++++++++++----- blog/author/andrew-lamb-influxdata.html | 1 + blog/category/blog.html | 1 + blog/feed.xml | 1 + blog/feeds/all-en.atom.xml | 18 +++++++++++++----- blog/feeds/andrew-lamb-influxdata.atom.xml | 18 +++++++++++++----- blog/feeds/andrew-lamb-influxdata.rss.xml | 1 + blog/feeds/blog.atom.xml | 18 +++++++++++++----- blog/index.html | 1 + 9 files changed, 56 insertions(+), 20 deletions(-) diff --git a/blog/2025/08/15/external-parquet-indexes/index.html b/blog/2025/08/15/external-parquet-indexes/index.html index 5617a0e..bd9d8a8 100644 --- a/blog/2025/08/15/external-parquet-indexes/index.html +++ b/blog/2025/08/15/external-parquet-indexes/index.html @@ -61,6 +61,7 @@ See the License for the specific language governing permissions and limitations under the License. {% endcomment %} --> +<!-- diagrams source https://docs.google.com/presentation/d/1e_Z_F8nt2rcvlNvhU11khF5lzJJVqNtqtyJ-G3mp4-Q --> <p>It is a common misconception that <a href="https://parquet.apache.org/">Apache Parquet</a> requires (slow) reparsing of metadata and is limited to indexing structures provided by the format. In fact, caching parsed metadata and using custom external indexes along with @@ -243,22 +244,28 @@ Please refer to the <a href="https://datafusion.apache.org/blog/2025/03/21/parqu indexes, as described in the next sections.</strong></p> <h2>Pruning Files with External Indexes</h2> <p>The first step in hierarchical pruning is quickly ruling out files that cannot -match the query. For example, if a system expects to have see queries that +match the query. 
For example, if a system expects to see queries that apply to a time range, it might create an external index to store the minimum and maximum <code>time</code> values for each file. Then, during query processing, the -system can quickly rule out files that cannot possibly contain relevant data. -For example, if the user issues a query that only matches the last 7 days of +system can quickly rule out files that cannot possibly contain relevant data.</p> +<p>For example, if the user issues a query that only matches the last 7 days of data:</p> <pre><code class="language-sql">WHERE time > now() - interval '7 days' </code></pre> <p>The index can quickly rule out files that only have data older than 7 days.</p> -<!-- TODO update the diagram to match the example above -- and have time predicates --> <div class="text-center"> <img alt="Data Skipping: Pruning Files." class="img-responsive" src="/blog/images/external-parquet-indexes/prune-files.png" width="80%"/> </div> <p><strong>Figure 6</strong>: Step 1: File Pruning. Given a query predicate, systems use external indexes to quickly rule out files that cannot match the query. In this case, by consulting the index all but two files can be ruled out.</p> +<p>External indexes offer much faster lookups and lower I/O overhead than Parquet's +built-in file-level indexes by skipping further processing for many data files. +Without an external index, systems typically fall back to reading each file's +footer to find files needed for further processing. 
Skipping per-file processing +is especially important when reading from remote object stores such as <a href="https://aws.amazon.com/s3/">S3</a>, +<a href="https://cloud.google.com/storage">GCS</a> or <a href="https://azure.microsoft.com/en-us/services/storage/blobs/">Azure Blob Store</a>, where each request adds [tens to hundreds of +milliseconds of latency].</p> <p>There are many different systems that use external indexes to find files such as <a href="https://cwiki.apache.org/confluence/display/Hive/Design#Design-Metastore">Hive Metadata Store</a>, <a href="https://iceberg.apache.org/">Iceberg</a>, @@ -581,7 +588,7 @@ execution works, help document or improve the DataFusion codebase, or just try it out, we would love for you to join us.</p> <h3>Footnotes</h3> <p><a id="footnote1"></a><code>1</code>: This trend is described in more detail in the <a href="https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/">FDAP Stack</a> blog</p> -<p><a id="footnote2"></a><code>2</code>: This layout is referred to a <a href="https://www.vldb.org/conf/2001/P169.pdf">PAX in the +<p><a id="footnote2"></a><code>2</code>: This layout is referred to as <a href="https://www.vldb.org/conf/2001/P169.pdf">PAX in the database literature</a> after the first research paper to describe the technique.</p> <p><a id="footnote3"></a><code>3</code>: Benchmaxxing (verb): to add specific optimizations that only impact benchmark results and are not widely applicable to real world use cases.</p> diff --git a/blog/author/andrew-lamb-influxdata.html b/blog/author/andrew-lamb-influxdata.html index 63a6f3e..eaa8d31 100644 --- a/blog/author/andrew-lamb-influxdata.html +++ b/blog/author/andrew-lamb-influxdata.html @@ -46,6 +46,7 @@ See the License for the specific language governing permissions and limitations under the License. 
{% endcomment %} --> +<!-- diagrams source https://docs.google.com/presentation/d/1e_Z_F8nt2rcvlNvhU11khF5lzJJVqNtqtyJ-G3mp4-Q --> <p>It is a common misconception that <a href="https://parquet.apache.org/">Apache Parquet</a> requires (slow) reparsing of metadata and is limited to indexing structures provided by the format. In fact, caching parsed metadata and using custom external indexes along with diff --git a/blog/category/blog.html b/blog/category/blog.html index 9bd7bbb..09687d8 100644 --- a/blog/category/blog.html +++ b/blog/category/blog.html @@ -47,6 +47,7 @@ See the License for the specific language governing permissions and limitations under the License. {% endcomment %} --> +<!-- diagrams source https://docs.google.com/presentation/d/1e_Z_F8nt2rcvlNvhU11khF5lzJJVqNtqtyJ-G3mp4-Q --> <p>It is a common misconception that <a href="https://parquet.apache.org/">Apache Parquet</a> requires (slow) reparsing of metadata and is limited to indexing structures provided by the format. In fact, caching parsed metadata and using custom external indexes along with diff --git a/blog/feed.xml b/blog/feed.xml index 200a45e..0fe45fc 100644 --- a/blog/feed.xml +++ b/blog/feed.xml @@ -17,6 +17,7 @@ See the License for the specific language governing permissions and limitations under the License. {% endcomment %} --> +<!-- diagrams source https://docs.google.com/presentation/d/1e_Z_F8nt2rcvlNvhU11khF5lzJJVqNtqtyJ-G3mp4-Q --> <p>It is a common misconception that <a href="https://parquet.apache.org/">Apache Parquet</a> requires (slow) reparsing of metadata and is limited to indexing structures provided by the format. In fact, caching parsed metadata and using custom external indexes along with diff --git a/blog/feeds/all-en.atom.xml b/blog/feeds/all-en.atom.xml index 9e861ec..caefa20 100644 --- a/blog/feeds/all-en.atom.xml +++ b/blog/feeds/all-en.atom.xml @@ -17,6 +17,7 @@ See the License for the specific language governing permissions and limitations under the License. 
{% endcomment %} --> +<!-- diagrams source https://docs.google.com/presentation/d/1e_Z_F8nt2rcvlNvhU11khF5lzJJVqNtqtyJ-G3mp4-Q --> <p>It is a common misconception that <a href="https://parquet.apache.org/">Apache Parquet</a> requires (slow) reparsing of metadata and is limited to indexing structures provided by the format. In fact, caching parsed metadata and using custom external indexes along with @@ -40,6 +41,7 @@ See the License for the specific language governing permissions and limitations under the License. {% endcomment %} --> +<!-- diagrams source https://docs.google.com/presentation/d/1e_Z_F8nt2rcvlNvhU11khF5lzJJVqNtqtyJ-G3mp4-Q --> <p>It is a common misconception that <a href="https://parquet.apache.org/">Apache Parquet</a> requires (slow) reparsing of metadata and is limited to indexing structures provided by the format. In fact, caching parsed metadata and using custom external indexes along with @@ -222,22 +224,28 @@ Please refer to the <a href="https://datafusion.apache.org/blog/2025/03/21/pa indexes, as described in the next sections.</strong></p> <h2>Pruning Files with External Indexes</h2> <p>The first step in hierarchical pruning is quickly ruling out files that cannot -match the query. For example, if a system expects to have see queries that +match the query. For example, if a system expects to see queries that apply to a time range, it might create an external index to store the minimum and maximum <code>time</code> values for each file. Then, during query processing, the -system can quickly rule out files that cannot possibly contain relevant data. 
-For example, if the user issues a query that only matches the last 7 days of +system can quickly rule out files that cannot possibly contain relevant data.</p> +<p>For example, if the user issues a query that only matches the last 7 days of data:</p> <pre><code class="language-sql">WHERE time &gt; now() - interval '7 days' </code></pre> <p>The index can quickly rule out files that only have data older than 7 days.</p> -<!-- TODO update the diagram to match the example above -- and have time predicates --> <div class="text-center"> <img alt="Data Skipping: Pruning Files." class="img-responsive" src="/blog/images/external-parquet-indexes/prune-files.png" width="80%"/> </div> <p><strong>Figure 6</strong>: Step 1: File Pruning. Given a query predicate, systems use external indexes to quickly rule out files that cannot match the query. In this case, by consulting the index all but two files can be ruled out.</p> +<p>External indexes offer much faster lookups and lower I/O overhead than Parquet's +built-in file-level indexes by skipping further processing for many data files. +Without an external index, systems typically fall back to reading each file's +footer to find files needed for further processing. 
Skipping per-file processing +is especially important when reading from remote object stores such as <a href="https://aws.amazon.com/s3/">S3</a>, +<a href="https://cloud.google.com/storage">GCS</a> or <a href="https://azure.microsoft.com/en-us/services/storage/blobs/">Azure Blob Store</a>, where each request adds [tens to hundreds of +milliseconds of latency].</p> <p>There are many different systems that use external indexes to find files such as <a href="https://cwiki.apache.org/confluence/display/Hive/Design#Design-Metastore">Hive Metadata Store</a>, <a href="https://iceberg.apache.org/">Iceberg</a>, @@ -560,7 +568,7 @@ execution works, help document or improve the DataFusion codebase, or just try it out, we would love for you to join us.</p> <h3>Footnotes</h3> <p><a id="footnote1"></a><code>1</code>: This trend is described in more detail in the <a href="https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/">FDAP Stack</a> blog</p> -<p><a id="footnote2"></a><code>2</code>: This layout is referred to a <a href="https://www.vldb.org/conf/2001/P169.pdf">PAX in the +<p><a id="footnote2"></a><code>2</code>: This layout is referred to as <a href="https://www.vldb.org/conf/2001/P169.pdf">PAX in the database literature</a> after the first research paper to describe the technique.</p> <p><a id="footnote3"></a><code>3</code>: Benchmaxxing (verb): to add specific optimizations that only impact benchmark results and are not widely applicable to real world use cases.</p> diff --git a/blog/feeds/andrew-lamb-influxdata.atom.xml b/blog/feeds/andrew-lamb-influxdata.atom.xml index dfded17..8826861 100644 --- a/blog/feeds/andrew-lamb-influxdata.atom.xml +++ b/blog/feeds/andrew-lamb-influxdata.atom.xml @@ -17,6 +17,7 @@ See the License for the specific language governing permissions and limitations under the License. 
{% endcomment %} --> +<!-- diagrams source https://docs.google.com/presentation/d/1e_Z_F8nt2rcvlNvhU11khF5lzJJVqNtqtyJ-G3mp4-Q --> <p>It is a common misconception that <a href="https://parquet.apache.org/">Apache Parquet</a> requires (slow) reparsing of metadata and is limited to indexing structures provided by the format. In fact, caching parsed metadata and using custom external indexes along with @@ -40,6 +41,7 @@ See the License for the specific language governing permissions and limitations under the License. {% endcomment %} --> +<!-- diagrams source https://docs.google.com/presentation/d/1e_Z_F8nt2rcvlNvhU11khF5lzJJVqNtqtyJ-G3mp4-Q --> <p>It is a common misconception that <a href="https://parquet.apache.org/">Apache Parquet</a> requires (slow) reparsing of metadata and is limited to indexing structures provided by the format. In fact, caching parsed metadata and using custom external indexes along with @@ -222,22 +224,28 @@ Please refer to the <a href="https://datafusion.apache.org/blog/2025/03/21/pa indexes, as described in the next sections.</strong></p> <h2>Pruning Files with External Indexes</h2> <p>The first step in hierarchical pruning is quickly ruling out files that cannot -match the query. For example, if a system expects to have see queries that +match the query. For example, if a system expects to see queries that apply to a time range, it might create an external index to store the minimum and maximum <code>time</code> values for each file. Then, during query processing, the -system can quickly rule out files that cannot possibly contain relevant data. 
-For example, if the user issues a query that only matches the last 7 days of +system can quickly rule out files that cannot possibly contain relevant data.</p> +<p>For example, if the user issues a query that only matches the last 7 days of data:</p> <pre><code class="language-sql">WHERE time &gt; now() - interval '7 days' </code></pre> <p>The index can quickly rule out files that only have data older than 7 days.</p> -<!-- TODO update the diagram to match the example above -- and have time predicates --> <div class="text-center"> <img alt="Data Skipping: Pruning Files." class="img-responsive" src="/blog/images/external-parquet-indexes/prune-files.png" width="80%"/> </div> <p><strong>Figure 6</strong>: Step 1: File Pruning. Given a query predicate, systems use external indexes to quickly rule out files that cannot match the query. In this case, by consulting the index all but two files can be ruled out.</p> +<p>External indexes offer much faster lookups and lower I/O overhead than Parquet's +built-in file-level indexes by skipping further processing for many data files. +Without an external index, systems typically fall back to reading each file's +footer to find files needed for further processing. 
Skipping per-file processing +is especially important when reading from remote object stores such as <a href="https://aws.amazon.com/s3/">S3</a>, +<a href="https://cloud.google.com/storage">GCS</a> or <a href="https://azure.microsoft.com/en-us/services/storage/blobs/">Azure Blob Store</a>, where each request adds [tens to hundreds of +milliseconds of latency].</p> <p>There are many different systems that use external indexes to find files such as <a href="https://cwiki.apache.org/confluence/display/Hive/Design#Design-Metastore">Hive Metadata Store</a>, <a href="https://iceberg.apache.org/">Iceberg</a>, @@ -560,7 +568,7 @@ execution works, help document or improve the DataFusion codebase, or just try it out, we would love for you to join us.</p> <h3>Footnotes</h3> <p><a id="footnote1"></a><code>1</code>: This trend is described in more detail in the <a href="https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/">FDAP Stack</a> blog</p> -<p><a id="footnote2"></a><code>2</code>: This layout is referred to a <a href="https://www.vldb.org/conf/2001/P169.pdf">PAX in the +<p><a id="footnote2"></a><code>2</code>: This layout is referred to as <a href="https://www.vldb.org/conf/2001/P169.pdf">PAX in the database literature</a> after the first research paper to describe the technique.</p> <p><a id="footnote3"></a><code>3</code>: Benchmaxxing (verb): to add specific optimizations that only impact benchmark results and are not widely applicable to real world use cases.</p> diff --git a/blog/feeds/andrew-lamb-influxdata.rss.xml b/blog/feeds/andrew-lamb-influxdata.rss.xml index aab07a4..1529912 100644 --- a/blog/feeds/andrew-lamb-influxdata.rss.xml +++ b/blog/feeds/andrew-lamb-influxdata.rss.xml @@ -17,6 +17,7 @@ See the License for the specific language governing permissions and limitations under the License. 
{% endcomment %} --> +<!-- diagrams source https://docs.google.com/presentation/d/1e_Z_F8nt2rcvlNvhU11khF5lzJJVqNtqtyJ-G3mp4-Q --> <p>It is a common misconception that <a href="https://parquet.apache.org/">Apache Parquet</a> requires (slow) reparsing of metadata and is limited to indexing structures provided by the format. In fact, caching parsed metadata and using custom external indexes along with diff --git a/blog/feeds/blog.atom.xml b/blog/feeds/blog.atom.xml index a582648..3abee99 100644 --- a/blog/feeds/blog.atom.xml +++ b/blog/feeds/blog.atom.xml @@ -17,6 +17,7 @@ See the License for the specific language governing permissions and limitations under the License. {% endcomment %} --> +<!-- diagrams source https://docs.google.com/presentation/d/1e_Z_F8nt2rcvlNvhU11khF5lzJJVqNtqtyJ-G3mp4-Q --> <p>It is a common misconception that <a href="https://parquet.apache.org/">Apache Parquet</a> requires (slow) reparsing of metadata and is limited to indexing structures provided by the format. In fact, caching parsed metadata and using custom external indexes along with @@ -40,6 +41,7 @@ See the License for the specific language governing permissions and limitations under the License. {% endcomment %} --> +<!-- diagrams source https://docs.google.com/presentation/d/1e_Z_F8nt2rcvlNvhU11khF5lzJJVqNtqtyJ-G3mp4-Q --> <p>It is a common misconception that <a href="https://parquet.apache.org/">Apache Parquet</a> requires (slow) reparsing of metadata and is limited to indexing structures provided by the format. In fact, caching parsed metadata and using custom external indexes along with @@ -222,22 +224,28 @@ Please refer to the <a href="https://datafusion.apache.org/blog/2025/03/21/pa indexes, as described in the next sections.</strong></p> <h2>Pruning Files with External Indexes</h2> <p>The first step in hierarchical pruning is quickly ruling out files that cannot -match the query. For example, if a system expects to have see queries that +match the query. 
For example, if a system expects to see queries that apply to a time range, it might create an external index to store the minimum and maximum <code>time</code> values for each file. Then, during query processing, the -system can quickly rule out files that cannot possibly contain relevant data. -For example, if the user issues a query that only matches the last 7 days of +system can quickly rule out files that cannot possibly contain relevant data.</p> +<p>For example, if the user issues a query that only matches the last 7 days of data:</p> <pre><code class="language-sql">WHERE time &gt; now() - interval '7 days' </code></pre> <p>The index can quickly rule out files that only have data older than 7 days.</p> -<!-- TODO update the diagram to match the example above -- and have time predicates --> <div class="text-center"> <img alt="Data Skipping: Pruning Files." class="img-responsive" src="/blog/images/external-parquet-indexes/prune-files.png" width="80%"/> </div> <p><strong>Figure 6</strong>: Step 1: File Pruning. Given a query predicate, systems use external indexes to quickly rule out files that cannot match the query. In this case, by consulting the index all but two files can be ruled out.</p> +<p>External indexes offer much faster lookups and lower I/O overhead than Parquet's +built-in file-level indexes by skipping further processing for many data files. +Without an external index, systems typically fall back to reading each file's +footer to find files needed for further processing. 
Skipping per-file processing +is especially important when reading from remote object stores such as <a href="https://aws.amazon.com/s3/">S3</a>, +<a href="https://cloud.google.com/storage">GCS</a> or <a href="https://azure.microsoft.com/en-us/services/storage/blobs/">Azure Blob Store</a>, where each request adds [tens to hundreds of +milliseconds of latency].</p> <p>There are many different systems that use external indexes to find files such as <a href="https://cwiki.apache.org/confluence/display/Hive/Design#Design-Metastore">Hive Metadata Store</a>, <a href="https://iceberg.apache.org/">Iceberg</a>, @@ -560,7 +568,7 @@ execution works, help document or improve the DataFusion codebase, or just try it out, we would love for you to join us.</p> <h3>Footnotes</h3> <p><a id="footnote1"></a><code>1</code>: This trend is described in more detail in the <a href="https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/">FDAP Stack</a> blog</p> -<p><a id="footnote2"></a><code>2</code>: This layout is referred to a <a href="https://www.vldb.org/conf/2001/P169.pdf">PAX in the +<p><a id="footnote2"></a><code>2</code>: This layout is referred to as <a href="https://www.vldb.org/conf/2001/P169.pdf">PAX in the database literature</a> after the first research paper to describe the technique.</p> <p><a id="footnote3"></a><code>3</code>: Benchmaxxing (verb): to add specific optimizations that only impact benchmark results and are not widely applicable to real world use cases.</p> diff --git a/blog/index.html b/blog/index.html index d011887..76f0ff6 100644 --- a/blog/index.html +++ b/blog/index.html @@ -70,6 +70,7 @@ See the License for the specific language governing permissions and limitations under the License. 
{% endcomment %} --> +<!-- diagrams source https://docs.google.com/presentation/d/1e_Z_F8nt2rcvlNvhU11khF5lzJJVqNtqtyJ-G3mp4-Q --> <p>It is a common misconception that <a href="https://parquet.apache.org/">Apache Parquet</a> requires (slow) reparsing of metadata and is limited to indexing structures provided by the format. In fact, caching parsed metadata and using custom external indexes along with