(datafusion-site) branch asf-site updated: Commit build products

github-bot Thu, 10 Apr 2025 11:08:39 -0700

This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new c1083e0  Commit build products
c1083e0 is described below

commit c1083e0d1a0b08c71253a8cfea7839869b73546c
Author: Build Pelican (action) <[email protected]>
AuthorDate: Thu Apr 10 15:41:29 2025 +0000

    Commit build products
---
 .../2025/04/10/fastest-tpch-generator/index.html   | 501 +++++++++++++++++++++
 .../andrew-lamb-achraf-b-and-sean-smith.html       | 116 +++++
 output/category/blog.html                          |  47 ++
 output/feed.xml                                    |  30 +-
 output/feeds/all-en.atom.xml                       | 470 ++++++++++++++++++-
 .../andrew-lamb-achraf-b-and-sean-smith.atom.xml   | 470 +++++++++++++++++++
 .../andrew-lamb-achraf-b-and-sean-smith.rss.xml    |  30 ++
 output/feeds/blog.atom.xml                         | 470 ++++++++++++++++++-
 .../images/fastest-tpch-generator/lamb-theory.png  | Bin 0 -> 300479 bytes
 .../fastest-tpch-generator/parquet-performance.png | Bin 0 -> 61946 bytes
 .../fastest-tpch-generator/tbl-performance.png     | Bin 0 -> 49477 bytes
 output/index.html                                  |  47 ++
 12 files changed, 2178 insertions(+), 3 deletions(-)

diff --git a/output/2025/04/10/fastest-tpch-generator/index.html 
b/output/2025/04/10/fastest-tpch-generator/index.html
new file mode 100644
index 0000000..e03c029
--- /dev/null
+++ b/output/2025/04/10/fastest-tpch-generator/index.html
@@ -0,0 +1,501 @@
+<!doctype html>
+<html class="no-js" lang="en" dir="ltr">
+  <head>
+    <meta charset="utf-8">
+    <meta http-equiv="x-ua-compatible" content="ie=edge">
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <title>tpchgen-rs World’s fastest open source TPC-H data generator, 
written in Rust - Apache DataFusion Blog</title>
+<link href="/blog/css/bootstrap.min.css" rel="stylesheet">
+<link href="/blog/css/fontawesome.all.min.css" rel="stylesheet">
+<link href="/blog/css/headerlink.css" rel="stylesheet">
+<link href="/blog/highlight/default.min.css" rel="stylesheet">
+<script src="/blog/highlight/highlight.js"></script>
+<script>hljs.highlightAll();</script>  </head>
+  <body class="d-flex flex-column h-100">
+  <main class="flex-shrink-0">
+<!-- nav bar -->
+<nav class="navbar navbar-expand-lg navbar-dark bg-dark" aria-label="Fifth 
navbar example">
+    <div class="container-fluid">
+        <a class="navbar-brand" href="/blog"><img 
src="/blog/images/logo_original4x.png" style="height: 32px;"/> Apache 
DataFusion Blog</a>
+        <button class="navbar-toggler" type="button" data-bs-toggle="collapse" 
data-bs-target="#navbarADP" aria-controls="navbarADP" aria-expanded="false" 
aria-label="Toggle navigation">
+            <span class="navbar-toggler-icon"></span>
+        </button>
+
+        <div class="collapse navbar-collapse" id="navbarADP">
+            <ul class="navbar-nav me-auto mb-2 mb-lg-0">
+                <li class="nav-item">
+                    <a class="nav-link" href="/blog/about.html">About</a>
+                </li>
+                <li class="nav-item">
+                    <a class="nav-link" href="/blog/feed.xml">RSS</a>
+                </li>
+            </ul>
+        </div>
+    </div>
+</nav>    
+
+
+<!-- page contents -->
+<div id="contents">
+    <div class="bg-white p-5 rounded">
+        <div class="col-sm-8 mx-auto">
+          <h1>
+              tpchgen-rs World’s fastest open source TPC-H data generator, 
written in Rust
+          </h1>
+              <p>Posted on: Thu 10 April 2025 by Andrew Lamb, Achraf B, and 
Sean Smith</p>
+              <!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+-->
+<style>
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+</style>
+<p><strong>TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+ with 
dbgen</strong>.</p>
+<p>3 members of the <a href="https://datafusion.apache.org/";>Apache 
DataFusion</a> community used Rust and open source
+development to build <a 
href="https://github.com/clflushopt/tpchgen-rs";>tpchgen-rs</a>, a fully open 
TPC-H data generator over 20x
+faster than any other implementation we know of.</p>
+<p>It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 
GB/s
+😎) on a Macbook Air M3 with 16GB of memory, compared to the classic 
<code>dbgen</code>
+which takes 30 minutes<sup>1</sup> (0.05GB/sec). On the same machine, it takes 
less than
+2 minutes to create all 3.6 GB of SF=100 in <a 
href="https://parquet.apache.org/";>Apache Parquet</a> format, which takes 44 
minutes using <a href="https://duckdb.org";>DuckDB</a>.
+It is finally convenient and efficient to run TPC-H queries locally when 
testing
+analytical engines such as DataFusion.</p>
+<p><img alt="Time to create TPC-H parquet dataset for Scale Factor  1, 10, 100 
and 1000" class="img-responsive" 
src="/blog/images/fastest-tpch-generator/parquet-performance.png" 
width="80%"/></p>
+<p><strong>Figure 1</strong>: Time to create TPC-H dataset for Scale Factor 
(see below) 1, 10,
+100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core 
GCP
+VM with 88GB of memory. For Scale Factor(SF) 100 <code>tpchgen</code> takes 1 
minute and 14 seconds and
+<a href="https://duckdb.org";>DuckDB</a> takes 17 minutes and 48 seconds. For 
SF=1000, <code>tpchgen</code> takes 10
+minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure
+DuckDB&rsquo;s time as it <a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator";>requires
 647 GB of RAM</a>, more than the 88 GB that was
+available on our test machine. The testing methodology is in the
+<a 
href="https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md";>documentation</a>.</p>
+<p>This blog explains what TPC-H is, how we ported the vintage C data 
generator to
+Rust (yes, <a 
href="https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/";>RWIR</a>)
 and optimized its performance over the course of a few weeks
+of part-time work. We began this project so we can easily generate TPC-H data 
in
+<a href="https://datafusion.apache.org/";>Apache DataFusion</a> and <a 
href="https://glaredb.com/";>GlareDB</a>.</p>
+<h1>Try it for yourself</h1>
+<p>The tool is entirely open source under the <a 
href="https://www.apache.org/licenses/LICENSE-2.0";>Apache 2.0 license</a>. 
Visit the <a href="https://github.com/clflushopt/tpchgen-rs";>tpchgen-rs 
repository</a> or try it for yourself by run the following commands after <a 
href="https://www.rust-lang.org/tools/install";>installing Rust</a>:</p>
+<pre><code class="language-shell">$ cargo install tpchgen-cli
+
+# create SF=1 in classic TBL format
+$ tpchgen-cli -s 1 
+
+# create SF=10 in Parquet
+$ tpchgen-cli -s 10 --format=parquet
+</code></pre>
+<h1>What is TPC-H / dbgen?</h1>
+<p>The popular <a href="https://www.tpc.org/tpch/";>TPC-H</a> benchmark (often 
referred to as TPCH) helps evaluate the
+performance of database systems on <a 
href="https://en.wikipedia.org/wiki/Online_analytical_processing";>OLAP</a> 
queries, the kind used to build BI
+dashboards.</p>
+<p>TPC-H has become a de facto standard for analytic systems. While there are 
<a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf";>well
+known</a> limitations as the data and queries do not well represent many real 
world
+use cases, the majority of analytic database papers and industrial systems 
still
+use TPC-H query performance benchmarks as a baseline. You will inevitably find
+multiple results for  &ldquo;<code>TPCH Performance &lt;your favorite 
database&gt;</code>&rdquo; in any
+search engine.</p>
+<p>The benchmark was created at a time when access to high performance 
analytical
+systems was not widespread, so the <a href="https://www.tpc.org/";>Transaction 
Processing Performance Council</a>
+defined a process of formal result verification. More recently, given the broad
+availability of free and open source database systems, it is common for users 
to
+run and verify TPC-H performance themselves.</p>
+<p>TPC-H simulates a business environment with eight tables: 
<code>REGION</code>, <code>NATION</code>,
+<code>SUPPLIER</code>, <code>CUSTOMER</code>, <code>PART</code>, 
<code>PARTSUPP</code>, <code>ORDERS</code>, and <code>LINEITEM</code>. These
+tables are linked by foreign keys in a normalized schema representing a supply
+chain with parts, suppliers, customers and orders. The benchmark itself is 22
+SQL queries containing joins, aggregations, and sorting operations.</p>
+<p>The queries run against data created with <code><a 
href="https://github.com/electrum/tpch-dbgen";>dbgen</a></code>, a program
+written in a pre <a href="https://en.wikipedia.org/wiki/C99";>C-99</a> dialect, 
which generates data in a format called <em>TBL</em>
+(example in Figure 2). <code>dbgen</code> creates data for each of the 8 
tables for a
+certain <em>Scale Factor</em>, commonly abbreviated as SF. Example Scale 
Factors and
+corresponding dataset sizes are shown in Table 1. There is no theoretical upper
+bound on the Scale Factor.</p>
+<pre><code 
class="language-text">103|2844|845|3|23|40177.32|0.01|0.04|N|O|1996-09-11|1996-09-18|1996-09-26|NONE|FOB|ironic
 accou|
+229|10540|801|6|29|42065.66|0.04|0.00|R|F|1994-01-14|1994-02-16|1994-01-22|NONE|FOB|uriously
 pending |
+263|2396|649|1|22|28564.58|0.06|0.08|R|F|1994-08-24|1994-06-20|1994-09-09|NONE|FOB|efully
 express fo|
+327|4172|427|2|9|9685.53|0.09|0.05|A|F|1995-05-24|1995-07-11|1995-06-05|NONE|AIR|
 asymptotes are fu|
+450|5627|393|4|40|61304.80|0.05|0.03|R|F|1995-03-20|1995-05-25|1995-04-14|NONE|RAIL|ve.
 asymptote|
+</code></pre>
+<p><strong>Figure 2</strong>: Example TBL formatted output of 
<code>dbgen</code> for the <code>LINEITEM</code> table</p>
+<table>
+<tr>
+<td><strong>Scale Factor</strong>
+</td>
+<td><strong>Data Size (TBL)</strong>
+</td>
+<td><strong>Data Size (Parquet)</strong>
+</td>
+</tr>
+<tr>
+<td>0.1
+   </td>
+<td>103 Mb
+   </td>
+<td>31 Mb
+   </td>
+</tr>
+<tr>
+<td>1
+   </td>
+<td>1 Gb
+   </td>
+<td>340 Mb
+   </td>
+</tr>
+<tr>
+<td>10
+   </td>
+<td>10 Gb
+   </td>
+<td>3.6 Gb
+   </td>
+</tr>
+<tr>
+<td>100
+   </td>
+<td>107 Gb
+   </td>
+<td>38 Gb
+   </td>
+</tr>
+<tr>
+<td>1000
+   </td>
+<td>1089 Gb
+   </td>
+<td>379 Gb
+   </td>
+</tr>
+</table>
+<p><strong>Table 1</strong>: TPC-H data set sizes at different scale factors 
for both TBL and <a href="https://parquet.apache.org/";>Apache Parquet</a>.</p>
+<h1>Why do we need a new TPC-H Data generator?</h1>
+<p>Despite the known limitations of the TPC-H benchmark, it is so well known 
that it
+is used frequently in database performance analysis. To run TPC-H, you must 
first
+load the data, using <code>dbgen</code>, which is not ideal for several 
reasons:</p>
+<ol>
+<li>You must find and compile a copy of the 15+ year old C program (for 
example <a 
href="https://github.com/electrum/tpch-dbgen";>electrum/tpch-dbgen</a>)</li>
+<li><code>dbgen</code> requires substantial time (Figure 3) and is not able to 
use more than one core.</li>
+<li>It outputs TBL format, which typically requires loading into your database 
(for example, <a 
href="https://github.com/apache/datafusion/blob/507f6b6773deac69dd9d90dbe60831f5ea5abed1/datafusion/sqllogictest/test_files/tpch/create_tables.slt.part#L24-L124";>here
 is how to do so</a> in Apache DataFusion) prior to query.</li>
+<li>The implementation makes substantial assumptions about the operating 
environment, making it difficult to extend or embed into other 
systems.<sup>2</sup></li>
+</ol>
+<p><img alt="Time to generate TPC-H data in TBL format" class="img-responsive" 
src="/blog/images/fastest-tpch-generator/tbl-performance.png" width="80%"/></p>
+<p><strong>Figure 3</strong>: Time to generate TPC-H data in TBL format. 
<code>tpchgen</code> is
+shown in blue. <code>tpchgen</code> restricted to a single core is shown in 
red. Unmodified
+<code>dbgen</code> is shown in green and <code>dbgen</code> modified to use 
<code>-O3</code> optimization level
+is shown in yellow.</p>
+<p><code>dbgen</code> is so inconvenient and takes so long that vendors often 
provide
+preloaded TPC-H data, for example <a 
href="https://docs.snowflake.com/en/user-guide/sample-data-tpch";>Snowflake 
Sample Data</a>, <a 
href="https://docs.databricks.com/aws/en/discover/databricks-datasets";>Databricks
 Sample
+datasets</a> and <a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets";>DuckDB
 Pre-Generated Data Sets</a>.</p>
+<p>In addition to pre-generated datasets, DuckDB also provides a <a 
href="https://duckdb.org/docs/stable/extensions/tpch.html";>TPC-H extension</a> 
+for generating TPC-H datasets within DuckDB. This is so much easier to use than
+the current alternatives that it leads many researchers and other thought
+leaders to use DuckDB to evaluate new ideas. For example, <a 
href="https://github.com/lmwnshn";>Wan Shen
+Lim</a> explicitly <a 
href="https://github.com/apache/datafusion/issues/14373";>mentioned the ease of 
creating the TPC-H dataset</a> as one reason
+the first student project of <a 
href="https://15799.courses.cs.cmu.edu/spring2025/";>CMU-799 Spring 2025</a> 
used DuckDB.</p>
+<p>As beneficial as the DuckDB TPC-H extension is, it is non-ideal for several 
reasons:</p>
+<ol>
+<li>Creates data in a proprietary format, which requires export to use in 
other systems.</li>
+<li>Requires significant time (e.g. 17 minutes for Scale Factor 10).</li>
+<li>Requires unnecessarily large amounts of memory (e.g. 71 GB for Scale 
Factor 10)</li>
+</ol>
+<p>The above limitations makes it impractical to generate Scale Factor 100 and
+above on laptops or standard workstations, though DuckDB offers <a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets";>pre-computed
+files</a> for larger factors<sup>3</sup>.</p>
+<h1>Why Rust?</h1>
+<p>Realistically we used Rust because we wanted to integrate the data generator
+into <a href="https://datafusion.apache.org/";>Apache DataFusion</a> and <a 
href="https://glaredb.com/";>GlareDB</a>. However, we also believe Rust is
+superior to C/C++ due to its comparable performance, but much higher programmer
+productivity (Figure 4). Productivity in this case refers to the ease of
+optimizing and adding multithreading without introducing hard to debug memory
+safety or concurrency issues.</p>
+<p>While Rust does allow unsafe access to memory (eliding bounds checking, for
+example), when required for performance, our implementation is entirely memory
+safe. The only <a 
href="https://github.com/search?q=repo%3Aclflushopt%2Ftpchgen-rs%20unsafe&amp;type=code";>unsafe</a>
 code is used to <a 
href="https://github.com/clflushopt/tpchgen-rs/blob/c651da1fc309f9cb3872cbdf71e4796904dc62c6/tpchgen/src/text.rs#L72";>skip</a>
 UTF8 validation on known ASCII
+strings.</p>
+<p><img alt="Lamb Theory on Evolution of Systems Languages" 
class="img-responsive" 
src="/blog/images/fastest-tpch-generator/lamb-theory.png" width="80%"/></p>
+<p><strong>Figure 4</strong>: Lamb Theory of System Language Evolution from <a 
href="https://midas.bu.edu/assets/slides/andrew_lamb_slides.pdf";>Boston 
University
+MiDAS Fall 2024 (Data Systems Seminar)</a>, <a 
href="https://www.youtube.com/watch?v=CpnxuBwHbUc";>recording</a>. Special
+thanks to <a href="https://x.com/KurtFehlhauer";>@KurtFehlhauer</a></p>
+<h1>How: The Journey</h1>
+<p>We did it together as a team in the open over the course of a few weeks.
+<a href="https://github.com/lmwnshn";>Wan Shen Lim</a> inspired the project by 
pointing out the benefits of <a 
href="https://github.com/apache/datafusion/issues/14373";>easy TPC-H
+dataset creation</a>  and <a 
href="https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600";>suggesting
 we check out a Java port on February 11,
+2025</a>. Achraf made <a 
href="https://github.com/clflushopt/tpchgen-rs/commit/53d3402680422a15349ece0a7ea3c3f001018ba0";>first
 commit a few days later</a> on February 16, and <a 
href="https://github.com/clflushopt/tpchgen-rs/commit/9bb386a4c55b8cf93ffac1b98f29b5da990ee79e";>Andrew
+and Sean started helping on March 8, 2025</a> and we <a 
href="https://crates.io/crates/tpchgen/0.1.0";>released version 0.1</a> on
+March 30, 2025.</p>
+<h2>Optimizing Single Threaded Performance</h2>
+<p>Archaf <a href="https://github.com/clflushopt/tpchgen-rs/pull/16";>completed 
the end to end conformance tests</a>, to ensure correctness, and
+an initial <a href="https://github.com/clflushopt/tpchgen-rs/pull/12";>cli 
check in</a> on March 15, 2025.</p>
+<p>On a Macbook Pro M3 (Nov 2023), the initial performance numbers were 
actually
+slower than the original Java implementation which was ported 😭. This 
wasn&rsquo;t
+surprising since the focus of the first version was to get a byte of byte
+compatible port, and knew about the performance shortcomings and how to 
approach
+them.</p>
+<table>
+<tr>
+<td><strong>Scale Factor</strong>
+</td>
+<td><strong>Time</strong>
+</td>
+</tr>
+<tr>
+<td>1
+   </td>
+<td>0m10.307s
+   </td>
+</tr>
+<tr>
+<td>10
+   </td>
+<td>1m26.530s
+   </td>
+</tr>
+<tr>
+<td>100
+   </td>
+<td>14m56.986s
+   </td>
+</tr>
+</table>
+<p><strong>Table 2</strong>: Performance of running <a 
href="https://github.com/clflushopt/tpchgen-rs/pull/12";>the initial 
tpchgen-cli</a>, measured with
+<code>time target/release/tpchgen-cli -s $SCALE_FACTOR</code></p>
+<p>With this strong foundation we began optimizing the code using Rust&rsquo;s 
low level
+memory management to improve performance while retaining memory safely. We 
spent
+several days obsessing over low level details and implemented a textbook like
+list of optimizations:</p>
+<ul>
+<li><a href="https://github.com/clflushopt/tpchgen-rs/pull/19";>Avoiding 
startup overhead</a>,</li>
+<li><a href="https://github.com/clflushopt/tpchgen-rs/pull/26";>not</a> <a 
href="https://github.com/clflushopt/tpchgen-rs/pull/32";>copying</a> strings 
(many more PRs as well)</li>
+<li><a href="https://github.com/clflushopt/tpchgen-rs/pull/27";>Rust&rsquo;s 
zero overhead abstractions for dates</a></li>
+<li><a href="https://github.com/clflushopt/tpchgen-rs/pull/35";>Static 
strings</a> (entirely safely with static lifetimes)</li>
+<li><a href="https://github.com/clflushopt/tpchgen-rs/pull/33";>Generics to 
avoid virtual function call overhead</a></li>
+<li><a href="https://github.com/clflushopt/tpchgen-rs/pull/62";>Moving lookups 
from runtime</a> to load time</li>
+</ul>
+<p>At the time of writing, single threaded performance is now 2.5x-2.7x faster 
than the initial version, as shown in Table 3.</p>
+<table>
+<tr>
+<td><strong>Scale Factor</strong>
+</td>
+<td><strong>Time</strong>
+</td>
+<td><strong>Times faster</strong>
+</td>
+</tr>
+<tr>
+<td>1
+   </td>
+<td>0m4.079s
+   </td>
+<td>2.5x
+   </td>
+</tr>
+<tr>
+<td>10
+   </td>
+<td>0m31.616s
+   </td>
+<td>2.7x
+   </td>
+</tr>
+<tr>
+<td>100
+   </td>
+<td>5m28.083s
+   </td>
+<td>2.7x
+   </td>
+</tr>
+</table>
+<p><strong>Table 3</strong>: Single threaded <code>tpchgen-cli</code> 
performance, measured with <code>time target/release/tpchgen-cli -s 
$SCALE_FACTOR --num-threads=1</code></p>
+<h2>Multi-threading</h2>
+<p>Then we applied <a 
href="https://doc.rust-lang.org/book/ch16-00-concurrency.html";>Rust&rsquo;s 
fearless concurrency</a> &ndash; with a single, <a 
href="https://github.com/clflushopt/tpchgen-rs/commit/ab720a70cdc80a711f4a3dda6bac05445106f499";>small
 PR</a> (272
+net new lines) we updated the same memory safe code to run with multiple 
threads
+and consume bounded memory using <a 
href="https://thenewstack.io/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/";>tokio
 for the thread scheduler</a><sup>4</sup>.</p>
+<p>As shown in Table 4, with this change, tpchgen-cli generates the full SF=100
+dataset in 32 seconds (which is 3.3 GB/sec 🤯). Further investigation reveals
+that at SF=100 our generator is actually IO bound (which is not the case for
+<code>dbgen</code> or <code>duckdb</code>) &ndash; it creates data 
<strong>faster than can be written to an SSD</strong>.
+When writing to <code>/dev/null</code> tpchgen  generates the entire dataset 
in 25 seconds
+(4 GB/s).</p>
+<table>
+<tr>
+<td><strong>Scale Factor</strong>
+</td>
+<td><strong>Time</strong>
+</td>
+<td><strong>Times faster than initial implementation</strong>
+</td>
+<td><strong>Times faster than optimized single threaded</strong>
+</td>
+</tr>
+<tr>
+<td>1
+   </td>
+<td>0m1.369s
+   </td>
+<td>7.3x
+   </td>
+<td>3x
+   </td>
+</tr>
+<tr>
+<td>10
+   </td>
+<td>0m3.828s
+   </td>
+<td>22.6x
+   </td>
+<td>8.2x
+   </td>
+</tr>
+<tr>
+<td>100
+   </td>
+<td>0m32.615s
+   </td>
+<td>27.5x
+   </td>
+<td>10x
+   </td>
+</tr>
+<tr>
+<td>100 (to /dev/null)
+   </td>
+<td>0m25.088s
+   </td>
+<td>35.7x
+   </td>
+<td>13.1x
+   </td>
+</tr>
+</table>
+<p><strong>Table 4</strong>: tpchgen-cli (multithreaded) performance measured 
with <code>time target/release/tpchgen-cli -s $SCALE_FACTOR</code></p>
+<p>Using Rust and async streams, the data generator is also fully streaming: 
memory
+use does not increase with increasing data size / scale factors<sup>5</sup>. 
The DuckDB
+generator seems to <a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator";>require
 far more memory</a> than is commonly available on
+developer laptops and memory use increases with scale factor. With 
<code>tpchgen-cli</code>
+it is perfectly possible to create data for SF=10000 or larger on a machine 
with
+16GB of memory (assuming sufficient storage capacity).</p>
+<h2>Direct to parquet</h2>
+<p>At this point, <code>tpchgen-cli</code> could very quickly generate the TBL 
format.
+However, as described above, the TBL is annoying to work with, because</p>
+<ol>
+<li>It has no header</li>
+<li>It is like a CSV but the delimiter is <code>|</code></li>
+<li>Each line ends with an extra <code>|</code> delimiter before the newline 
🙄</li>
+<li>No system that we know can read them without additional configuration.</li>
+</ol>
+<p>We next <a href="https://github.com/clflushopt/tpchgen-rs/pull/54";>added 
support for CSV</a> generation (special thanks <a 
href="https://github.com/niebayes";>@niebayes</a> from
+Datalayers for finding and <a 
href="https://github.com/clflushopt/tpchgen-rs/issues/73";>fixing</a> <a 
href="https://github.com/clflushopt/tpchgen-rs/issues/65";>bugs</a>) which 
performs at the same
+speed as TBL. While CSV files are far more standard than TBL, they must still 
be
+parsed prior to load and automatic type inference may not deduce the types
+needed for the TPC-H benchmarks (e.g. floating point vs Decimal).</p>
+<p>What would be far more useful is a typed, efficient columnar format such as
+Apache Parquet which is supported by all modern query engines. So we <a 
href="https://github.com/clflushopt/tpchgen-rs/pull/71";>made</a> a
+<a href="https://crates.io/crates/tpchgen-arrow";>tpchgen-arrow</a> crate to 
create <a href="https://arrow.apache.org/";>Apache Arrow</a> arrays directly and 
then <a href="https://github.com/clflushopt/tpchgen-rs/pull/61";>a small
+300 line PR</a> to feed those arrays to the <a 
href="https://crates.io/crates/parquet";>Rust Parquet writer</a>, again using
+tokio for parallelized but memory bound work.</p>
+<p>This approach was simple, fast and scalable, as shown in Table 5. Even 
though
+creating Parquet files is significantly more computationally expensive than TBL
+or CSV, tpchgen-cli creates the full SF=100 parquet format dataset in less than
+45 seconds.</p>
+<table>
+<tr>
+<td><strong>Scale Factor</strong>
+</td>
+<td><strong>Time to generate Parquet</strong>
+</td>
+<td><strong>Speed compared to tbl generation</strong>
+</td>
+</tr>
+<tr>
+<td>1
+   </td>
+<td>0m1.649s
+   </td>
+<td>0.8x
+   </td>
+</tr>
+<tr>
+<td>10
+   </td>
+<td>0m5.643s
+   </td>
+<td>0.7x
+   </td>
+</tr>
+<tr>
+<td>100
+   </td>
+<td>0m45.243s
+   </td>
+<td>0.7x
+   </td>
+</tr>
+<tr>
+<td>100 (to /dev/null)
+   </td>
+<td>0m45.153s
+   </td>
+<td>0.5x
+   </td>
+</tr>
+</table>
+<p><strong>Table 5</strong>: <code>tpchgen-cli</code> Parquet generation 
performance measured with  <code>time
+target/release/tpchgen-cli -s $SCALE_FACTOR --format=parquet</code></p>
+<h1>Conclusion 👊🎤</h1>
+<p>In just a few days, with some fellow database nerds and the power of Rust, 
we built something 10x better than what currently exists. We hope it inspires 
more research
+into analytical systems using the TPC-H dataset and that people build awesome
+things with it. For example, Sean has already added <a 
href="https://github.com/GlareDB/glaredb/pull/3549";>on-demand generation of
+tables to GlareDB</a>. Please consider joining us and helping out at
+<a 
href="https://github.com/clflushopt/tpchgen-rs";>https://github.com/clflushopt/tpchgen-rs</a>.</p>
+<p>We met while working together on Apache DataFusion in various capacities. 
If you
+are looking for a community of like minded people hacking on databases, we
+welcome you to <a 
href="https://datafusion.apache.org/contributor-guide/communication.html";>come 
join us</a>. We are in the process of integrating this into
+DataFusion (see <a 
href="https://github.com/apache/datafusion/issues/14608";>apache/datafusion#14608</a>)
 if you are interested in helping 🎣</p>
+<h1>About the Authors:</h1>
+<ul>
+<li><a href="https://www.linkedin.com/in/andrewalamb/";>Andrew Lamb</a> (<a 
href="https://github.com/alamb";>@alamb</a>) is a Staff Engineer at <a 
href="https://www.influxdata.com/";>InfluxData</a> and a PMC member of <a 
href="https://datafusion.apache.org/";>Apache DataFusion</a> and <a 
href="https://arrow.apache.org/";>Apache Arrow</a>.</li>
+<li>Achraf B (<a href="https://github.com/clflushopt";>@clflushopt</a>) is a 
Software Engineer at <a href="https://optable.co/";>Optable</a> where he works 
on data infrastructure.</li>
+<li><a href="https://www.linkedin.com/in/scsmithr/";>Sean Smith</a> (<a 
href="https://github.com/scsmithr";>@scsmithr</a>) is the founder of <a 
href="https://glaredb.com/";>GlareDB</a> focused on building a fast analytics 
database.</li>
+</ul>
+<!-- Footnotes themselves at the bottom. -->
+<h2>Footnotes</h2>
+<p><em>1</em>: Actual Time: <code>30:35</code></p>
+<p><em>2</em>: It is possible to embed the dbgen code, which appears to be the 
approach taken by DuckDB. This approach was tried in GlareDB (<a 
href="https://github.com/GlareDB/glaredb/pull/3313";>GlareDB/glaredb#3313</a>), 
but ultimately shelved given the amount of effort needed to adapt and isolate 
the dbgen code.</p>
+<p><em>3</em>: It is pretty amazing to imagine the machine required to 
generate SF300 that had 1.8TB (!!) of RAM</p>
+<p><em>4</em>: We tried to <a 
href="https://github.com/clflushopt/tpchgen-rs/pull/34";>use Rayon (see 
discussion here)</a>, but could not easily keep memory bounded.</p>
+<p><em>5</em>: <code>tpchgen-cli</code> memory usage is a function of the 
number of threads:  each thread needs some buffer space</p>
+        </div>
+      </div>
+    </div>    
+    <!-- footer -->
+    <div class="row">
+      <div class="large-12 medium-12 columns">
+        <p style="font-style: italic; font-size: 0.8rem; text-align: center;">
+          Copyright 2025, <a href="https://www.apache.org/";>The Apache 
Software Foundation</a>, Licensed under the <a 
href="https://www.apache.org/licenses/LICENSE-2.0";>Apache License, Version 
2.0</a>.<br/>
+          Apache&reg; and the Apache feather logo are trademarks of The Apache 
Software Foundation.
+        </p>
+      </div>
+    </div>
+    <script src="/blog/js/bootstrap.bundle.min.js"></script>  </main>
+  </body>
+</html>
diff --git a/output/author/andrew-lamb-achraf-b-and-sean-smith.html 
b/output/author/andrew-lamb-achraf-b-and-sean-smith.html
new file mode 100644
index 0000000..7b48354
--- /dev/null
+++ b/output/author/andrew-lamb-achraf-b-and-sean-smith.html
@@ -0,0 +1,116 @@
+    <!doctype html>
+    <html class="no-js" lang="en" dir="ltr">
+    <head>
+        <meta charset="utf-8">
+        <meta http-equiv="x-ua-compatible" content="ie=edge">
+        <meta name="viewport" content="width=device-width, initial-scale=1.0">
+        <title>Apache DataFusion Blog</title>
+<link href="/blog/css/bootstrap.min.css" rel="stylesheet">
+<link href="/blog/css/fontawesome.all.min.css" rel="stylesheet">
+<link href="/blog/css/headerlink.css" rel="stylesheet">
+<link href="/blog/highlight/default.min.css" rel="stylesheet">
+<script src="/blog/highlight/highlight.js"></script>
+<script>hljs.highlightAll();</script>        <link 
href="/blog/css/blog_index.css" rel="stylesheet">
+    </head>
+    <body class="d-flex flex-column h-100">
+    <main class="flex-shrink-0">
+        <div>
+
+<!-- nav bar -->
+<nav class="navbar navbar-expand-lg navbar-dark bg-dark" aria-label="Fifth 
navbar example">
+    <div class="container-fluid">
+        <a class="navbar-brand" href="/blog"><img 
src="/blog/images/logo_original4x.png" style="height: 32px;"/> Apache 
DataFusion Blog</a>
+        <button class="navbar-toggler" type="button" data-bs-toggle="collapse" 
data-bs-target="#navbarADP" aria-controls="navbarADP" aria-expanded="false" 
aria-label="Toggle navigation">
+            <span class="navbar-toggler-icon"></span>
+        </button>
+
+        <div class="collapse navbar-collapse" id="navbarADP">
+            <ul class="navbar-nav me-auto mb-2 mb-lg-0">
+                <li class="nav-item">
+                    <a class="nav-link" href="/blog/about.html">About</a>
+                </li>
+                <li class="nav-item">
+                    <a class="nav-link" href="/blog/feed.xml">RSS</a>
+                </li>
+            </ul>
+        </div>
+    </div>
+</nav>
+            <div id="contents">
+                <div class="bg-white p-5 rounded">
+                    <div class="col-sm-8 mx-auto">
+<div id="contents">
+    <div class="bg-white p-5 rounded">
+        <div class="col-sm-8 mx-auto">
+
+            <h3>Welcome to the Apache DataFusion Blog!</h3>
+            <p><i>Here you can find the latest updates from DataFusion and 
related projects.</i></p>
+
+
+    <!-- Post -->
+    <div class="row">
+        <div class="callout">
+            <article class="post">
+                <header>
+                    <div class="title">
+                        <h1><a 
href="/blog/2025/04/10/fastest-tpch-generator">tpchgen-rs World’s fastest open 
source TPC-H data generator, written in Rust</a></h1>
+                        <p>Posted on: Thu 10 April 2025 by Andrew Lamb, Achraf 
B, and Sean Smith</p>
+                        <p><!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+-->
+<style>
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+</style>
+<p><strong>TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+ with 
dbgen</strong>.</p>
+<p>3 members of the <a href="https://datafusion.apache.org/";>Apache 
DataFusion</a> community used Rust and open source
+development to build <a 
href="https://github.com/clflushopt/tpchgen-rs";>tpchgen-rs</a>, a fully open 
TPC-H data generator over …</p></p>
+                        <footer>
+                            <ul class="actions">
+                                <div style="text-align: right"><a 
href="/blog/2025/04/10/fastest-tpch-generator" class="button medium">Continue 
Reading</a></div>
+                            </ul>
+                            <ul class="stats">
+                            </ul>
+                        </footer>
+            </article>
+        </div>
+    </div>
+
+        </div>
+    </div>
+</div>                    </div>
+                </div>
+            </div>
+
+    <!-- footer -->
+    <div class="row">
+      <div class="large-12 medium-12 columns">
+        <p style="font-style: italic; font-size: 0.8rem; text-align: center;">
+          Copyright 2025, <a href="https://www.apache.org/";>The Apache 
Software Foundation</a>, Licensed under the <a 
href="https://www.apache.org/licenses/LICENSE-2.0";>Apache License, Version 
2.0</a>.<br/>
+          Apache&reg; and the Apache feather logo are trademarks of The Apache 
Software Foundation.
+        </p>
+      </div>
+    </div>
+    <script src="/blog/js/bootstrap.bundle.min.js"></script>        </div>
+    </main>
+    </body>
+    </html>
diff --git a/output/category/blog.html b/output/category/blog.html
index 4385e63..bcf7e8c 100644
--- a/output/category/blog.html
+++ b/output/category/blog.html
@@ -47,6 +47,53 @@
             <p><i>Here you can find the latest updates from DataFusion and 
related projects.</i></p>
 
 
+    <!-- Post -->
+    <div class="row">
+        <div class="callout">
+            <article class="post">
+                <header>
+                    <div class="title">
+                        <h1><a 
href="/blog/2025/04/10/fastest-tpch-generator">tpchgen-rs World’s fastest open 
source TPC-H data generator, written in Rust</a></h1>
+                        <p>Posted on: Thu 10 April 2025 by Andrew Lamb, Achraf 
B, and Sean Smith</p>
+                        <p><!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+-->
+<style>
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+</style>
+<p><strong>TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+ with 
dbgen</strong>.</p>
+<p>3 members of the <a href="https://datafusion.apache.org/";>Apache 
DataFusion</a> community used Rust and open source
+development to build <a 
href="https://github.com/clflushopt/tpchgen-rs";>tpchgen-rs</a>, a fully open 
TPC-H data generator over …</p></p>
+                        <footer>
+                            <ul class="actions">
+                                <div style="text-align: right"><a 
href="/blog/2025/04/10/fastest-tpch-generator" class="button medium">Continue 
Reading</a></div>
+                            </ul>
+                            <ul class="stats">
+                            </ul>
+                        </footer>
+            </article>
+        </div>
+    </div>
     <!-- Post -->
     <div class="row">
         <div class="callout">
diff --git a/output/feed.xml b/output/feed.xml
index b09ca71..430f736 100644
--- a/output/feed.xml
+++ b/output/feed.xml
@@ -1,5 +1,33 @@
 <?xml version="1.0" encoding="utf-8"?>
-<rss version="2.0"><channel><title>Apache DataFusion 
Blog</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Sun,
 30 Mar 2025 00:00:00 +0000</lastBuildDate><item><title>Apache DataFusion 
Python 46.0.0 
Released</title><link>https://datafusion.apache.org/blog/2025/03/30/datafusion-python-46.0.0</link><description>&lt;!--
+<rss version="2.0"><channel><title>Apache DataFusion 
Blog</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Thu,
 10 Apr 2025 00:00:00 +0000</lastBuildDate><item><title>tpchgen-rs World’s 
fastest open source TPC-H data generator, written in 
Rust</title><link>https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-generator</link><description>&lt;!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+--&gt;
+&lt;style&gt;
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+&lt;/style&gt;
+&lt;p&gt;&lt;strong&gt;TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+ 
with dbgen&lt;/strong&gt;.&lt;/p&gt;
+&lt;p&gt;3 members of the &lt;a 
href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; community 
used Rust and open source
+development to build &lt;a 
href="https://github.com/clflushopt/tpchgen-rs"&gt;tpchgen-rs&lt;/a&gt;, a 
fully open TPC-H data generator over …&lt;/p&gt;</description><dc:creator 
xmlns:dc="http://purl.org/dc/elements/1.1/";>Andrew Lamb, Achraf B, and Sean 
Smith</dc:creator><pubDate>Thu, 10 Apr 2025 00:00:00 +0000</pubDate><guid 
isPermaLink="false">tag:datafusion.apache.org,2025-04-10:/blog/2025/04/10/fastest-tpch-generator</guid><category>blog</category></item><item><title>Apache
 DataFusion  [...]
 {% comment %}
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
diff --git a/output/feeds/all-en.atom.xml b/output/feeds/all-en.atom.xml
index 753370c..bde3316 100644
--- a/output/feeds/all-en.atom.xml
+++ b/output/feeds/all-en.atom.xml
@@ -1,5 +1,473 @@
 <?xml version="1.0" encoding="utf-8"?>
-<feed xmlns="http://www.w3.org/2005/Atom";><title>Apache DataFusion 
Blog</title><link href="https://datafusion.apache.org/blog/"; 
rel="alternate"></link><link 
href="https://datafusion.apache.org/blog/feeds/all-en.atom.xml"; 
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-03-30T00:00:00+00:00</updated><subtitle></subtitle><entry><title>Apache
 DataFusion Python 46.0.0 Released</title><link 
href="https://datafusion.apache.org/blog/2025/03/30/datafusion-python-46.0.0 
[...]
+<feed xmlns="http://www.w3.org/2005/Atom";><title>Apache DataFusion 
Blog</title><link href="https://datafusion.apache.org/blog/"; 
rel="alternate"></link><link 
href="https://datafusion.apache.org/blog/feeds/all-en.atom.xml"; 
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-04-10T00:00:00+00:00</updated><subtitle></subtitle><entry><title>tpchgen-rs
 World’s fastest open source TPC-H data generator, written in Rust</title><link 
href="https://datafusion.apache.org/blog [...]
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+--&gt;
+&lt;style&gt;
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+&lt;/style&gt;
+&lt;p&gt;&lt;strong&gt;TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+ 
with dbgen&lt;/strong&gt;.&lt;/p&gt;
+&lt;p&gt;3 members of the &lt;a 
href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; community 
used Rust and open source
+development to build &lt;a 
href="https://github.com/clflushopt/tpchgen-rs"&gt;tpchgen-rs&lt;/a&gt;, a 
fully open TPC-H data generator over …&lt;/p&gt;</summary><content 
type="html">&lt;!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+--&gt;
+&lt;style&gt;
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+&lt;/style&gt;
+&lt;p&gt;&lt;strong&gt;TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+ 
with dbgen&lt;/strong&gt;.&lt;/p&gt;
+&lt;p&gt;3 members of the &lt;a 
href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; community 
used Rust and open source
+development to build &lt;a 
href="https://github.com/clflushopt/tpchgen-rs"&gt;tpchgen-rs&lt;/a&gt;, a 
fully open TPC-H data generator over 20x
+faster than any other implementation we know of.&lt;/p&gt;
+&lt;p&gt;It is now possible to create the TPC-H SF=100 dataset in 72.23 
seconds (1.4 GB/s
+😎) on a Macbook Air M3 with 16GB of memory, compared to the classic 
&lt;code&gt;dbgen&lt;/code&gt;
+which takes 30 minutes&lt;sup&gt;1&lt;/sup&gt; (0.05GB/sec). On the same 
machine, it takes less than
+2 minutes to create all 3.6 GB of SF=100 in &lt;a 
href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt; format, which 
takes 44 minutes using &lt;a href="https://duckdb.org"&gt;DuckDB&lt;/a&gt;.
+It is finally convenient and efficient to run TPC-H queries locally when 
testing
+analytical engines such as DataFusion.&lt;/p&gt;
+&lt;p&gt;&lt;img alt="Time to create TPC-H parquet dataset for Scale Factor  
1, 10, 100 and 1000" class="img-responsive" 
src="/blog/images/fastest-tpch-generator/parquet-performance.png" 
width="80%"/&gt;&lt;/p&gt;
+&lt;p&gt;&lt;strong&gt;Figure 1&lt;/strong&gt;: Time to create TPC-H dataset 
for Scale Factor (see below) 1, 10,
+100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core 
GCP
+VM with 88GB of memory. For Scale Factor(SF) 100 
&lt;code&gt;tpchgen&lt;/code&gt; takes 1 minute and 14 seconds and
+&lt;a href="https://duckdb.org"&gt;DuckDB&lt;/a&gt; takes 17 minutes and 48 
seconds. For SF=1000, &lt;code&gt;tpchgen&lt;/code&gt; takes 10
+minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure
+DuckDB&amp;rsquo;s time as it &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator"&gt;requires
 647 GB of RAM&lt;/a&gt;, more than the 88 GB that was
+available on our test machine. The testing methodology is in the
+&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
+&lt;p&gt;This blog explains what TPC-H is, how we ported the vintage C data 
generator to
+Rust (yes, &lt;a 
href="https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/"&gt;RWIR&lt;/a&gt;)
 and optimized its performance over the course of a few weeks
+of part-time work. We began this project so we can easily generate TPC-H data 
in
+&lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; and 
&lt;a href="https://glaredb.com/"&gt;GlareDB&lt;/a&gt;.&lt;/p&gt;
+&lt;h1&gt;Try it for yourself&lt;/h1&gt;
+&lt;p&gt;The tool is entirely open source under the &lt;a 
href="https://www.apache.org/licenses/LICENSE-2.0"&gt;Apache 2.0 
license&lt;/a&gt;. Visit the &lt;a 
href="https://github.com/clflushopt/tpchgen-rs"&gt;tpchgen-rs 
repository&lt;/a&gt; or try it for yourself by run the following commands after 
&lt;a href="https://www.rust-lang.org/tools/install"&gt;installing 
Rust&lt;/a&gt;:&lt;/p&gt;
+&lt;pre&gt;&lt;code class="language-shell"&gt;$ cargo install tpchgen-cli
+
+# create SF=1 in classic TBL format
+$ tpchgen-cli -s 1 
+
+# create SF=10 in Parquet
+$ tpchgen-cli -s 10 --format=parquet
+&lt;/code&gt;&lt;/pre&gt;
+&lt;h1&gt;What is TPC-H / dbgen?&lt;/h1&gt;
+&lt;p&gt;The popular &lt;a href="https://www.tpc.org/tpch/"&gt;TPC-H&lt;/a&gt; 
benchmark (often referred to as TPCH) helps evaluate the
+performance of database systems on &lt;a 
href="https://en.wikipedia.org/wiki/Online_analytical_processing"&gt;OLAP&lt;/a&gt;
 queries, the kind used to build BI
+dashboards.&lt;/p&gt;
+&lt;p&gt;TPC-H has become a de facto standard for analytic systems. While 
there are &lt;a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf"&gt;well
+known&lt;/a&gt; limitations as the data and queries do not well represent many 
real world
+use cases, the majority of analytic database papers and industrial systems 
still
+use TPC-H query performance benchmarks as a baseline. You will inevitably find
+multiple results for  &amp;ldquo;&lt;code&gt;TPCH Performance &amp;lt;your 
favorite database&amp;gt;&lt;/code&gt;&amp;rdquo; in any
+search engine.&lt;/p&gt;
+&lt;p&gt;The benchmark was created at a time when access to high performance 
analytical
+systems was not widespread, so the &lt;a 
href="https://www.tpc.org/"&gt;Transaction Processing Performance 
Council&lt;/a&gt;
+defined a process of formal result verification. More recently, given the broad
+availability of free and open source database systems, it is common for users 
to
+run and verify TPC-H performance themselves.&lt;/p&gt;
+&lt;p&gt;TPC-H simulates a business environment with eight tables: 
&lt;code&gt;REGION&lt;/code&gt;, &lt;code&gt;NATION&lt;/code&gt;,
+&lt;code&gt;SUPPLIER&lt;/code&gt;, &lt;code&gt;CUSTOMER&lt;/code&gt;, 
&lt;code&gt;PART&lt;/code&gt;, &lt;code&gt;PARTSUPP&lt;/code&gt;, 
&lt;code&gt;ORDERS&lt;/code&gt;, and &lt;code&gt;LINEITEM&lt;/code&gt;. These
+tables are linked by foreign keys in a normalized schema representing a supply
+chain with parts, suppliers, customers and orders. The benchmark itself is 22
+SQL queries containing joins, aggregations, and sorting operations.&lt;/p&gt;
+&lt;p&gt;The queries run against data created with &lt;code&gt;&lt;a 
href="https://github.com/electrum/tpch-dbgen"&gt;dbgen&lt;/a&gt;&lt;/code&gt;, 
a program
+written in a pre &lt;a 
href="https://en.wikipedia.org/wiki/C99"&gt;C-99&lt;/a&gt; dialect, which 
generates data in a format called &lt;em&gt;TBL&lt;/em&gt;
+(example in Figure 2). &lt;code&gt;dbgen&lt;/code&gt; creates data for each of 
the 8 tables for a
+certain &lt;em&gt;Scale Factor&lt;/em&gt;, commonly abbreviated as SF. Example 
Scale Factors and
+corresponding dataset sizes are shown in Table 1. There is no theoretical upper
+bound on the Scale Factor.&lt;/p&gt;
+&lt;pre&gt;&lt;code 
class="language-text"&gt;103|2844|845|3|23|40177.32|0.01|0.04|N|O|1996-09-11|1996-09-18|1996-09-26|NONE|FOB|ironic
 accou|
+229|10540|801|6|29|42065.66|0.04|0.00|R|F|1994-01-14|1994-02-16|1994-01-22|NONE|FOB|uriously
 pending |
+263|2396|649|1|22|28564.58|0.06|0.08|R|F|1994-08-24|1994-06-20|1994-09-09|NONE|FOB|efully
 express fo|
+327|4172|427|2|9|9685.53|0.09|0.05|A|F|1995-05-24|1995-07-11|1995-06-05|NONE|AIR|
 asymptotes are fu|
+450|5627|393|4|40|61304.80|0.05|0.03|R|F|1995-03-20|1995-05-25|1995-04-14|NONE|RAIL|ve.
 asymptote|
+&lt;/code&gt;&lt;/pre&gt;
+&lt;p&gt;&lt;strong&gt;Figure 2&lt;/strong&gt;: Example TBL formatted output 
of &lt;code&gt;dbgen&lt;/code&gt; for the &lt;code&gt;LINEITEM&lt;/code&gt; 
table&lt;/p&gt;
+&lt;table&gt;
+&lt;tr&gt;
+&lt;td&gt;&lt;strong&gt;Scale Factor&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Data Size (TBL)&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Data Size (Parquet)&lt;/strong&gt;
+&lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;0.1
+   &lt;/td&gt;
+&lt;td&gt;103 Mb
+   &lt;/td&gt;
+&lt;td&gt;31 Mb
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;1
+   &lt;/td&gt;
+&lt;td&gt;1 Gb
+   &lt;/td&gt;
+&lt;td&gt;340 Mb
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;10
+   &lt;/td&gt;
+&lt;td&gt;10 Gb
+   &lt;/td&gt;
+&lt;td&gt;3.6 Gb
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;100
+   &lt;/td&gt;
+&lt;td&gt;107 Gb
+   &lt;/td&gt;
+&lt;td&gt;38 Gb
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;1000
+   &lt;/td&gt;
+&lt;td&gt;1089 Gb
+   &lt;/td&gt;
+&lt;td&gt;379 Gb
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;/table&gt;
+&lt;p&gt;&lt;strong&gt;Table 1&lt;/strong&gt;: TPC-H data set sizes at 
different scale factors for both TBL and &lt;a 
href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt;.&lt;/p&gt;
+&lt;h1&gt;Why do we need a new TPC-H Data generator?&lt;/h1&gt;
+&lt;p&gt;Despite the known limitations of the TPC-H benchmark, it is so well 
known that it
+is used frequently in database performance analysis. To run TPC-H, you must 
first
+load the data, using &lt;code&gt;dbgen&lt;/code&gt;, which is not ideal for 
several reasons:&lt;/p&gt;
+&lt;ol&gt;
+&lt;li&gt;You must find and compile a copy of the 15+ year old C program (for 
example &lt;a 
href="https://github.com/electrum/tpch-dbgen"&gt;electrum/tpch-dbgen&lt;/a&gt;)&lt;/li&gt;
+&lt;li&gt;&lt;code&gt;dbgen&lt;/code&gt; requires substantial time (Figure 3) 
and is not able to use more than one core.&lt;/li&gt;
+&lt;li&gt;It outputs TBL format, which typically requires loading into your 
database (for example, &lt;a 
href="https://github.com/apache/datafusion/blob/507f6b6773deac69dd9d90dbe60831f5ea5abed1/datafusion/sqllogictest/test_files/tpch/create_tables.slt.part#L24-L124"&gt;here
 is how to do so&lt;/a&gt; in Apache DataFusion) prior to query.&lt;/li&gt;
+&lt;li&gt;The implementation makes substantial assumptions about the operating 
environment, making it difficult to extend or embed into other 
systems.&lt;sup&gt;2&lt;/sup&gt;&lt;/li&gt;
+&lt;/ol&gt;
+&lt;p&gt;&lt;img alt="Time to generate TPC-H data in TBL format" 
class="img-responsive" 
src="/blog/images/fastest-tpch-generator/tbl-performance.png" 
width="80%"/&gt;&lt;/p&gt;
+&lt;p&gt;&lt;strong&gt;Figure 3&lt;/strong&gt;: Time to generate TPC-H data in 
TBL format. &lt;code&gt;tpchgen&lt;/code&gt; is
+shown in blue. &lt;code&gt;tpchgen&lt;/code&gt; restricted to a single core is 
shown in red. Unmodified
+&lt;code&gt;dbgen&lt;/code&gt; is shown in green and 
&lt;code&gt;dbgen&lt;/code&gt; modified to use &lt;code&gt;-O3&lt;/code&gt; 
optimization level
+is shown in yellow.&lt;/p&gt;
+&lt;p&gt;&lt;code&gt;dbgen&lt;/code&gt; is so inconvenient and takes so long 
that vendors often provide
+preloaded TPC-H data, for example &lt;a 
href="https://docs.snowflake.com/en/user-guide/sample-data-tpch"&gt;Snowflake 
Sample Data&lt;/a&gt;, &lt;a 
href="https://docs.databricks.com/aws/en/discover/databricks-datasets"&gt;Databricks
 Sample
+datasets&lt;/a&gt; and &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets"&gt;DuckDB
 Pre-Generated Data Sets&lt;/a&gt;.&lt;/p&gt;
+&lt;p&gt;In addition to pre-generated datasets, DuckDB also provides a &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html"&gt;TPC-H 
extension&lt;/a&gt; 
+for generating TPC-H datasets within DuckDB. This is so much easier to use than
+the current alternatives that it leads many researchers and other thought
+leaders to use DuckDB to evaluate new ideas. For example, &lt;a 
href="https://github.com/lmwnshn"&gt;Wan Shen
+Lim&lt;/a&gt; explicitly &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;mentioned the ease 
of creating the TPC-H dataset&lt;/a&gt; as one reason
+the first student project of &lt;a 
href="https://15799.courses.cs.cmu.edu/spring2025/"&gt;CMU-799 Spring 
2025&lt;/a&gt; used DuckDB.&lt;/p&gt;
+&lt;p&gt;As beneficial as the DuckDB TPC-H extension is, it is non-ideal for 
several reasons:&lt;/p&gt;
+&lt;ol&gt;
+&lt;li&gt;Creates data in a proprietary format, which requires export to use 
in other systems.&lt;/li&gt;
+&lt;li&gt;Requires significant time (e.g. 17 minutes for Scale Factor 
10).&lt;/li&gt;
+&lt;li&gt;Requires unnecessarily large amounts of memory (e.g. 71 GB for Scale 
Factor 10)&lt;/li&gt;
+&lt;/ol&gt;
+&lt;p&gt;The above limitations makes it impractical to generate Scale Factor 
100 and
+above on laptops or standard workstations, though DuckDB offers &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets"&gt;pre-computed
+files&lt;/a&gt; for larger factors&lt;sup&gt;3&lt;/sup&gt;.&lt;/p&gt;
+&lt;h1&gt;Why Rust?&lt;/h1&gt;
+&lt;p&gt;Realistically we used Rust because we wanted to integrate the data 
generator
+into &lt;a href="https://datafusion.apache.org/"&gt;Apache 
DataFusion&lt;/a&gt; and &lt;a 
href="https://glaredb.com/"&gt;GlareDB&lt;/a&gt;. However, we also believe Rust 
is
+superior to C/C++ due to its comparable performance, but much higher programmer
+productivity (Figure 4). Productivity in this case refers to the ease of
+optimizing and adding multithreading without introducing hard to debug memory
+safety or concurrency issues.&lt;/p&gt;
+&lt;p&gt;While Rust does allow unsafe access to memory (eliding bounds 
checking, for
+example), when required for performance, our implementation is entirely memory
+safe. The only &lt;a 
href="https://github.com/search?q=repo%3Aclflushopt%2Ftpchgen-rs%20unsafe&amp;amp;type=code"&gt;unsafe&lt;/a&gt;
 code is used to &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/blob/c651da1fc309f9cb3872cbdf71e4796904dc62c6/tpchgen/src/text.rs#L72"&gt;skip&lt;/a&gt;
 UTF8 validation on known ASCII
+strings.&lt;/p&gt;
+&lt;p&gt;&lt;img alt="Lamb Theory on Evolution of Systems Languages" 
class="img-responsive" 
src="/blog/images/fastest-tpch-generator/lamb-theory.png" 
width="80%"/&gt;&lt;/p&gt;
+&lt;p&gt;&lt;strong&gt;Figure 4&lt;/strong&gt;: Lamb Theory of System Language 
Evolution from &lt;a 
href="https://midas.bu.edu/assets/slides/andrew_lamb_slides.pdf"&gt;Boston 
University
+MiDAS Fall 2024 (Data Systems Seminar)&lt;/a&gt;, &lt;a 
href="https://www.youtube.com/watch?v=CpnxuBwHbUc"&gt;recording&lt;/a&gt;. 
Special
+thanks to &lt;a 
href="https://x.com/KurtFehlhauer"&gt;@KurtFehlhauer&lt;/a&gt;&lt;/p&gt;
+&lt;h1&gt;How: The Journey&lt;/h1&gt;
+&lt;p&gt;We did it together as a team in the open over the course of a few 
weeks.
+&lt;a href="https://github.com/lmwnshn"&gt;Wan Shen Lim&lt;/a&gt; inspired the 
project by pointing out the benefits of &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;easy TPC-H
+dataset creation&lt;/a&gt;  and &lt;a 
href="https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600"&gt;suggesting
 we check out a Java port on February 11,
+2025&lt;/a&gt;. Achraf made &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/53d3402680422a15349ece0a7ea3c3f001018ba0"&gt;first
 commit a few days later&lt;/a&gt; on February 16, and &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/9bb386a4c55b8cf93ffac1b98f29b5da990ee79e"&gt;Andrew
+and Sean started helping on March 8, 2025&lt;/a&gt; and we &lt;a 
href="https://crates.io/crates/tpchgen/0.1.0"&gt;released version 0.1&lt;/a&gt; 
on
+March 30, 2025.&lt;/p&gt;
+&lt;h2&gt;Optimizing Single Threaded Performance&lt;/h2&gt;
+&lt;p&gt;Archaf &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/16"&gt;completed the end to 
end conformance tests&lt;/a&gt;, to ensure correctness, and
+an initial &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/12"&gt;cli check 
in&lt;/a&gt; on March 15, 2025.&lt;/p&gt;
+&lt;p&gt;On a Macbook Pro M3 (Nov 2023), the initial performance numbers were 
actually
+slower than the original Java implementation which was ported 😭. This 
wasn&amp;rsquo;t
+surprising since the focus of the first version was to get a byte of byte
+compatible port, and knew about the performance shortcomings and how to 
approach
+them.&lt;/p&gt;
+&lt;table&gt;
+&lt;tr&gt;
+&lt;td&gt;&lt;strong&gt;Scale Factor&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Time&lt;/strong&gt;
+&lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;1
+   &lt;/td&gt;
+&lt;td&gt;0m10.307s
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;10
+   &lt;/td&gt;
+&lt;td&gt;1m26.530s
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;100
+   &lt;/td&gt;
+&lt;td&gt;14m56.986s
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;/table&gt;
+&lt;p&gt;&lt;strong&gt;Table 2&lt;/strong&gt;: Performance of running &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/12"&gt;the initial 
tpchgen-cli&lt;/a&gt;, measured with
+&lt;code&gt;time target/release/tpchgen-cli -s 
$SCALE_FACTOR&lt;/code&gt;&lt;/p&gt;
+&lt;p&gt;With this strong foundation we began optimizing the code using 
Rust&amp;rsquo;s low level
+memory management to improve performance while retaining memory safely. We 
spent
+several days obsessing over low level details and implemented a textbook like
+list of optimizations:&lt;/p&gt;
+&lt;ul&gt;
+&lt;li&gt;&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/19"&gt;Avoiding startup 
overhead&lt;/a&gt;,&lt;/li&gt;
+&lt;li&gt;&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/26"&gt;not&lt;/a&gt; &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/32"&gt;copying&lt;/a&gt; 
strings (many more PRs as well)&lt;/li&gt;
+&lt;li&gt;&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/27"&gt;Rust&amp;rsquo;s 
zero overhead abstractions for dates&lt;/a&gt;&lt;/li&gt;
+&lt;li&gt;&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/35"&gt;Static 
strings&lt;/a&gt; (entirely safely with static lifetimes)&lt;/li&gt;
+&lt;li&gt;&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/33"&gt;Generics to avoid 
virtual function call overhead&lt;/a&gt;&lt;/li&gt;
+&lt;li&gt;&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/62"&gt;Moving lookups from 
runtime&lt;/a&gt; to load time&lt;/li&gt;
+&lt;/ul&gt;
+&lt;p&gt;At the time of writing, single threaded performance is now 2.5x-2.7x 
faster than the initial version, as shown in Table 3.&lt;/p&gt;
+&lt;table&gt;
+&lt;tr&gt;
+&lt;td&gt;&lt;strong&gt;Scale Factor&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Time&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Times faster&lt;/strong&gt;
+&lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;1
+   &lt;/td&gt;
+&lt;td&gt;0m4.079s
+   &lt;/td&gt;
+&lt;td&gt;2.5x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;10
+   &lt;/td&gt;
+&lt;td&gt;0m31.616s
+   &lt;/td&gt;
+&lt;td&gt;2.7x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;100
+   &lt;/td&gt;
+&lt;td&gt;5m28.083s
+   &lt;/td&gt;
+&lt;td&gt;2.7x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;/table&gt;
+&lt;p&gt;&lt;strong&gt;Table 3&lt;/strong&gt;: Single threaded 
&lt;code&gt;tpchgen-cli&lt;/code&gt; performance, measured with 
&lt;code&gt;time target/release/tpchgen-cli -s $SCALE_FACTOR 
--num-threads=1&lt;/code&gt;&lt;/p&gt;
+&lt;h2&gt;Multi-threading&lt;/h2&gt;
+&lt;p&gt;Then we applied &lt;a 
href="https://doc.rust-lang.org/book/ch16-00-concurrency.html"&gt;Rust&amp;rsquo;s
 fearless concurrency&lt;/a&gt; &amp;ndash; with a single, &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/ab720a70cdc80a711f4a3dda6bac05445106f499"&gt;small
 PR&lt;/a&gt; (272
+net new lines) we updated the same memory safe code to run with multiple 
threads
+and consume bounded memory using &lt;a 
href="https://thenewstack.io/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/"&gt;tokio
 for the thread scheduler&lt;/a&gt;&lt;sup&gt;4&lt;/sup&gt;.&lt;/p&gt;
+&lt;p&gt;As shown in Table 4, with this change, tpchgen-cli generates the full 
SF=100
+dataset in 32 seconds (which is 3.3 GB/sec 🤯). Further investigation reveals
+that at SF=100 our generator is actually IO bound (which is not the case for
+&lt;code&gt;dbgen&lt;/code&gt; or &lt;code&gt;duckdb&lt;/code&gt;) &amp;ndash; 
it creates data &lt;strong&gt;faster than can be written to an 
SSD&lt;/strong&gt;.
+When writing to &lt;code&gt;/dev/null&lt;/code&gt; tpchgen  generates the 
entire dataset in 25 seconds
+(4 GB/s).&lt;/p&gt;
+&lt;table&gt;
+&lt;tr&gt;
+&lt;td&gt;&lt;strong&gt;Scale Factor&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Time&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Times faster than initial implementation&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Times faster than optimized single 
threaded&lt;/strong&gt;
+&lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;1
+   &lt;/td&gt;
+&lt;td&gt;0m1.369s
+   &lt;/td&gt;
+&lt;td&gt;7.3x
+   &lt;/td&gt;
+&lt;td&gt;3x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;10
+   &lt;/td&gt;
+&lt;td&gt;0m3.828s
+   &lt;/td&gt;
+&lt;td&gt;22.6x
+   &lt;/td&gt;
+&lt;td&gt;8.2x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;100
+   &lt;/td&gt;
+&lt;td&gt;0m32.615s
+   &lt;/td&gt;
+&lt;td&gt;27.5x
+   &lt;/td&gt;
+&lt;td&gt;10x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;100 (to /dev/null)
+   &lt;/td&gt;
+&lt;td&gt;0m25.088s
+   &lt;/td&gt;
+&lt;td&gt;35.7x
+   &lt;/td&gt;
+&lt;td&gt;13.1x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;/table&gt;
+&lt;p&gt;&lt;strong&gt;Table 4&lt;/strong&gt;: tpchgen-cli (multithreaded) 
performance measured with &lt;code&gt;time target/release/tpchgen-cli -s 
$SCALE_FACTOR&lt;/code&gt;&lt;/p&gt;
+&lt;p&gt;Using Rust and async streams, the data generator is also fully 
streaming: memory
+use does not increase with increasing data size / scale 
factors&lt;sup&gt;5&lt;/sup&gt;. The DuckDB
+generator seems to &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator"&gt;require
 far more memory&lt;/a&gt; than is commonly available on
+developer laptops and memory use increases with scale factor. With 
&lt;code&gt;tpchgen-cli&lt;/code&gt;
+it is perfectly possible to create data for SF=10000 or larger on a machine 
with
+16GB of memory (assuming sufficient storage capacity).&lt;/p&gt;
+&lt;h2&gt;Direct to parquet&lt;/h2&gt;
+&lt;p&gt;At this point, &lt;code&gt;tpchgen-cli&lt;/code&gt; could very 
quickly generate the TBL format.
+However, as described above, the TBL is annoying to work with, 
because&lt;/p&gt;
+&lt;ol&gt;
+&lt;li&gt;It has no header&lt;/li&gt;
+&lt;li&gt;It is like a CSV but the delimiter is 
&lt;code&gt;|&lt;/code&gt;&lt;/li&gt;
+&lt;li&gt;Each line ends with an extra &lt;code&gt;|&lt;/code&gt; delimiter 
before the newline 🙄&lt;/li&gt;
+&lt;li&gt;No system that we know can read them without additional 
configuration.&lt;/li&gt;
+&lt;/ol&gt;
+&lt;p&gt;We next &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/54"&gt;added support for 
CSV&lt;/a&gt; generation (special thanks &lt;a 
href="https://github.com/niebayes"&gt;@niebayes&lt;/a&gt; from
+Datalayers for finding and &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/issues/73"&gt;fixing&lt;/a&gt; 
&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/issues/65"&gt;bugs&lt;/a&gt;) 
which performs at the same
+speed as TBL. While CSV files are far more standard than TBL, they must still 
be
+parsed prior to load and automatic type inference may not deduce the types
+needed for the TPC-H benchmarks (e.g. floating point vs Decimal).&lt;/p&gt;
+&lt;p&gt;What would be far more useful is a typed, efficient columnar format 
such as
+Apache Parquet which is supported by all modern query engines. So we &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/71"&gt;made&lt;/a&gt; a
+&lt;a href="https://crates.io/crates/tpchgen-arrow"&gt;tpchgen-arrow&lt;/a&gt; 
crate to create &lt;a href="https://arrow.apache.org/"&gt;Apache 
Arrow&lt;/a&gt; arrays directly and then &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/61"&gt;a small
+300 line PR&lt;/a&gt; to feed those arrays to the &lt;a 
href="https://crates.io/crates/parquet"&gt;Rust Parquet writer&lt;/a&gt;, again 
using
+tokio for parallelized but memory bound work.&lt;/p&gt;
+&lt;p&gt;This approach was simple, fast and scalable, as shown in Table 5. 
Even though
+creating Parquet files is significantly more computationally expensive than TBL
+or CSV, tpchgen-cli creates the full SF=100 parquet format dataset in less than
+45 seconds.&lt;/p&gt;
+&lt;table&gt;
+&lt;tr&gt;
+&lt;td&gt;&lt;strong&gt;Scale Factor&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Time to generate Parquet&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Speed compared to tbl generation&lt;/strong&gt;
+&lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;1
+   &lt;/td&gt;
+&lt;td&gt;0m1.649s
+   &lt;/td&gt;
+&lt;td&gt;0.8x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;10
+   &lt;/td&gt;
+&lt;td&gt;0m5.643s
+   &lt;/td&gt;
+&lt;td&gt;0.7x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;100
+   &lt;/td&gt;
+&lt;td&gt;0m45.243s
+   &lt;/td&gt;
+&lt;td&gt;0.7x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;100 (to /dev/null)
+   &lt;/td&gt;
+&lt;td&gt;0m45.153s
+   &lt;/td&gt;
+&lt;td&gt;0.5x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;/table&gt;
+&lt;p&gt;&lt;strong&gt;Table 5&lt;/strong&gt;: 
&lt;code&gt;tpchgen-cli&lt;/code&gt; Parquet generation performance measured 
with  &lt;code&gt;time
+target/release/tpchgen-cli -s $SCALE_FACTOR 
--format=parquet&lt;/code&gt;&lt;/p&gt;
+&lt;h1&gt;Conclusion 👊🎤&lt;/h1&gt;
+&lt;p&gt;In just a few days, with some fellow database nerds and the power of 
Rust, we built something 10x better than what currently exists. We hope it 
inspires more research
+into analytical systems using the TPC-H dataset and that people build awesome
+things with it. For example, Sean has already added &lt;a 
href="https://github.com/GlareDB/glaredb/pull/3549"&gt;on-demand generation of
+tables to GlareDB&lt;/a&gt;. Please consider joining us and helping out at
+&lt;a 
href="https://github.com/clflushopt/tpchgen-rs"&gt;https://github.com/clflushopt/tpchgen-rs&lt;/a&gt;.&lt;/p&gt;
+&lt;p&gt;We met while working together on Apache DataFusion in various 
capacities. If you
+are looking for a community of like minded people hacking on databases, we
+welcome you to &lt;a 
href="https://datafusion.apache.org/contributor-guide/communication.html"&gt;come
 join us&lt;/a&gt;. We are in the process of integrating this into
+DataFusion (see &lt;a 
href="https://github.com/apache/datafusion/issues/14608"&gt;apache/datafusion#14608&lt;/a&gt;)
 if you are interested in helping 🎣&lt;/p&gt;
+&lt;h1&gt;About the Authors:&lt;/h1&gt;
+&lt;ul&gt;
+&lt;li&gt;&lt;a href="https://www.linkedin.com/in/andrewalamb/"&gt;Andrew 
Lamb&lt;/a&gt; (&lt;a href="https://github.com/alamb"&gt;@alamb&lt;/a&gt;) is a 
Staff Engineer at &lt;a 
href="https://www.influxdata.com/"&gt;InfluxData&lt;/a&gt; and a PMC member of 
&lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; and 
&lt;a href="https://arrow.apache.org/"&gt;Apache Arrow&lt;/a&gt;.&lt;/li&gt;
+&lt;li&gt;Achraf B (&lt;a 
href="https://github.com/clflushopt"&gt;@clflushopt&lt;/a&gt;) is a Software 
Engineer at &lt;a href="https://optable.co/"&gt;Optable&lt;/a&gt; where he 
works on data infrastructure.&lt;/li&gt;
+&lt;li&gt;&lt;a href="https://www.linkedin.com/in/scsmithr/"&gt;Sean 
Smith&lt;/a&gt; (&lt;a 
href="https://github.com/scsmithr"&gt;@scsmithr&lt;/a&gt;) is the founder of 
&lt;a href="https://glaredb.com/"&gt;GlareDB&lt;/a&gt; focused on building a 
fast analytics database.&lt;/li&gt;
+&lt;/ul&gt;
+&lt;!-- Footnotes themselves at the bottom. --&gt;
+&lt;h2&gt;Footnotes&lt;/h2&gt;
+&lt;p&gt;&lt;em&gt;1&lt;/em&gt;: Actual Time: 
&lt;code&gt;30:35&lt;/code&gt;&lt;/p&gt;
+&lt;p&gt;&lt;em&gt;2&lt;/em&gt;: It is possible to embed the dbgen code, which 
appears to be the approach taken by DuckDB. This approach was tried in GlareDB 
(&lt;a 
href="https://github.com/GlareDB/glaredb/pull/3313"&gt;GlareDB/glaredb#3313&lt;/a&gt;),
 but ultimately shelved given the amount of effort needed to adapt and isolate 
the dbgen code.&lt;/p&gt;
+&lt;p&gt;&lt;em&gt;3&lt;/em&gt;: It is pretty amazing to imagine the machine 
required to generate SF300 that had 1.8TB (!!) of RAM&lt;/p&gt;
+&lt;p&gt;&lt;em&gt;4&lt;/em&gt;: We tried to &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/34"&gt;use Rayon (see 
discussion here)&lt;/a&gt;, but could not easily keep memory bounded.&lt;/p&gt;
+&lt;p&gt;&lt;em&gt;5&lt;/em&gt;: &lt;code&gt;tpchgen-cli&lt;/code&gt; memory 
usage is a function of the number of threads:  each thread needs some buffer 
space&lt;/p&gt;</content><category 
term="blog"></category></entry><entry><title>Apache DataFusion Python 46.0.0 
Released</title><link 
href="https://datafusion.apache.org/blog/2025/03/30/datafusion-python-46.0.0"; 
rel="alternate"></link><published>2025-03-30T00:00:00+00:00</published><updated>2025-03-30T00:00:00+00:00</updated><author><na
 [...]
 {% comment %}
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
diff --git a/output/feeds/andrew-lamb-achraf-b-and-sean-smith.atom.xml 
b/output/feeds/andrew-lamb-achraf-b-and-sean-smith.atom.xml
new file mode 100644
index 0000000..21b8660
--- /dev/null
+++ b/output/feeds/andrew-lamb-achraf-b-and-sean-smith.atom.xml
@@ -0,0 +1,470 @@
+<?xml version="1.0" encoding="utf-8"?>
+<feed xmlns="http://www.w3.org/2005/Atom";><title>Apache DataFusion Blog - 
Andrew Lamb, Achraf B, and Sean Smith</title><link 
href="https://datafusion.apache.org/blog/"; rel="alternate"></link><link 
href="https://datafusion.apache.org/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.atom.xml";
 
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-04-10T00:00:00+00:00</updated><subtitle></subtitle><entry><title>tpchgen-rs
 World’s fastest open source TPC-H data generator,  [...]
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+--&gt;
+&lt;style&gt;
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+&lt;/style&gt;
+&lt;p&gt;&lt;strong&gt;TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+ 
with dbgen&lt;/strong&gt;.&lt;/p&gt;
+&lt;p&gt;3 members of the &lt;a 
href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; community 
used Rust and open source
+development to build &lt;a 
href="https://github.com/clflushopt/tpchgen-rs"&gt;tpchgen-rs&lt;/a&gt;, a 
fully open TPC-H data generator over …&lt;/p&gt;</summary><content 
type="html">&lt;!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+--&gt;
+&lt;style&gt;
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+&lt;/style&gt;
+&lt;p&gt;&lt;strong&gt;TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+ 
with dbgen&lt;/strong&gt;.&lt;/p&gt;
+&lt;p&gt;3 members of the &lt;a 
href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; community 
used Rust and open source
+development to build &lt;a 
href="https://github.com/clflushopt/tpchgen-rs"&gt;tpchgen-rs&lt;/a&gt;, a 
fully open TPC-H data generator over 20x
+faster than any other implementation we know of.&lt;/p&gt;
+&lt;p&gt;It is now possible to create the TPC-H SF=100 dataset in 72.23 
seconds (1.4 GB/s
+😎) on a Macbook Air M3 with 16GB of memory, compared to the classic 
&lt;code&gt;dbgen&lt;/code&gt;
+which takes 30 minutes&lt;sup&gt;1&lt;/sup&gt; (0.05GB/sec). On the same 
machine, it takes less than
+2 minutes to create all 3.6 GB of SF=100 in &lt;a 
href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt; format, which 
takes 44 minutes using &lt;a href="https://duckdb.org"&gt;DuckDB&lt;/a&gt;.
+It is finally convenient and efficient to run TPC-H queries locally when 
testing
+analytical engines such as DataFusion.&lt;/p&gt;
+&lt;p&gt;&lt;img alt="Time to create TPC-H parquet dataset for Scale Factor  
1, 10, 100 and 1000" class="img-responsive" 
src="/blog/images/fastest-tpch-generator/parquet-performance.png" 
width="80%"/&gt;&lt;/p&gt;
+&lt;p&gt;&lt;strong&gt;Figure 1&lt;/strong&gt;: Time to create TPC-H dataset 
for Scale Factor (see below) 1, 10,
+100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core 
GCP
+VM with 88GB of memory. For Scale Factor(SF) 100 
&lt;code&gt;tpchgen&lt;/code&gt; takes 1 minute and 14 seconds and
+&lt;a href="https://duckdb.org"&gt;DuckDB&lt;/a&gt; takes 17 minutes and 48 
seconds. For SF=1000, &lt;code&gt;tpchgen&lt;/code&gt; takes 10
+minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure
+DuckDB&amp;rsquo;s time as it &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator"&gt;requires
 647 GB of RAM&lt;/a&gt;, more than the 88 GB that was
+available on our test machine. The testing methodology is in the
+&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
+&lt;p&gt;This blog explains what TPC-H is, how we ported the vintage C data 
generator to
+Rust (yes, &lt;a 
href="https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/"&gt;RWIR&lt;/a&gt;)
 and optimized its performance over the course of a few weeks
+of part-time work. We began this project so we can easily generate TPC-H data 
in
+&lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; and 
&lt;a href="https://glaredb.com/"&gt;GlareDB&lt;/a&gt;.&lt;/p&gt;
+&lt;h1&gt;Try it for yourself&lt;/h1&gt;
+&lt;p&gt;The tool is entirely open source under the &lt;a 
href="https://www.apache.org/licenses/LICENSE-2.0"&gt;Apache 2.0 
license&lt;/a&gt;. Visit the &lt;a 
href="https://github.com/clflushopt/tpchgen-rs"&gt;tpchgen-rs 
repository&lt;/a&gt; or try it for yourself by run the following commands after 
&lt;a href="https://www.rust-lang.org/tools/install"&gt;installing 
Rust&lt;/a&gt;:&lt;/p&gt;
+&lt;pre&gt;&lt;code class="language-shell"&gt;$ cargo install tpchgen-cli
+
+# create SF=1 in classic TBL format
+$ tpchgen-cli -s 1 
+
+# create SF=10 in Parquet
+$ tpchgen-cli -s 10 --format=parquet
+&lt;/code&gt;&lt;/pre&gt;
+&lt;h1&gt;What is TPC-H / dbgen?&lt;/h1&gt;
+&lt;p&gt;The popular &lt;a href="https://www.tpc.org/tpch/"&gt;TPC-H&lt;/a&gt; 
benchmark (often referred to as TPCH) helps evaluate the
+performance of database systems on &lt;a 
href="https://en.wikipedia.org/wiki/Online_analytical_processing"&gt;OLAP&lt;/a&gt;
 queries, the kind used to build BI
+dashboards.&lt;/p&gt;
+&lt;p&gt;TPC-H has become a de facto standard for analytic systems. While 
there are &lt;a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf"&gt;well
+known&lt;/a&gt; limitations as the data and queries do not well represent many 
real world
+use cases, the majority of analytic database papers and industrial systems 
still
+use TPC-H query performance benchmarks as a baseline. You will inevitably find
+multiple results for  &amp;ldquo;&lt;code&gt;TPCH Performance &amp;lt;your 
favorite database&amp;gt;&lt;/code&gt;&amp;rdquo; in any
+search engine.&lt;/p&gt;
+&lt;p&gt;The benchmark was created at a time when access to high performance 
analytical
+systems was not widespread, so the &lt;a 
href="https://www.tpc.org/"&gt;Transaction Processing Performance 
Council&lt;/a&gt;
+defined a process of formal result verification. More recently, given the broad
+availability of free and open source database systems, it is common for users 
to
+run and verify TPC-H performance themselves.&lt;/p&gt;
+&lt;p&gt;TPC-H simulates a business environment with eight tables: 
&lt;code&gt;REGION&lt;/code&gt;, &lt;code&gt;NATION&lt;/code&gt;,
+&lt;code&gt;SUPPLIER&lt;/code&gt;, &lt;code&gt;CUSTOMER&lt;/code&gt;, 
&lt;code&gt;PART&lt;/code&gt;, &lt;code&gt;PARTSUPP&lt;/code&gt;, 
&lt;code&gt;ORDERS&lt;/code&gt;, and &lt;code&gt;LINEITEM&lt;/code&gt;. These
+tables are linked by foreign keys in a normalized schema representing a supply
+chain with parts, suppliers, customers and orders. The benchmark itself is 22
+SQL queries containing joins, aggregations, and sorting operations.&lt;/p&gt;
+&lt;p&gt;The queries run against data created with &lt;code&gt;&lt;a 
href="https://github.com/electrum/tpch-dbgen"&gt;dbgen&lt;/a&gt;&lt;/code&gt;, 
a program
+written in a pre &lt;a 
href="https://en.wikipedia.org/wiki/C99"&gt;C-99&lt;/a&gt; dialect, which 
generates data in a format called &lt;em&gt;TBL&lt;/em&gt;
+(example in Figure 2). &lt;code&gt;dbgen&lt;/code&gt; creates data for each of 
the 8 tables for a
+certain &lt;em&gt;Scale Factor&lt;/em&gt;, commonly abbreviated as SF. Example 
Scale Factors and
+corresponding dataset sizes are shown in Table 1. There is no theoretical upper
+bound on the Scale Factor.&lt;/p&gt;
+&lt;pre&gt;&lt;code 
class="language-text"&gt;103|2844|845|3|23|40177.32|0.01|0.04|N|O|1996-09-11|1996-09-18|1996-09-26|NONE|FOB|ironic
 accou|
+229|10540|801|6|29|42065.66|0.04|0.00|R|F|1994-01-14|1994-02-16|1994-01-22|NONE|FOB|uriously
 pending |
+263|2396|649|1|22|28564.58|0.06|0.08|R|F|1994-08-24|1994-06-20|1994-09-09|NONE|FOB|efully
 express fo|
+327|4172|427|2|9|9685.53|0.09|0.05|A|F|1995-05-24|1995-07-11|1995-06-05|NONE|AIR|
 asymptotes are fu|
+450|5627|393|4|40|61304.80|0.05|0.03|R|F|1995-03-20|1995-05-25|1995-04-14|NONE|RAIL|ve.
 asymptote|
+&lt;/code&gt;&lt;/pre&gt;
+&lt;p&gt;&lt;strong&gt;Figure 2&lt;/strong&gt;: Example TBL formatted output 
of &lt;code&gt;dbgen&lt;/code&gt; for the &lt;code&gt;LINEITEM&lt;/code&gt; 
table&lt;/p&gt;
+&lt;table&gt;
+&lt;tr&gt;
+&lt;td&gt;&lt;strong&gt;Scale Factor&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Data Size (TBL)&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Data Size (Parquet)&lt;/strong&gt;
+&lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;0.1
+   &lt;/td&gt;
+&lt;td&gt;103 Mb
+   &lt;/td&gt;
+&lt;td&gt;31 Mb
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;1
+   &lt;/td&gt;
+&lt;td&gt;1 Gb
+   &lt;/td&gt;
+&lt;td&gt;340 Mb
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;10
+   &lt;/td&gt;
+&lt;td&gt;10 Gb
+   &lt;/td&gt;
+&lt;td&gt;3.6 Gb
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;100
+   &lt;/td&gt;
+&lt;td&gt;107 Gb
+   &lt;/td&gt;
+&lt;td&gt;38 Gb
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;1000
+   &lt;/td&gt;
+&lt;td&gt;1089 Gb
+   &lt;/td&gt;
+&lt;td&gt;379 Gb
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;/table&gt;
+&lt;p&gt;&lt;strong&gt;Table 1&lt;/strong&gt;: TPC-H data set sizes at 
different scale factors for both TBL and &lt;a 
href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt;.&lt;/p&gt;
+&lt;h1&gt;Why do we need a new TPC-H Data generator?&lt;/h1&gt;
+&lt;p&gt;Despite the known limitations of the TPC-H benchmark, it is so well 
known that it
+is used frequently in database performance analysis. To run TPC-H, you must 
first
+load the data, using &lt;code&gt;dbgen&lt;/code&gt;, which is not ideal for 
several reasons:&lt;/p&gt;
+&lt;ol&gt;
+&lt;li&gt;You must find and compile a copy of the 15+ year old C program (for 
example &lt;a 
href="https://github.com/electrum/tpch-dbgen"&gt;electrum/tpch-dbgen&lt;/a&gt;)&lt;/li&gt;
+&lt;li&gt;&lt;code&gt;dbgen&lt;/code&gt; requires substantial time (Figure 3) 
and is not able to use more than one core.&lt;/li&gt;
+&lt;li&gt;It outputs TBL format, which typically requires loading into your 
database (for example, &lt;a 
href="https://github.com/apache/datafusion/blob/507f6b6773deac69dd9d90dbe60831f5ea5abed1/datafusion/sqllogictest/test_files/tpch/create_tables.slt.part#L24-L124"&gt;here
 is how to do so&lt;/a&gt; in Apache DataFusion) prior to query.&lt;/li&gt;
+&lt;li&gt;The implementation makes substantial assumptions about the operating 
environment, making it difficult to extend or embed into other 
systems.&lt;sup&gt;2&lt;/sup&gt;&lt;/li&gt;
+&lt;/ol&gt;
+&lt;p&gt;&lt;img alt="Time to generate TPC-H data in TBL format" 
class="img-responsive" 
src="/blog/images/fastest-tpch-generator/tbl-performance.png" 
width="80%"/&gt;&lt;/p&gt;
+&lt;p&gt;&lt;strong&gt;Figure 3&lt;/strong&gt;: Time to generate TPC-H data in 
TBL format. &lt;code&gt;tpchgen&lt;/code&gt; is
+shown in blue. &lt;code&gt;tpchgen&lt;/code&gt; restricted to a single core is 
shown in red. Unmodified
+&lt;code&gt;dbgen&lt;/code&gt; is shown in green and 
&lt;code&gt;dbgen&lt;/code&gt; modified to use &lt;code&gt;-O3&lt;/code&gt; 
optimization level
+is shown in yellow.&lt;/p&gt;
+&lt;p&gt;&lt;code&gt;dbgen&lt;/code&gt; is so inconvenient and takes so long 
that vendors often provide
+preloaded TPC-H data, for example &lt;a 
href="https://docs.snowflake.com/en/user-guide/sample-data-tpch"&gt;Snowflake 
Sample Data&lt;/a&gt;, &lt;a 
href="https://docs.databricks.com/aws/en/discover/databricks-datasets"&gt;Databricks
 Sample
+datasets&lt;/a&gt; and &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets"&gt;DuckDB
 Pre-Generated Data Sets&lt;/a&gt;.&lt;/p&gt;
+&lt;p&gt;In addition to pre-generated datasets, DuckDB also provides a &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html"&gt;TPC-H 
extension&lt;/a&gt; 
+for generating TPC-H datasets within DuckDB. This is so much easier to use than
+the current alternatives that it leads many researchers and other thought
+leaders to use DuckDB to evaluate new ideas. For example, &lt;a 
href="https://github.com/lmwnshn"&gt;Wan Shen
+Lim&lt;/a&gt; explicitly &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;mentioned the ease 
of creating the TPC-H dataset&lt;/a&gt; as one reason
+the first student project of &lt;a 
href="https://15799.courses.cs.cmu.edu/spring2025/"&gt;CMU-799 Spring 
2025&lt;/a&gt; used DuckDB.&lt;/p&gt;
+&lt;p&gt;As beneficial as the DuckDB TPC-H extension is, it is non-ideal for 
several reasons:&lt;/p&gt;
+&lt;ol&gt;
+&lt;li&gt;Creates data in a proprietary format, which requires export to use 
in other systems.&lt;/li&gt;
+&lt;li&gt;Requires significant time (e.g. 17 minutes for Scale Factor 
10).&lt;/li&gt;
+&lt;li&gt;Requires unnecessarily large amounts of memory (e.g. 71 GB for Scale 
Factor 10)&lt;/li&gt;
+&lt;/ol&gt;
+&lt;p&gt;The above limitations makes it impractical to generate Scale Factor 
100 and
+above on laptops or standard workstations, though DuckDB offers &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets"&gt;pre-computed
+files&lt;/a&gt; for larger factors&lt;sup&gt;3&lt;/sup&gt;.&lt;/p&gt;
+&lt;h1&gt;Why Rust?&lt;/h1&gt;
+&lt;p&gt;Realistically we used Rust because we wanted to integrate the data 
generator
+into &lt;a href="https://datafusion.apache.org/"&gt;Apache 
DataFusion&lt;/a&gt; and &lt;a 
href="https://glaredb.com/"&gt;GlareDB&lt;/a&gt;. However, we also believe Rust 
is
+superior to C/C++ due to its comparable performance, but much higher programmer
+productivity (Figure 4). Productivity in this case refers to the ease of
+optimizing and adding multithreading without introducing hard to debug memory
+safety or concurrency issues.&lt;/p&gt;
+&lt;p&gt;While Rust does allow unsafe access to memory (eliding bounds 
checking, for
+example), when required for performance, our implementation is entirely memory
+safe. The only &lt;a 
href="https://github.com/search?q=repo%3Aclflushopt%2Ftpchgen-rs%20unsafe&amp;amp;type=code"&gt;unsafe&lt;/a&gt;
 code is used to &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/blob/c651da1fc309f9cb3872cbdf71e4796904dc62c6/tpchgen/src/text.rs#L72"&gt;skip&lt;/a&gt;
 UTF8 validation on known ASCII
+strings.&lt;/p&gt;
+&lt;p&gt;&lt;img alt="Lamb Theory on Evolution of Systems Languages" 
class="img-responsive" 
src="/blog/images/fastest-tpch-generator/lamb-theory.png" 
width="80%"/&gt;&lt;/p&gt;
+&lt;p&gt;&lt;strong&gt;Figure 4&lt;/strong&gt;: Lamb Theory of System Language 
Evolution from &lt;a 
href="https://midas.bu.edu/assets/slides/andrew_lamb_slides.pdf"&gt;Boston 
University
+MiDAS Fall 2024 (Data Systems Seminar)&lt;/a&gt;, &lt;a 
href="https://www.youtube.com/watch?v=CpnxuBwHbUc"&gt;recording&lt;/a&gt;. 
Special
+thanks to &lt;a 
href="https://x.com/KurtFehlhauer"&gt;@KurtFehlhauer&lt;/a&gt;&lt;/p&gt;
+&lt;h1&gt;How: The Journey&lt;/h1&gt;
+&lt;p&gt;We did it together as a team in the open over the course of a few 
weeks.
+&lt;a href="https://github.com/lmwnshn"&gt;Wan Shen Lim&lt;/a&gt; inspired the 
project by pointing out the benefits of &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;easy TPC-H
+dataset creation&lt;/a&gt;  and &lt;a 
href="https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600"&gt;suggesting
 we check out a Java port on February 11,
+2025&lt;/a&gt;. Achraf made &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/53d3402680422a15349ece0a7ea3c3f001018ba0"&gt;first
 commit a few days later&lt;/a&gt; on February 16, and &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/9bb386a4c55b8cf93ffac1b98f29b5da990ee79e"&gt;Andrew
+and Sean started helping on March 8, 2025&lt;/a&gt; and we &lt;a 
href="https://crates.io/crates/tpchgen/0.1.0"&gt;released version 0.1&lt;/a&gt; 
on
+March 30, 2025.&lt;/p&gt;
+&lt;h2&gt;Optimizing Single Threaded Performance&lt;/h2&gt;
+&lt;p&gt;Archaf &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/16"&gt;completed the end to 
end conformance tests&lt;/a&gt;, to ensure correctness, and
+an initial &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/12"&gt;cli check 
in&lt;/a&gt; on March 15, 2025.&lt;/p&gt;
+&lt;p&gt;On a Macbook Pro M3 (Nov 2023), the initial performance numbers were 
actually
+slower than the original Java implementation which was ported 😭. This 
wasn&amp;rsquo;t
+surprising since the focus of the first version was to get a byte of byte
+compatible port, and knew about the performance shortcomings and how to 
approach
+them.&lt;/p&gt;
+&lt;table&gt;
+&lt;tr&gt;
+&lt;td&gt;&lt;strong&gt;Scale Factor&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Time&lt;/strong&gt;
+&lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;1
+   &lt;/td&gt;
+&lt;td&gt;0m10.307s
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;10
+   &lt;/td&gt;
+&lt;td&gt;1m26.530s
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;100
+   &lt;/td&gt;
+&lt;td&gt;14m56.986s
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;/table&gt;
+&lt;p&gt;&lt;strong&gt;Table 2&lt;/strong&gt;: Performance of running &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/12"&gt;the initial 
tpchgen-cli&lt;/a&gt;, measured with
+&lt;code&gt;time target/release/tpchgen-cli -s 
$SCALE_FACTOR&lt;/code&gt;&lt;/p&gt;
+&lt;p&gt;With this strong foundation we began optimizing the code using 
Rust&amp;rsquo;s low level
+memory management to improve performance while retaining memory safely. We 
spent
+several days obsessing over low level details and implemented a textbook like
+list of optimizations:&lt;/p&gt;
+&lt;ul&gt;
+&lt;li&gt;&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/19"&gt;Avoiding startup 
overhead&lt;/a&gt;,&lt;/li&gt;
+&lt;li&gt;&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/26"&gt;not&lt;/a&gt; &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/32"&gt;copying&lt;/a&gt; 
strings (many more PRs as well)&lt;/li&gt;
+&lt;li&gt;&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/27"&gt;Rust&amp;rsquo;s 
zero overhead abstractions for dates&lt;/a&gt;&lt;/li&gt;
+&lt;li&gt;&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/35"&gt;Static 
strings&lt;/a&gt; (entirely safely with static lifetimes)&lt;/li&gt;
+&lt;li&gt;&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/33"&gt;Generics to avoid 
virtual function call overhead&lt;/a&gt;&lt;/li&gt;
+&lt;li&gt;&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/62"&gt;Moving lookups from 
runtime&lt;/a&gt; to load time&lt;/li&gt;
+&lt;/ul&gt;
+&lt;p&gt;At the time of writing, single threaded performance is now 2.5x-2.7x 
faster than the initial version, as shown in Table 3.&lt;/p&gt;
+&lt;table&gt;
+&lt;tr&gt;
+&lt;td&gt;&lt;strong&gt;Scale Factor&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Time&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Times faster&lt;/strong&gt;
+&lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;1
+   &lt;/td&gt;
+&lt;td&gt;0m4.079s
+   &lt;/td&gt;
+&lt;td&gt;2.5x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;10
+   &lt;/td&gt;
+&lt;td&gt;0m31.616s
+   &lt;/td&gt;
+&lt;td&gt;2.7x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;100
+   &lt;/td&gt;
+&lt;td&gt;5m28.083s
+   &lt;/td&gt;
+&lt;td&gt;2.7x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;/table&gt;
+&lt;p&gt;&lt;strong&gt;Table 3&lt;/strong&gt;: Single threaded 
&lt;code&gt;tpchgen-cli&lt;/code&gt; performance, measured with 
&lt;code&gt;time target/release/tpchgen-cli -s $SCALE_FACTOR 
--num-threads=1&lt;/code&gt;&lt;/p&gt;
+&lt;h2&gt;Multi-threading&lt;/h2&gt;
+&lt;p&gt;Then we applied &lt;a 
href="https://doc.rust-lang.org/book/ch16-00-concurrency.html"&gt;Rust&amp;rsquo;s
 fearless concurrency&lt;/a&gt; &amp;ndash; with a single, &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/ab720a70cdc80a711f4a3dda6bac05445106f499"&gt;small
 PR&lt;/a&gt; (272
+net new lines) we updated the same memory safe code to run with multiple 
threads
+and consume bounded memory using &lt;a 
href="https://thenewstack.io/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/"&gt;tokio
 for the thread scheduler&lt;/a&gt;&lt;sup&gt;4&lt;/sup&gt;.&lt;/p&gt;
+&lt;p&gt;As shown in Table 4, with this change, tpchgen-cli generates the full 
SF=100
+dataset in 32 seconds (which is 3.3 GB/sec 🤯). Further investigation reveals
+that at SF=100 our generator is actually IO bound (which is not the case for
+&lt;code&gt;dbgen&lt;/code&gt; or &lt;code&gt;duckdb&lt;/code&gt;) &amp;ndash; 
it creates data &lt;strong&gt;faster than can be written to an 
SSD&lt;/strong&gt;.
+When writing to &lt;code&gt;/dev/null&lt;/code&gt; tpchgen  generates the 
entire dataset in 25 seconds
+(4 GB/s).&lt;/p&gt;
+&lt;table&gt;
+&lt;tr&gt;
+&lt;td&gt;&lt;strong&gt;Scale Factor&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Time&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Times faster than initial implementation&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Times faster than optimized single 
threaded&lt;/strong&gt;
+&lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;1
+   &lt;/td&gt;
+&lt;td&gt;0m1.369s
+   &lt;/td&gt;
+&lt;td&gt;7.3x
+   &lt;/td&gt;
+&lt;td&gt;3x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;10
+   &lt;/td&gt;
+&lt;td&gt;0m3.828s
+   &lt;/td&gt;
+&lt;td&gt;22.6x
+   &lt;/td&gt;
+&lt;td&gt;8.2x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;100
+   &lt;/td&gt;
+&lt;td&gt;0m32.615s
+   &lt;/td&gt;
+&lt;td&gt;27.5x
+   &lt;/td&gt;
+&lt;td&gt;10x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;100 (to /dev/null)
+   &lt;/td&gt;
+&lt;td&gt;0m25.088s
+   &lt;/td&gt;
+&lt;td&gt;35.7x
+   &lt;/td&gt;
+&lt;td&gt;13.1x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;/table&gt;
+&lt;p&gt;&lt;strong&gt;Table 4&lt;/strong&gt;: tpchgen-cli (multithreaded) 
performance measured with &lt;code&gt;time target/release/tpchgen-cli -s 
$SCALE_FACTOR&lt;/code&gt;&lt;/p&gt;
+&lt;p&gt;Using Rust and async streams, the data generator is also fully 
streaming: memory
+use does not increase with increasing data size / scale 
factors&lt;sup&gt;5&lt;/sup&gt;. The DuckDB
+generator seems to &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator"&gt;require
 far more memory&lt;/a&gt; than is commonly available on
+developer laptops and memory use increases with scale factor. With 
&lt;code&gt;tpchgen-cli&lt;/code&gt;
+it is perfectly possible to create data for SF=10000 or larger on a machine 
with
+16GB of memory (assuming sufficient storage capacity).&lt;/p&gt;
+&lt;h2&gt;Direct to parquet&lt;/h2&gt;
+&lt;p&gt;At this point, &lt;code&gt;tpchgen-cli&lt;/code&gt; could very 
quickly generate the TBL format.
+However, as described above, the TBL is annoying to work with, 
because&lt;/p&gt;
+&lt;ol&gt;
+&lt;li&gt;It has no header&lt;/li&gt;
+&lt;li&gt;It is like a CSV but the delimiter is 
&lt;code&gt;|&lt;/code&gt;&lt;/li&gt;
+&lt;li&gt;Each line ends with an extra &lt;code&gt;|&lt;/code&gt; delimiter 
before the newline 🙄&lt;/li&gt;
+&lt;li&gt;No system that we know can read them without additional 
configuration.&lt;/li&gt;
+&lt;/ol&gt;
+&lt;p&gt;We next &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/54"&gt;added support for 
CSV&lt;/a&gt; generation (special thanks &lt;a 
href="https://github.com/niebayes"&gt;@niebayes&lt;/a&gt; from
+Datalayers for finding and &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/issues/73"&gt;fixing&lt;/a&gt; 
&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/issues/65"&gt;bugs&lt;/a&gt;) 
which performs at the same
+speed as TBL. While CSV files are far more standard than TBL, they must still 
be
+parsed prior to load and automatic type inference may not deduce the types
+needed for the TPC-H benchmarks (e.g. floating point vs Decimal).&lt;/p&gt;
+&lt;p&gt;What would be far more useful is a typed, efficient columnar format 
such as
+Apache Parquet which is supported by all modern query engines. So we &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/71"&gt;made&lt;/a&gt; a
+&lt;a href="https://crates.io/crates/tpchgen-arrow"&gt;tpchgen-arrow&lt;/a&gt; 
crate to create &lt;a href="https://arrow.apache.org/"&gt;Apache 
Arrow&lt;/a&gt; arrays directly and then &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/61"&gt;a small
+300 line PR&lt;/a&gt; to feed those arrays to the &lt;a 
href="https://crates.io/crates/parquet"&gt;Rust Parquet writer&lt;/a&gt;, again 
using
+tokio for parallelized but memory bound work.&lt;/p&gt;
+&lt;p&gt;This approach was simple, fast and scalable, as shown in Table 5. 
Even though
+creating Parquet files is significantly more computationally expensive than TBL
+or CSV, tpchgen-cli creates the full SF=100 parquet format dataset in less than
+45 seconds.&lt;/p&gt;
+&lt;table&gt;
+&lt;tr&gt;
+&lt;td&gt;&lt;strong&gt;Scale Factor&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Time to generate Parquet&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Speed compared to tbl generation&lt;/strong&gt;
+&lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;1
+   &lt;/td&gt;
+&lt;td&gt;0m1.649s
+   &lt;/td&gt;
+&lt;td&gt;0.8x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;10
+   &lt;/td&gt;
+&lt;td&gt;0m5.643s
+   &lt;/td&gt;
+&lt;td&gt;0.7x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;100
+   &lt;/td&gt;
+&lt;td&gt;0m45.243s
+   &lt;/td&gt;
+&lt;td&gt;0.7x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;100 (to /dev/null)
+   &lt;/td&gt;
+&lt;td&gt;0m45.153s
+   &lt;/td&gt;
+&lt;td&gt;0.5x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;/table&gt;
+&lt;p&gt;&lt;strong&gt;Table 5&lt;/strong&gt;: 
&lt;code&gt;tpchgen-cli&lt;/code&gt; Parquet generation performance measured 
with  &lt;code&gt;time
+target/release/tpchgen-cli -s $SCALE_FACTOR 
--format=parquet&lt;/code&gt;&lt;/p&gt;
+&lt;h1&gt;Conclusion 👊🎤&lt;/h1&gt;
+&lt;p&gt;In just a few days, with some fellow database nerds and the power of 
Rust, we built something 10x better than what currently exists. We hope it 
inspires more research
+into analytical systems using the TPC-H dataset and that people build awesome
+things with it. For example, Sean has already added &lt;a 
href="https://github.com/GlareDB/glaredb/pull/3549"&gt;on-demand generation of
+tables to GlareDB&lt;/a&gt;. Please consider joining us and helping out at
+&lt;a 
href="https://github.com/clflushopt/tpchgen-rs"&gt;https://github.com/clflushopt/tpchgen-rs&lt;/a&gt;.&lt;/p&gt;
+&lt;p&gt;We met while working together on Apache DataFusion in various 
capacities. If you
+are looking for a community of like minded people hacking on databases, we
+welcome you to &lt;a 
href="https://datafusion.apache.org/contributor-guide/communication.html"&gt;come
 join us&lt;/a&gt;. We are in the process of integrating this into
+DataFusion (see &lt;a 
href="https://github.com/apache/datafusion/issues/14608"&gt;apache/datafusion#14608&lt;/a&gt;)
 if you are interested in helping 🎣&lt;/p&gt;
+&lt;h1&gt;About the Authors:&lt;/h1&gt;
+&lt;ul&gt;
+&lt;li&gt;&lt;a href="https://www.linkedin.com/in/andrewalamb/"&gt;Andrew 
Lamb&lt;/a&gt; (&lt;a href="https://github.com/alamb"&gt;@alamb&lt;/a&gt;) is a 
Staff Engineer at &lt;a 
href="https://www.influxdata.com/"&gt;InfluxData&lt;/a&gt; and a PMC member of 
&lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; and 
&lt;a href="https://arrow.apache.org/"&gt;Apache Arrow&lt;/a&gt;.&lt;/li&gt;
+&lt;li&gt;Achraf B (&lt;a 
href="https://github.com/clflushopt"&gt;@clflushopt&lt;/a&gt;) is a Software 
Engineer at &lt;a href="https://optable.co/"&gt;Optable&lt;/a&gt; where he 
works on data infrastructure.&lt;/li&gt;
+&lt;li&gt;&lt;a href="https://www.linkedin.com/in/scsmithr/"&gt;Sean 
Smith&lt;/a&gt; (&lt;a 
href="https://github.com/scsmithr"&gt;@scsmithr&lt;/a&gt;) is the founder of 
&lt;a href="https://glaredb.com/"&gt;GlareDB&lt;/a&gt; focused on building a 
fast analytics database.&lt;/li&gt;
+&lt;/ul&gt;
+&lt;!-- Footnotes themselves at the bottom. --&gt;
+&lt;h2&gt;Footnotes&lt;/h2&gt;
+&lt;p&gt;&lt;em&gt;1&lt;/em&gt;: Actual Time: 
&lt;code&gt;30:35&lt;/code&gt;&lt;/p&gt;
+&lt;p&gt;&lt;em&gt;2&lt;/em&gt;: It is possible to embed the dbgen code, which 
appears to be the approach taken by DuckDB. This approach was tried in GlareDB 
(&lt;a 
href="https://github.com/GlareDB/glaredb/pull/3313"&gt;GlareDB/glaredb#3313&lt;/a&gt;),
 but ultimately shelved given the amount of effort needed to adapt and isolate 
the dbgen code.&lt;/p&gt;
+&lt;p&gt;&lt;em&gt;3&lt;/em&gt;: It is pretty amazing to imagine the machine 
required to generate SF300 that had 1.8TB (!!) of RAM&lt;/p&gt;
+&lt;p&gt;&lt;em&gt;4&lt;/em&gt;: We tried to &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/34"&gt;use Rayon (see 
discussion here)&lt;/a&gt;, but could not easily keep memory bounded.&lt;/p&gt;
+&lt;p&gt;&lt;em&gt;5&lt;/em&gt;: &lt;code&gt;tpchgen-cli&lt;/code&gt; memory 
usage is a function of the number of threads:  each thread needs some buffer 
space&lt;/p&gt;</content><category term="blog"></category></entry></feed>
\ No newline at end of file
diff --git a/output/feeds/andrew-lamb-achraf-b-and-sean-smith.rss.xml 
b/output/feeds/andrew-lamb-achraf-b-and-sean-smith.rss.xml
new file mode 100644
index 0000000..616048a
--- /dev/null
+++ b/output/feeds/andrew-lamb-achraf-b-and-sean-smith.rss.xml
@@ -0,0 +1,30 @@
+<?xml version="1.0" encoding="utf-8"?>
+<rss version="2.0"><channel><title>Apache DataFusion Blog - Andrew Lamb, 
Achraf B, and Sean 
Smith</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Thu,
 10 Apr 2025 00:00:00 +0000</lastBuildDate><item><title>tpchgen-rs World’s 
fastest open source TPC-H data generator, written in 
Rust</title><link>https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-generator</link><description>&lt;!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+--&gt;
+&lt;style&gt;
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+&lt;/style&gt;
+&lt;p&gt;&lt;strong&gt;TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+ 
with dbgen&lt;/strong&gt;.&lt;/p&gt;
+&lt;p&gt;3 members of the &lt;a 
href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; community 
used Rust and open source
+development to build &lt;a 
href="https://github.com/clflushopt/tpchgen-rs"&gt;tpchgen-rs&lt;/a&gt;, a 
fully open TPC-H data generator over …&lt;/p&gt;</description><dc:creator 
xmlns:dc="http://purl.org/dc/elements/1.1/";>Andrew Lamb, Achraf B, and Sean 
Smith</dc:creator><pubDate>Thu, 10 Apr 2025 00:00:00 +0000</pubDate><guid 
isPermaLink="false">tag:datafusion.apache.org,2025-04-10:/blog/2025/04/10/fastest-tpch-generator</guid><category>blog</category></item></channel></rss>
\ No newline at end of file
diff --git a/output/feeds/blog.atom.xml b/output/feeds/blog.atom.xml
index dc80333..c3fdf9f 100644
--- a/output/feeds/blog.atom.xml
+++ b/output/feeds/blog.atom.xml
@@ -1,5 +1,473 @@
 <?xml version="1.0" encoding="utf-8"?>
-<feed xmlns="http://www.w3.org/2005/Atom";><title>Apache DataFusion Blog - 
blog</title><link href="https://datafusion.apache.org/blog/"; 
rel="alternate"></link><link 
href="https://datafusion.apache.org/blog/feeds/blog.atom.xml"; 
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-03-30T00:00:00+00:00</updated><subtitle></subtitle><entry><title>Apache
 DataFusion Python 46.0.0 Released</title><link 
href="https://datafusion.apache.org/blog/2025/03/30/datafusion-python-4 [...]
+<feed xmlns="http://www.w3.org/2005/Atom";><title>Apache DataFusion Blog - 
blog</title><link href="https://datafusion.apache.org/blog/"; 
rel="alternate"></link><link 
href="https://datafusion.apache.org/blog/feeds/blog.atom.xml"; 
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-04-10T00:00:00+00:00</updated><subtitle></subtitle><entry><title>tpchgen-rs
 World’s fastest open source TPC-H data generator, written in Rust</title><link 
href="https://datafusion.apache.org [...]
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+--&gt;
+&lt;style&gt;
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+&lt;/style&gt;
+&lt;p&gt;&lt;strong&gt;TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+ 
with dbgen&lt;/strong&gt;.&lt;/p&gt;
+&lt;p&gt;3 members of the &lt;a 
href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; community 
used Rust and open source
+development to build &lt;a 
href="https://github.com/clflushopt/tpchgen-rs"&gt;tpchgen-rs&lt;/a&gt;, a 
fully open TPC-H data generator over …&lt;/p&gt;</summary><content 
type="html">&lt;!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+--&gt;
+&lt;style&gt;
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+&lt;/style&gt;
+&lt;p&gt;&lt;strong&gt;TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+ 
with dbgen&lt;/strong&gt;.&lt;/p&gt;
+&lt;p&gt;3 members of the &lt;a 
href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; community 
used Rust and open source
+development to build &lt;a 
href="https://github.com/clflushopt/tpchgen-rs"&gt;tpchgen-rs&lt;/a&gt;, a 
fully open TPC-H data generator over 20x
+faster than any other implementation we know of.&lt;/p&gt;
+&lt;p&gt;It is now possible to create the TPC-H SF=100 dataset in 72.23 
seconds (1.4 GB/s
+😎) on a Macbook Air M3 with 16GB of memory, compared to the classic 
&lt;code&gt;dbgen&lt;/code&gt;
+which takes 30 minutes&lt;sup&gt;1&lt;/sup&gt; (0.05GB/sec). On the same 
machine, it takes less than
+2 minutes to create all 3.6 GB of SF=100 in &lt;a 
href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt; format, which 
takes 44 minutes using &lt;a href="https://duckdb.org"&gt;DuckDB&lt;/a&gt;.
+It is finally convenient and efficient to run TPC-H queries locally when 
testing
+analytical engines such as DataFusion.&lt;/p&gt;
+&lt;p&gt;&lt;img alt="Time to create TPC-H parquet dataset for Scale Factor  
1, 10, 100 and 1000" class="img-responsive" 
src="/blog/images/fastest-tpch-generator/parquet-performance.png" 
width="80%"/&gt;&lt;/p&gt;
+&lt;p&gt;&lt;strong&gt;Figure 1&lt;/strong&gt;: Time to create TPC-H dataset 
for Scale Factor (see below) 1, 10,
+100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core 
GCP
+VM with 88GB of memory. For Scale Factor(SF) 100 
&lt;code&gt;tpchgen&lt;/code&gt; takes 1 minute and 14 seconds and
+&lt;a href="https://duckdb.org"&gt;DuckDB&lt;/a&gt; takes 17 minutes and 48 
seconds. For SF=1000, &lt;code&gt;tpchgen&lt;/code&gt; takes 10
+minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure
+DuckDB&amp;rsquo;s time as it &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator"&gt;requires
 647 GB of RAM&lt;/a&gt;, more than the 88 GB that was
+available on our test machine. The testing methodology is in the
+&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
+&lt;p&gt;This blog explains what TPC-H is, how we ported the vintage C data 
generator to
+Rust (yes, &lt;a 
href="https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/"&gt;RWIR&lt;/a&gt;)
 and optimized its performance over the course of a few weeks
+of part-time work. We began this project so we can easily generate TPC-H data 
in
+&lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; and 
&lt;a href="https://glaredb.com/"&gt;GlareDB&lt;/a&gt;.&lt;/p&gt;
+&lt;h1&gt;Try it for yourself&lt;/h1&gt;
+&lt;p&gt;The tool is entirely open source under the &lt;a 
href="https://www.apache.org/licenses/LICENSE-2.0"&gt;Apache 2.0 
license&lt;/a&gt;. Visit the &lt;a 
href="https://github.com/clflushopt/tpchgen-rs"&gt;tpchgen-rs 
repository&lt;/a&gt; or try it for yourself by run the following commands after 
&lt;a href="https://www.rust-lang.org/tools/install"&gt;installing 
Rust&lt;/a&gt;:&lt;/p&gt;
+&lt;pre&gt;&lt;code class="language-shell"&gt;$ cargo install tpchgen-cli
+
+# create SF=1 in classic TBL format
+$ tpchgen-cli -s 1 
+
+# create SF=10 in Parquet
+$ tpchgen-cli -s 10 --format=parquet
+&lt;/code&gt;&lt;/pre&gt;
+&lt;h1&gt;What is TPC-H / dbgen?&lt;/h1&gt;
+&lt;p&gt;The popular &lt;a href="https://www.tpc.org/tpch/"&gt;TPC-H&lt;/a&gt; 
benchmark (often referred to as TPCH) helps evaluate the
+performance of database systems on &lt;a 
href="https://en.wikipedia.org/wiki/Online_analytical_processing"&gt;OLAP&lt;/a&gt;
 queries, the kind used to build BI
+dashboards.&lt;/p&gt;
+&lt;p&gt;TPC-H has become a de facto standard for analytic systems. While 
there are &lt;a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf"&gt;well
+known&lt;/a&gt; limitations as the data and queries do not well represent many 
real world
+use cases, the majority of analytic database papers and industrial systems 
still
+use TPC-H query performance benchmarks as a baseline. You will inevitably find
+multiple results for  &amp;ldquo;&lt;code&gt;TPCH Performance &amp;lt;your 
favorite database&amp;gt;&lt;/code&gt;&amp;rdquo; in any
+search engine.&lt;/p&gt;
+&lt;p&gt;The benchmark was created at a time when access to high performance 
analytical
+systems was not widespread, so the &lt;a 
href="https://www.tpc.org/"&gt;Transaction Processing Performance 
Council&lt;/a&gt;
+defined a process of formal result verification. More recently, given the broad
+availability of free and open source database systems, it is common for users 
to
+run and verify TPC-H performance themselves.&lt;/p&gt;
+&lt;p&gt;TPC-H simulates a business environment with eight tables: 
&lt;code&gt;REGION&lt;/code&gt;, &lt;code&gt;NATION&lt;/code&gt;,
+&lt;code&gt;SUPPLIER&lt;/code&gt;, &lt;code&gt;CUSTOMER&lt;/code&gt;, 
&lt;code&gt;PART&lt;/code&gt;, &lt;code&gt;PARTSUPP&lt;/code&gt;, 
&lt;code&gt;ORDERS&lt;/code&gt;, and &lt;code&gt;LINEITEM&lt;/code&gt;. These
+tables are linked by foreign keys in a normalized schema representing a supply
+chain with parts, suppliers, customers and orders. The benchmark itself is 22
+SQL queries containing joins, aggregations, and sorting operations.&lt;/p&gt;
+&lt;p&gt;The queries run against data created with &lt;code&gt;&lt;a 
href="https://github.com/electrum/tpch-dbgen"&gt;dbgen&lt;/a&gt;&lt;/code&gt;, 
a program
+written in a pre &lt;a 
href="https://en.wikipedia.org/wiki/C99"&gt;C-99&lt;/a&gt; dialect, which 
generates data in a format called &lt;em&gt;TBL&lt;/em&gt;
+(example in Figure 2). &lt;code&gt;dbgen&lt;/code&gt; creates data for each of 
the 8 tables for a
+certain &lt;em&gt;Scale Factor&lt;/em&gt;, commonly abbreviated as SF. Example 
Scale Factors and
+corresponding dataset sizes are shown in Table 1. There is no theoretical upper
+bound on the Scale Factor.&lt;/p&gt;
+&lt;pre&gt;&lt;code 
class="language-text"&gt;103|2844|845|3|23|40177.32|0.01|0.04|N|O|1996-09-11|1996-09-18|1996-09-26|NONE|FOB|ironic
 accou|
+229|10540|801|6|29|42065.66|0.04|0.00|R|F|1994-01-14|1994-02-16|1994-01-22|NONE|FOB|uriously
 pending |
+263|2396|649|1|22|28564.58|0.06|0.08|R|F|1994-08-24|1994-06-20|1994-09-09|NONE|FOB|efully
 express fo|
+327|4172|427|2|9|9685.53|0.09|0.05|A|F|1995-05-24|1995-07-11|1995-06-05|NONE|AIR|
 asymptotes are fu|
+450|5627|393|4|40|61304.80|0.05|0.03|R|F|1995-03-20|1995-05-25|1995-04-14|NONE|RAIL|ve.
 asymptote|
+&lt;/code&gt;&lt;/pre&gt;
+&lt;p&gt;&lt;strong&gt;Figure 2&lt;/strong&gt;: Example TBL formatted output 
of &lt;code&gt;dbgen&lt;/code&gt; for the &lt;code&gt;LINEITEM&lt;/code&gt; 
table&lt;/p&gt;
+&lt;table&gt;
+&lt;tr&gt;
+&lt;td&gt;&lt;strong&gt;Scale Factor&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Data Size (TBL)&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Data Size (Parquet)&lt;/strong&gt;
+&lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;0.1
+   &lt;/td&gt;
+&lt;td&gt;103 Mb
+   &lt;/td&gt;
+&lt;td&gt;31 Mb
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;1
+   &lt;/td&gt;
+&lt;td&gt;1 Gb
+   &lt;/td&gt;
+&lt;td&gt;340 Mb
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;10
+   &lt;/td&gt;
+&lt;td&gt;10 Gb
+   &lt;/td&gt;
+&lt;td&gt;3.6 Gb
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;100
+   &lt;/td&gt;
+&lt;td&gt;107 Gb
+   &lt;/td&gt;
+&lt;td&gt;38 Gb
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;1000
+   &lt;/td&gt;
+&lt;td&gt;1089 Gb
+   &lt;/td&gt;
+&lt;td&gt;379 Gb
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;/table&gt;
+&lt;p&gt;&lt;strong&gt;Table 1&lt;/strong&gt;: TPC-H data set sizes at 
different scale factors for both TBL and &lt;a 
href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt;.&lt;/p&gt;
+&lt;h1&gt;Why do we need a new TPC-H Data generator?&lt;/h1&gt;
+&lt;p&gt;Despite the known limitations of the TPC-H benchmark, it is so well 
known that it
+is used frequently in database performance analysis. To run TPC-H, you must 
first
+load the data, using &lt;code&gt;dbgen&lt;/code&gt;, which is not ideal for 
several reasons:&lt;/p&gt;
+&lt;ol&gt;
+&lt;li&gt;You must find and compile a copy of the 15+ year old C program (for 
example &lt;a 
href="https://github.com/electrum/tpch-dbgen"&gt;electrum/tpch-dbgen&lt;/a&gt;)&lt;/li&gt;
+&lt;li&gt;&lt;code&gt;dbgen&lt;/code&gt; requires substantial time (Figure 3) 
and is not able to use more than one core.&lt;/li&gt;
+&lt;li&gt;It outputs TBL format, which typically requires loading into your 
database (for example, &lt;a 
href="https://github.com/apache/datafusion/blob/507f6b6773deac69dd9d90dbe60831f5ea5abed1/datafusion/sqllogictest/test_files/tpch/create_tables.slt.part#L24-L124"&gt;here
 is how to do so&lt;/a&gt; in Apache DataFusion) prior to query.&lt;/li&gt;
+&lt;li&gt;The implementation makes substantial assumptions about the operating 
environment, making it difficult to extend or embed into other 
systems.&lt;sup&gt;2&lt;/sup&gt;&lt;/li&gt;
+&lt;/ol&gt;
+&lt;p&gt;&lt;img alt="Time to generate TPC-H data in TBL format" 
class="img-responsive" 
src="/blog/images/fastest-tpch-generator/tbl-performance.png" 
width="80%"/&gt;&lt;/p&gt;
+&lt;p&gt;&lt;strong&gt;Figure 3&lt;/strong&gt;: Time to generate TPC-H data in 
TBL format. &lt;code&gt;tpchgen&lt;/code&gt; is
+shown in blue. &lt;code&gt;tpchgen&lt;/code&gt; restricted to a single core is 
shown in red. Unmodified
+&lt;code&gt;dbgen&lt;/code&gt; is shown in green and 
&lt;code&gt;dbgen&lt;/code&gt; modified to use &lt;code&gt;-O3&lt;/code&gt; 
optimization level
+is shown in yellow.&lt;/p&gt;
+&lt;p&gt;&lt;code&gt;dbgen&lt;/code&gt; is so inconvenient and takes so long 
that vendors often provide
+preloaded TPC-H data, for example &lt;a 
href="https://docs.snowflake.com/en/user-guide/sample-data-tpch"&gt;Snowflake 
Sample Data&lt;/a&gt;, &lt;a 
href="https://docs.databricks.com/aws/en/discover/databricks-datasets"&gt;Databricks
 Sample
+datasets&lt;/a&gt; and &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets"&gt;DuckDB
 Pre-Generated Data Sets&lt;/a&gt;.&lt;/p&gt;
+&lt;p&gt;In addition to pre-generated datasets, DuckDB also provides a &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html"&gt;TPC-H 
extension&lt;/a&gt; 
+for generating TPC-H datasets within DuckDB. This is so much easier to use than
+the current alternatives that it leads many researchers and other thought
+leaders to use DuckDB to evaluate new ideas. For example, &lt;a 
href="https://github.com/lmwnshn"&gt;Wan Shen
+Lim&lt;/a&gt; explicitly &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;mentioned the ease 
of creating the TPC-H dataset&lt;/a&gt; as one reason
+the first student project of &lt;a 
href="https://15799.courses.cs.cmu.edu/spring2025/"&gt;CMU-799 Spring 
2025&lt;/a&gt; used DuckDB.&lt;/p&gt;
+&lt;p&gt;As beneficial as the DuckDB TPC-H extension is, it is non-ideal for 
several reasons:&lt;/p&gt;
+&lt;ol&gt;
+&lt;li&gt;Creates data in a proprietary format, which requires export to use 
in other systems.&lt;/li&gt;
+&lt;li&gt;Requires significant time (e.g. 17 minutes for Scale Factor 
10).&lt;/li&gt;
+&lt;li&gt;Requires unnecessarily large amounts of memory (e.g. 71 GB for Scale 
Factor 10)&lt;/li&gt;
+&lt;/ol&gt;
+&lt;p&gt;The above limitations makes it impractical to generate Scale Factor 
100 and
+above on laptops or standard workstations, though DuckDB offers &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets"&gt;pre-computed
+files&lt;/a&gt; for larger factors&lt;sup&gt;3&lt;/sup&gt;.&lt;/p&gt;
+&lt;h1&gt;Why Rust?&lt;/h1&gt;
+&lt;p&gt;Realistically we used Rust because we wanted to integrate the data 
generator
+into &lt;a href="https://datafusion.apache.org/"&gt;Apache 
DataFusion&lt;/a&gt; and &lt;a 
href="https://glaredb.com/"&gt;GlareDB&lt;/a&gt;. However, we also believe Rust 
is
+superior to C/C++ due to its comparable performance, but much higher programmer
+productivity (Figure 4). Productivity in this case refers to the ease of
+optimizing and adding multithreading without introducing hard to debug memory
+safety or concurrency issues.&lt;/p&gt;
+&lt;p&gt;While Rust does allow unsafe access to memory (eliding bounds 
checking, for
+example), when required for performance, our implementation is entirely memory
+safe. The only &lt;a 
href="https://github.com/search?q=repo%3Aclflushopt%2Ftpchgen-rs%20unsafe&amp;amp;type=code"&gt;unsafe&lt;/a&gt;
 code is used to &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/blob/c651da1fc309f9cb3872cbdf71e4796904dc62c6/tpchgen/src/text.rs#L72"&gt;skip&lt;/a&gt;
 UTF8 validation on known ASCII
+strings.&lt;/p&gt;
+&lt;p&gt;&lt;img alt="Lamb Theory on Evolution of Systems Languages" 
class="img-responsive" 
src="/blog/images/fastest-tpch-generator/lamb-theory.png" 
width="80%"/&gt;&lt;/p&gt;
+&lt;p&gt;&lt;strong&gt;Figure 4&lt;/strong&gt;: Lamb Theory of System Language 
Evolution from &lt;a 
href="https://midas.bu.edu/assets/slides/andrew_lamb_slides.pdf"&gt;Boston 
University
+MiDAS Fall 2024 (Data Systems Seminar)&lt;/a&gt;, &lt;a 
href="https://www.youtube.com/watch?v=CpnxuBwHbUc"&gt;recording&lt;/a&gt;. 
Special
+thanks to &lt;a 
href="https://x.com/KurtFehlhauer"&gt;@KurtFehlhauer&lt;/a&gt;&lt;/p&gt;
+&lt;h1&gt;How: The Journey&lt;/h1&gt;
+&lt;p&gt;We did it together as a team in the open over the course of a few 
weeks.
+&lt;a href="https://github.com/lmwnshn"&gt;Wan Shen Lim&lt;/a&gt; inspired the 
project by pointing out the benefits of &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;easy TPC-H
+dataset creation&lt;/a&gt;  and &lt;a 
href="https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600"&gt;suggesting
 we check out a Java port on February 11,
+2025&lt;/a&gt;. Achraf made &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/53d3402680422a15349ece0a7ea3c3f001018ba0"&gt;first
 commit a few days later&lt;/a&gt; on February 16, and &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/9bb386a4c55b8cf93ffac1b98f29b5da990ee79e"&gt;Andrew
+and Sean started helping on March 8, 2025&lt;/a&gt; and we &lt;a 
href="https://crates.io/crates/tpchgen/0.1.0"&gt;released version 0.1&lt;/a&gt; 
on
+March 30, 2025.&lt;/p&gt;
+&lt;h2&gt;Optimizing Single Threaded Performance&lt;/h2&gt;
+&lt;p&gt;Archaf &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/16"&gt;completed the end to 
end conformance tests&lt;/a&gt;, to ensure correctness, and
+an initial &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/12"&gt;cli check 
in&lt;/a&gt; on March 15, 2025.&lt;/p&gt;
+&lt;p&gt;On a Macbook Pro M3 (Nov 2023), the initial performance numbers were 
actually
+slower than the original Java implementation which was ported 😭. This 
wasn&amp;rsquo;t
+surprising since the focus of the first version was to get a byte of byte
+compatible port, and knew about the performance shortcomings and how to 
approach
+them.&lt;/p&gt;
+&lt;table&gt;
+&lt;tr&gt;
+&lt;td&gt;&lt;strong&gt;Scale Factor&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Time&lt;/strong&gt;
+&lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;1
+   &lt;/td&gt;
+&lt;td&gt;0m10.307s
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;10
+   &lt;/td&gt;
+&lt;td&gt;1m26.530s
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;100
+   &lt;/td&gt;
+&lt;td&gt;14m56.986s
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;/table&gt;
+&lt;p&gt;&lt;strong&gt;Table 2&lt;/strong&gt;: Performance of running &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/12"&gt;the initial 
tpchgen-cli&lt;/a&gt;, measured with
+&lt;code&gt;time target/release/tpchgen-cli -s 
$SCALE_FACTOR&lt;/code&gt;&lt;/p&gt;
+&lt;p&gt;With this strong foundation we began optimizing the code using 
Rust&amp;rsquo;s low level
+memory management to improve performance while retaining memory safely. We 
spent
+several days obsessing over low level details and implemented a textbook like
+list of optimizations:&lt;/p&gt;
+&lt;ul&gt;
+&lt;li&gt;&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/19"&gt;Avoiding startup 
overhead&lt;/a&gt;,&lt;/li&gt;
+&lt;li&gt;&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/26"&gt;not&lt;/a&gt; &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/32"&gt;copying&lt;/a&gt; 
strings (many more PRs as well)&lt;/li&gt;
+&lt;li&gt;&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/27"&gt;Rust&amp;rsquo;s 
zero overhead abstractions for dates&lt;/a&gt;&lt;/li&gt;
+&lt;li&gt;&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/35"&gt;Static 
strings&lt;/a&gt; (entirely safely with static lifetimes)&lt;/li&gt;
+&lt;li&gt;&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/33"&gt;Generics to avoid 
virtual function call overhead&lt;/a&gt;&lt;/li&gt;
+&lt;li&gt;&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/62"&gt;Moving lookups from 
runtime&lt;/a&gt; to load time&lt;/li&gt;
+&lt;/ul&gt;
+&lt;p&gt;At the time of writing, single threaded performance is now 2.5x-2.7x 
faster than the initial version, as shown in Table 3.&lt;/p&gt;
+&lt;table&gt;
+&lt;tr&gt;
+&lt;td&gt;&lt;strong&gt;Scale Factor&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Time&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Times faster&lt;/strong&gt;
+&lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;1
+   &lt;/td&gt;
+&lt;td&gt;0m4.079s
+   &lt;/td&gt;
+&lt;td&gt;2.5x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;10
+   &lt;/td&gt;
+&lt;td&gt;0m31.616s
+   &lt;/td&gt;
+&lt;td&gt;2.7x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;100
+   &lt;/td&gt;
+&lt;td&gt;5m28.083s
+   &lt;/td&gt;
+&lt;td&gt;2.7x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;/table&gt;
+&lt;p&gt;&lt;strong&gt;Table 3&lt;/strong&gt;: Single threaded 
&lt;code&gt;tpchgen-cli&lt;/code&gt; performance, measured with 
&lt;code&gt;time target/release/tpchgen-cli -s $SCALE_FACTOR 
--num-threads=1&lt;/code&gt;&lt;/p&gt;
+&lt;h2&gt;Multi-threading&lt;/h2&gt;
+&lt;p&gt;Then we applied &lt;a 
href="https://doc.rust-lang.org/book/ch16-00-concurrency.html"&gt;Rust&amp;rsquo;s
 fearless concurrency&lt;/a&gt; &amp;ndash; with a single, &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/ab720a70cdc80a711f4a3dda6bac05445106f499"&gt;small
 PR&lt;/a&gt; (272
+net new lines) we updated the same memory safe code to run with multiple 
threads
+and consume bounded memory using &lt;a 
href="https://thenewstack.io/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/"&gt;tokio
 for the thread scheduler&lt;/a&gt;&lt;sup&gt;4&lt;/sup&gt;.&lt;/p&gt;
+&lt;p&gt;As shown in Table 4, with this change, tpchgen-cli generates the full 
SF=100
+dataset in 32 seconds (which is 3.3 GB/sec 🤯). Further investigation reveals
+that at SF=100 our generator is actually IO bound (which is not the case for
+&lt;code&gt;dbgen&lt;/code&gt; or &lt;code&gt;duckdb&lt;/code&gt;) &amp;ndash; 
it creates data &lt;strong&gt;faster than can be written to an 
SSD&lt;/strong&gt;.
+When writing to &lt;code&gt;/dev/null&lt;/code&gt; tpchgen  generates the 
entire dataset in 25 seconds
+(4 GB/s).&lt;/p&gt;
+&lt;table&gt;
+&lt;tr&gt;
+&lt;td&gt;&lt;strong&gt;Scale Factor&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Time&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Times faster than initial implementation&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Times faster than optimized single 
threaded&lt;/strong&gt;
+&lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;1
+   &lt;/td&gt;
+&lt;td&gt;0m1.369s
+   &lt;/td&gt;
+&lt;td&gt;7.3x
+   &lt;/td&gt;
+&lt;td&gt;3x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;10
+   &lt;/td&gt;
+&lt;td&gt;0m3.828s
+   &lt;/td&gt;
+&lt;td&gt;22.6x
+   &lt;/td&gt;
+&lt;td&gt;8.2x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;100
+   &lt;/td&gt;
+&lt;td&gt;0m32.615s
+   &lt;/td&gt;
+&lt;td&gt;27.5x
+   &lt;/td&gt;
+&lt;td&gt;10x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;100 (to /dev/null)
+   &lt;/td&gt;
+&lt;td&gt;0m25.088s
+   &lt;/td&gt;
+&lt;td&gt;35.7x
+   &lt;/td&gt;
+&lt;td&gt;13.1x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;/table&gt;
+&lt;p&gt;&lt;strong&gt;Table 4&lt;/strong&gt;: tpchgen-cli (multithreaded) 
performance measured with &lt;code&gt;time target/release/tpchgen-cli -s 
$SCALE_FACTOR&lt;/code&gt;&lt;/p&gt;
+&lt;p&gt;Using Rust and async streams, the data generator is also fully 
streaming: memory
+use does not increase with increasing data size / scale 
factors&lt;sup&gt;5&lt;/sup&gt;. The DuckDB
+generator seems to &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator"&gt;require
 far more memory&lt;/a&gt; than is commonly available on
+developer laptops and memory use increases with scale factor. With 
&lt;code&gt;tpchgen-cli&lt;/code&gt;
+it is perfectly possible to create data for SF=10000 or larger on a machine 
with
+16GB of memory (assuming sufficient storage capacity).&lt;/p&gt;
+&lt;h2&gt;Direct to parquet&lt;/h2&gt;
+&lt;p&gt;At this point, &lt;code&gt;tpchgen-cli&lt;/code&gt; could very 
quickly generate the TBL format.
+However, as described above, the TBL is annoying to work with, 
because&lt;/p&gt;
+&lt;ol&gt;
+&lt;li&gt;It has no header&lt;/li&gt;
+&lt;li&gt;It is like a CSV but the delimiter is 
&lt;code&gt;|&lt;/code&gt;&lt;/li&gt;
+&lt;li&gt;Each line ends with an extra &lt;code&gt;|&lt;/code&gt; delimiter 
before the newline 🙄&lt;/li&gt;
+&lt;li&gt;No system that we know can read them without additional 
configuration.&lt;/li&gt;
+&lt;/ol&gt;
+&lt;p&gt;We next &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/54"&gt;added support for 
CSV&lt;/a&gt; generation (special thanks &lt;a 
href="https://github.com/niebayes"&gt;@niebayes&lt;/a&gt; from
+Datalayers for finding and &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/issues/73"&gt;fixing&lt;/a&gt; 
&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/issues/65"&gt;bugs&lt;/a&gt;) 
which performs at the same
+speed as TBL. While CSV files are far more standard than TBL, they must still 
be
+parsed prior to load and automatic type inference may not deduce the types
+needed for the TPC-H benchmarks (e.g. floating point vs Decimal).&lt;/p&gt;
+&lt;p&gt;What would be far more useful is a typed, efficient columnar format 
such as
+Apache Parquet which is supported by all modern query engines. So we &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/71"&gt;made&lt;/a&gt; a
+&lt;a href="https://crates.io/crates/tpchgen-arrow"&gt;tpchgen-arrow&lt;/a&gt; 
crate to create &lt;a href="https://arrow.apache.org/"&gt;Apache 
Arrow&lt;/a&gt; arrays directly and then &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/61"&gt;a small
+300 line PR&lt;/a&gt; to feed those arrays to the &lt;a 
href="https://crates.io/crates/parquet"&gt;Rust Parquet writer&lt;/a&gt;, again 
using
+tokio for parallelized but memory bound work.&lt;/p&gt;
+&lt;p&gt;This approach was simple, fast and scalable, as shown in Table 5. 
Even though
+creating Parquet files is significantly more computationally expensive than TBL
+or CSV, tpchgen-cli creates the full SF=100 parquet format dataset in less than
+45 seconds.&lt;/p&gt;
+&lt;table&gt;
+&lt;tr&gt;
+&lt;td&gt;&lt;strong&gt;Scale Factor&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Time to generate Parquet&lt;/strong&gt;
+&lt;/td&gt;
+&lt;td&gt;&lt;strong&gt;Speed compared to tbl generation&lt;/strong&gt;
+&lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;1
+   &lt;/td&gt;
+&lt;td&gt;0m1.649s
+   &lt;/td&gt;
+&lt;td&gt;0.8x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;10
+   &lt;/td&gt;
+&lt;td&gt;0m5.643s
+   &lt;/td&gt;
+&lt;td&gt;0.7x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;100
+   &lt;/td&gt;
+&lt;td&gt;0m45.243s
+   &lt;/td&gt;
+&lt;td&gt;0.7x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;tr&gt;
+&lt;td&gt;100 (to /dev/null)
+   &lt;/td&gt;
+&lt;td&gt;0m45.153s
+   &lt;/td&gt;
+&lt;td&gt;0.5x
+   &lt;/td&gt;
+&lt;/tr&gt;
+&lt;/table&gt;
+&lt;p&gt;&lt;strong&gt;Table 5&lt;/strong&gt;: 
&lt;code&gt;tpchgen-cli&lt;/code&gt; Parquet generation performance measured 
with  &lt;code&gt;time
+target/release/tpchgen-cli -s $SCALE_FACTOR 
--format=parquet&lt;/code&gt;&lt;/p&gt;
+&lt;h1&gt;Conclusion 👊🎤&lt;/h1&gt;
+&lt;p&gt;In just a few days, with some fellow database nerds and the power of 
Rust, we built something 10x better than what currently exists. We hope it 
inspires more research
+into analytical systems using the TPC-H dataset and that people build awesome
+things with it. For example, Sean has already added &lt;a 
href="https://github.com/GlareDB/glaredb/pull/3549"&gt;on-demand generation of
+tables to GlareDB&lt;/a&gt;. Please consider joining us and helping out at
+&lt;a 
href="https://github.com/clflushopt/tpchgen-rs"&gt;https://github.com/clflushopt/tpchgen-rs&lt;/a&gt;.&lt;/p&gt;
+&lt;p&gt;We met while working together on Apache DataFusion in various 
capacities. If you
+are looking for a community of like minded people hacking on databases, we
+welcome you to &lt;a 
href="https://datafusion.apache.org/contributor-guide/communication.html"&gt;come
 join us&lt;/a&gt;. We are in the process of integrating this into
+DataFusion (see &lt;a 
href="https://github.com/apache/datafusion/issues/14608"&gt;apache/datafusion#14608&lt;/a&gt;)
 if you are interested in helping 🎣&lt;/p&gt;
+&lt;h1&gt;About the Authors:&lt;/h1&gt;
+&lt;ul&gt;
+&lt;li&gt;&lt;a href="https://www.linkedin.com/in/andrewalamb/"&gt;Andrew 
Lamb&lt;/a&gt; (&lt;a href="https://github.com/alamb"&gt;@alamb&lt;/a&gt;) is a 
Staff Engineer at &lt;a 
href="https://www.influxdata.com/"&gt;InfluxData&lt;/a&gt; and a PMC member of 
&lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; and 
&lt;a href="https://arrow.apache.org/"&gt;Apache Arrow&lt;/a&gt;.&lt;/li&gt;
+&lt;li&gt;Achraf B (&lt;a 
href="https://github.com/clflushopt"&gt;@clflushopt&lt;/a&gt;) is a Software 
Engineer at &lt;a href="https://optable.co/"&gt;Optable&lt;/a&gt; where he 
works on data infrastructure.&lt;/li&gt;
+&lt;li&gt;&lt;a href="https://www.linkedin.com/in/scsmithr/"&gt;Sean 
Smith&lt;/a&gt; (&lt;a 
href="https://github.com/scsmithr"&gt;@scsmithr&lt;/a&gt;) is the founder of 
&lt;a href="https://glaredb.com/"&gt;GlareDB&lt;/a&gt; focused on building a 
fast analytics database.&lt;/li&gt;
+&lt;/ul&gt;
+&lt;!-- Footnotes themselves at the bottom. --&gt;
+&lt;h2&gt;Footnotes&lt;/h2&gt;
+&lt;p&gt;&lt;em&gt;1&lt;/em&gt;: Actual Time: 
&lt;code&gt;30:35&lt;/code&gt;&lt;/p&gt;
+&lt;p&gt;&lt;em&gt;2&lt;/em&gt;: It is possible to embed the dbgen code, which 
appears to be the approach taken by DuckDB. This approach was tried in GlareDB 
(&lt;a 
href="https://github.com/GlareDB/glaredb/pull/3313"&gt;GlareDB/glaredb#3313&lt;/a&gt;),
 but ultimately shelved given the amount of effort needed to adapt and isolate 
the dbgen code.&lt;/p&gt;
+&lt;p&gt;&lt;em&gt;3&lt;/em&gt;: It is pretty amazing to imagine the machine 
required to generate SF300 that had 1.8TB (!!) of RAM&lt;/p&gt;
+&lt;p&gt;&lt;em&gt;4&lt;/em&gt;: We tried to &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/34"&gt;use Rayon (see 
discussion here)&lt;/a&gt;, but could not easily keep memory bounded.&lt;/p&gt;
+&lt;p&gt;&lt;em&gt;5&lt;/em&gt;: &lt;code&gt;tpchgen-cli&lt;/code&gt; memory 
usage is a function of the number of threads:  each thread needs some buffer 
space&lt;/p&gt;</content><category 
term="blog"></category></entry><entry><title>Apache DataFusion Python 46.0.0 
Released</title><link 
href="https://datafusion.apache.org/blog/2025/03/30/datafusion-python-46.0.0"; 
rel="alternate"></link><published>2025-03-30T00:00:00+00:00</published><updated>2025-03-30T00:00:00+00:00</updated><author><na
 [...]
 {% comment %}
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
diff --git a/output/images/fastest-tpch-generator/lamb-theory.png 
b/output/images/fastest-tpch-generator/lamb-theory.png
new file mode 100644
index 0000000..2551ffa
Binary files /dev/null and 
b/output/images/fastest-tpch-generator/lamb-theory.png differ
diff --git a/output/images/fastest-tpch-generator/parquet-performance.png 
b/output/images/fastest-tpch-generator/parquet-performance.png
new file mode 100644
index 0000000..462c995
Binary files /dev/null and 
b/output/images/fastest-tpch-generator/parquet-performance.png differ
diff --git a/output/images/fastest-tpch-generator/tbl-performance.png 
b/output/images/fastest-tpch-generator/tbl-performance.png
new file mode 100644
index 0000000..2e64f11
Binary files /dev/null and 
b/output/images/fastest-tpch-generator/tbl-performance.png differ
diff --git a/output/index.html b/output/index.html
index c10e06d..6d3d880 100644
--- a/output/index.html
+++ b/output/index.html
@@ -44,6 +44,53 @@
             <p><i>Here you can find the latest updates from DataFusion and 
related projects.</i></p>
 
 
+    <!-- Post -->
+    <div class="row">
+        <div class="callout">
+            <article class="post">
+                <header>
+                    <div class="title">
+                        <h1><a 
href="/blog/2025/04/10/fastest-tpch-generator">tpchgen-rs World’s fastest open 
source TPC-H data generator, written in Rust</a></h1>
+                        <p>Posted on: Thu 10 April 2025 by Andrew Lamb, Achraf 
B, and Sean Smith</p>
+                        <p><!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+-->
+<style>
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+</style>
+<p><strong>TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+ with 
dbgen</strong>.</p>
+<p>3 members of the <a href="https://datafusion.apache.org/";>Apache 
DataFusion</a> community used Rust and open source
+development to build <a 
href="https://github.com/clflushopt/tpchgen-rs";>tpchgen-rs</a>, a fully open 
TPC-H data generator over …</p></p>
+                        <footer>
+                            <ul class="actions">
+                                <div style="text-align: right"><a 
href="/blog/2025/04/10/fastest-tpch-generator" class="button medium">Continue 
Reading</a></div>
+                            </ul>
+                            <ul class="stats">
+                            </ul>
+                        </footer>
+            </article>
+        </div>
+    </div>
     <!-- Post -->
     <div class="row">
         <div class="callout">


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

(datafusion-site) branch asf-site updated: Commit build products

Reply via email to