This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a commit to branch asf-staging
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git


The following commit(s) were added to refs/heads/asf-staging by this push:
     new ec12907  Commit build products
ec12907 is described below

commit ec12907893ec5a1a5e59423350c68f3b39449d93
Author: Build Pelican (action) <priv...@infra.apache.org>
AuthorDate: Fri Apr 4 20:29:30 2025 +0000

    Commit build products
---
 blog/2025/04/10/fastest-tpch-generator/index.html  | 181 +++++++++++----------
 .../andrew-lamb-achraf-b-and-sean-smith.html       |   2 +-
 blog/category/blog.html                            |   2 +-
 blog/feed.xml                                      |   2 +-
 blog/feeds/all-en.atom.xml                         | 179 ++++++++++----------
 .../andrew-lamb-achraf-b-and-sean-smith.atom.xml   | 179 ++++++++++----------
 .../andrew-lamb-achraf-b-and-sean-smith.rss.xml    |   2 +-
 blog/feeds/blog.atom.xml                           | 179 ++++++++++----------
 blog/index.html                                    |   2 +-
 9 files changed, 394 insertions(+), 334 deletions(-)

diff --git a/blog/2025/04/10/fastest-tpch-generator/index.html 
b/blog/2025/04/10/fastest-tpch-generator/index.html
index c25b3fa..9d6f2aa 100644
--- a/blog/2025/04/10/fastest-tpch-generator/index.html
+++ b/blog/2025/04/10/fastest-tpch-generator/index.html
@@ -4,7 +4,7 @@
     <meta charset="utf-8">
     <meta http-equiv="x-ua-compatible" content="ie=edge">
     <meta name="viewport" content="width=device-width, initial-scale=1.0">
-    <title>tpchgen-rs: World’s fastest open source TPCH data generator, 
written in Rust - Apache DataFusion Blog</title>
+    <title>`tpchgen-rs` World’s fastest open source TPCH data generator, 
written in Rust - Apache DataFusion Blog</title>
 <link href="/blog/css/bootstrap.min.css" rel="stylesheet">
 <link href="/blog/css/fontawesome.all.min.css" rel="stylesheet">
 <link href="/blog/css/headerlink.css" rel="stylesheet">
@@ -40,7 +40,7 @@
     <div class="bg-white p-5 rounded">
         <div class="col-sm-8 mx-auto">
           <h1>
-              tpchgen-rs: World’s fastest open source TPCH data generator, 
written in Rust
+              `tpchgen-rs` World’s fastest open source TPCH data generator, 
written in Rust
           </h1>
               <p>Posted on: Thu 10 April 2025 by Andrew Lamb, Achraf B, and 
Sean Smith</p>
               <!--
@@ -72,7 +72,7 @@ th, td {
 <p>We used Rust and open source development to build <a 
href="https://github.com/alamb/tpchgen-rs";>tpchgen-rs</a>, a fully open
 TPCH data generator over 10x faster than any other implementation  we know 
of.</p>
 <p>About the Authors:
-- <a href="https://www.linkedin.com/in/andrewalamb/";>Andrew Lamb</a> (<a 
href="https://github.com/alamb";>@alamb</a>) is a Staff Engineer at <a 
href="https://www.influxdata.com/";>InfluxData</a>) and a PMC member of <a 
href="https://datafusion.apache.org/";>Apache DataFusion</a> and <a 
href="https://arrow.apache.org/";>Apache Arrow</a>.
+- <a href="https://www.linkedin.com/in/andrewalamb/";>Andrew Lamb</a> (<a 
href="https://github.com/alamb";>@alamb</a>) is a Staff Engineer at <a 
href="https://www.influxdata.com/";>InfluxData</a> and a PMC member of <a 
href="https://datafusion.apache.org/";>Apache DataFusion</a> and <a 
href="https://arrow.apache.org/";>Apache Arrow</a>.
 - Achraf B (<a href="https://github.com/clflushopt";>@clflushopt</a>) is a 
Software Engineer at <a href="https://optable.co/";>Optable</a> where he works 
on data infrastructure.
 - <a href="https://www.linkedin.com/in/scsmithr/";>Sean Smith</a> (<a 
href="https://github.com/scsmithr";>@scsmithr</a>) is the founder of  focused on 
building a fast analytics database.</p>
 <p>It is now possible to create the TPCH SF=100 dataset in 72.23 seconds (1.4 
GB/s
@@ -85,16 +85,14 @@ analytical engines such as DataFusion.</p>
 <p><strong>Figure 1</strong>: Time to create TPCH dataset for Scale Factor 
(see below) 1, 10,
 100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core 
GCP
 VM. For Scale Factor(SF) 100 <code>tpchgen</code> takes 1 minute and 14 
seconds and
-<a href="https://duckdb.org/";>DuckDB</a> takes 17 minutes and 48 seconds. For 
SF=1000,
-<code>tpchgen</code> takes 10 minutes and 26 and uses about 5 GB of RAM at 
peak, and we
-could not measure DuckDB&rsquo;s time as it <a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator";>requires
 647 GB of
-RAM</a>,
-more than the 88 GB that was available on our test machine. The testing
-methodology is in the
+<a href="https://duckdb.org";>DuckDB</a> takes 17 minutes and 48 seconds. For 
SF=1000, <code>tpchgen</code> takes 10
+minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure
+DuckDB&rsquo;s time as it <a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator";>requires
 647 GB of RAM</a>, more than the 88 GB that was
+available on our test machine. The testing methodology is in the
 <a 
href="https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md";>documentation</a>.</p>
 <p>This blog explains what TPCH is, how we ported the vintage C data generator 
to
 Rust (yes, <a 
href="https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/";>RWIR</a>)
 and optimized its performance over the course of a few weeks
-of part time work. We began this project so we can easily generate TPCH data in
+of part-time work. We began this project so we can easily generate TPCH data in
 <a href="https://datafusion.apache.org/";>Apache DataFusion</a> and <a 
href="https://glaredb.com/";>GlareDB</a>.</p>
 <h1>Try if for yourself</h1>
 <p>The tool is entirely open source under the <a 
href="https://www.apache.org/licenses/LICENSE-2.0";>Apache 2.0 license</a>. 
Visit the <a href="https://github.com/clflushopt/tpchgen-rs";>tpchgen-rs 
repository</a> or try it for yourself by run the following commands after <a 
href="https://www.rust-lang.org/tools/install";>installing Rust</a>:</p>
@@ -107,33 +105,31 @@ $ tpchgen-cli -s <span class="m">1</span>
 $ tpchgen-cli -s <span class="m">10</span> --format<span 
class="o">=</span>parquet
 </code></pre></div>
 <h1>What is TPCH / dbgen?</h1>
-<p>The popular <a href="https://www.tpc.org/tpch/";>TPC-H</a> benchmark 
(commonly referred to
-as TPCH) helps evaluate the performance of database systems on
-<a href="https://en.wikipedia.org/wiki/Online_analytical_processing";>OLAP</a> 
queries<em>,
-</em>the kind used to build BI dashboards.</p>
+<p>The popular <a href="https://www.tpc.org/tpch/";>TPC-H</a> benchmark 
(commonly referred to as TPCH) helps evaluate the
+performance of database systems on <a 
href="https://en.wikipedia.org/wiki/Online_analytical_processing";>OLAP</a> 
queries<em>, </em>the kind used to build BI
+dashboards.</p>
 <p>TPCH has become a de facto standard for analytic systems. While there are 
<a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf";>well
-known</a> limitations as the data
-and queries do not well represent many real world use cases, the majority of
-analytic database papers and industrial systems still use TPCH query 
performance
-benchmarks as a baseline. You will inevitably find multiple results for  
&ldquo;<code>TPCH
-Performance &amp;lt;your favorite database&gt;</code>&rdquo; in any search 
engine.</p>
+known</a> limitations as the data and queries do not well represent many real 
world
+use cases, the majority of analytic database papers and industrial systems 
still
+use TPCH query performance benchmarks as a baseline. You will inevitably find
+multiple results for  &ldquo;<code>TPCH Performance &amp;lt;your favorite 
database&gt;</code>&rdquo; in any
+search engine.</p>
 <p>The benchmark was created at a time when access to high performance 
analytical
-systems was not widespread, so the <a href="https://www.tpc.org/";>Transaction 
Processing Performance
-Council</a> defined a process of formal result verification.
-More recently, given the broad availability of free and open source database
-systems, it is common for users to run and verify TPCH performance 
themselves.</p>
+systems was not widespread, so the <a href="https://www.tpc.org/";>Transaction 
Processing Performance Council</a>
+defined a process of formal result verification. More recently, given the broad
+availability of free and open source database systems, it is common for users 
to
+run and verify TPCH performance themselves.</p>
 <p>TPCH simulates a business environment with eight tables: 
<code>REGION</code>, <code>NATION</code>,
 <code>SUPPLIER</code>, <code>CUSTOMER</code>, <code>PART</code>, 
<code>PARTSUPP</code>, <code>ORDERS</code>, and <code>LINEITEM</code>. These
 tables are linked by foreign keys in a normalized schema representing a supply
 chain with parts, suppliers, customers and orders. The benchmark itself is 22
 SQL queries containing joins, aggregations, and sorting operations.</p>
-<p>The queries run against data created with
-<code><a href="https://github.com/electrum/tpch-dbgen";>dbgen</a></code>, a 
program written
-in a pre <a href="https://en.wikipedia.org/wiki/C99";>C-99</a> dialect, which 
generates data
-in a format called <em>TBL</em> (example in Figure 2). <code>dbgen</code> 
creates data for each of
-the 8 tables for a certain <em>Scale Factor</em>, commonly abbreviated as SF. 
Example
-Scale Factors and corresponding dataset sizes are shown in Table 1. There is no
-theoretical upper bound on the Scale Factor.</p>
+<p>The queries run against data created with <code><a 
href="https://github.com/electrum/tpch-dbgen)">dbgen</a></code>, a program
+written in a pre <a href="https://en.wikipedia.org/wiki/C99";>C-99</a> dialect, 
which generates data in a format called <em>TBL</em>
+(example in Figure 2). <code>dbgen</code> creates data for each of the 8 
tables for a
+certain <em>Scale Factor</em>, commonly abbreviated as SF. Example Scale 
Factors and
+corresponding dataset sizes are shown in Table 1. There is no theoretical upper
+bound on the Scale Factor.</p>
 <div 
class="codehilite"><pre><span></span><code>103|2844|845|3|23|40177.32|0.01|0.04|N|O|1996-09-11|1996-09-18|1996-09-26|NONE|FOB|ironic
 accou|
 
229|10540|801|6|29|42065.66|0.04|0.00|R|F|1994-01-14|1994-02-16|1994-01-22|NONE|FOB|uriously
 pending |
 
263|2396|649|1|22|28564.58|0.06|0.08|R|F|1994-08-24|1994-06-20|1994-09-09|NONE|FOB|efully
 express fo|
@@ -193,7 +189,9 @@ theoretical upper bound on the Scale Factor.</p>
 </table>
 <p><strong>Table 1</strong>: TPCH data set sizes at different scale factors 
for both TBL and <a href="https://parquet.apache.org/";>Apache Parquet</a>.</p>
 <h1>Why do we need a new TPCH Data generator?</h1>
-<p>Despite the known limitations of the TPCH benchmark, it is so well known 
that it is used frequently in database performance analysis. To run TPCH, you 
must first load the data, using <code>dbgen</code>, which is not ideal for 
several reasons:</p>
+<p>Despite the known limitations of the TPCH benchmark, it is so well known 
that it
+is used frequently in database performance analysis. To run TPCH, you must 
first
+load the data, using <code>dbgen</code>, which is not ideal for several 
reasons:</p>
 <ol>
 <li>You must find and compile a copy of the 15+ year old C program (for 
example <a 
href="https://github.com/electrum/tpch-dbgen";>electrum/tpch-dbgen</a>)</li>
 <li><code>dbgen</code> requires substantial time (Figure 3) and is not able to 
use more than one core.</li>
@@ -206,53 +204,48 @@ shown in blue. <code>tpchgen</code> restricted to a 
single core is shown in red.
 <code>dbgen</code> is shown in green and <code>dbgen</code> modified to use 
<code>-O3</code> optimization level
 is shown in yellow.</p>
 <p><code>dbgen</code> is so inconvenient and takes so long that vendors often 
provide
-pre-loaded TPCH data, for example <a 
href="https://docs.snowflake.com/en/user-guide/sample-data-tpch";>Snowflake 
Sample
-Data</a>, <a 
href="https://docs.databricks.com/aws/en/discover/databricks-datasets";>DataBricks
-Sample
-datasets</a> and
-<a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets";>DuckDB
 Pre-Generated Data
-Sets</a>.</p>
-<p>In addition to pre-generated datasets, DuckDB also provides a
-<a href="https://duckdb.org/docs/stable/extensions/tpch.html";>TPCH</a> 
extension for
-generating TPCH datasets within DuckDB. This is so much easier to use than the
-current alternatives that it leads many researchers and other thought leaders 
to
-use DuckDB to evaluate new ideas. For example, <a 
href="https://github.com/lmwnshn";>Wan Shen
-Lim</a> explicitly <a 
href="https://github.com/apache/datafusion/issues/14373";>mentioned the ease of 
creating the
-TPCH dataset</a> as one reason
-the first student project of <a 
href="https://15799.courses.cs.cmu.edu/spring2025/";>CMU-799 Spring
-2025</a> used DuckDB.</p>
-<p>As beneficial as the DuckDB TPCH extension is, it is non ideal for several 
reasons:</p>
+pre-loaded TPCH data, for example [Snowflake Sample Data], <a 
href="https://docs.databricks.com/aws/en/discover/databricks-datasets";>DataBricks
 Sample
+datasets</a> and <a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets";>DuckDB
 Pre-Generated Data Sets</a>.</p>
+<p>In addition to pre-generated datasets, DuckDB also provides a <a 
href="https://duckdb.org/docs/stable/extensions/tpch.html";>TPCH extension</a> 
+for generating TPCH datasets within DuckDB. This is so much easier to use than
+the current alternatives that it leads many researchers and other thought
+leaders to use DuckDB to evaluate new ideas. For example, <a 
href="https://github.com/lmwnshn";>Wan Shen
+Lim</a>explicitly <a 
href="https://github.com/apache/datafusion/issues/14373";>mentioned the ease of 
creating the TPCH dataset</a> as one reason
+the first student project of <a 
href="https://15799.courses.cs.cmu.edu/spring2025/";>CMU-799 Spring 2025</a> 
used DuckDB.</p>
+<p>As beneficial as the DuckDB TPCH extension is, it is non-ideal for several 
reasons:</p>
 <ol>
 <li>Creates data in a proprietary format, which requires export to use in 
other systems.</li>
 <li>Requires significant time (e.g. 17 minutes for Scale Factor 10).</li>
 <li>Requires unnecessarily large amounts of memory (e.g. 71 GB for Scale 
Factor 10)</li>
 </ol>
-<p>The above limitations makes it impractical to generate Scale Factor 100 and 
above on laptops or standard workstations, though DuckDB offers<a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets";>
 pre-computed files</a> for larger factors[^3].</p>
+<p>The above limitations makes it impractical to generate Scale Factor 100 and
+above on laptops or standard workstations, though DuckDB offers <a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets";>pre-computed
+files</a> for larger factors[^3].</p>
 <h1>Why Rust?</h1>
-<p>Realistically we used Rust because we wanted to integrate the data 
generator into <a href="https://datafusion.apache.org/";>Apache DataFusion</a> 
and <a href="https://glaredb.com/";>GlareDB</a>. However, we also believe Rust 
is superior to C/C++ due to its comparable performance, but much higher 
programmer productivity (Figure 4). Productivity in this case refers to the 
ease of optimizing and adding multithreading without introducing hard to debug 
memory safety or concurrency issues.</p>
-<p>While Rust does allow unsafe access to memory (eliding bounds checking, for 
example), when required for performance, our implementation is entirely memory 
safe. The only <a 
href="https://github.com/search?q=repo%3Aclflushopt%2Ftpchgen-rs%20unsafe&amp;type=code";>unsafe</a>
 code is used to <a 
href="https://github.com/clflushopt/tpchgen-rs/blob/c651da1fc309f9cb3872cbdf71e4796904dc62c6/tpchgen/src/text.rs#L72";>skip</a>
 UTF8 validation on known ASCII strings.</p>
+<p>Realistically we used Rust because we wanted to integrate the data generator
+into <a href="https://datafusion.apache.org/";>Apache DataFusion</a> and <a 
href="https://glaredb.com/";>GlareDB</a>. However, we also believe Rust is
+superior to C/C++ due to its comparable performance, but much higher programmer
+productivity (Figure 4). Productivity in this case refers to the ease of
+optimizing and adding multithreading without introducing hard to debug memory
+safety or concurrency issues.</p>
+<p>While Rust does allow unsafe access to memory (eliding bounds checking, for
+example), when required for performance, our implementation is entirely memory
+safe. The only <a 
href="https://github.com/search?q=repo%3Aclflushopt%2Ftpchgen-rs%20unsafe&amp;type=code";>unsafe</a>
 code is used to <a 
href="https://github.com/clflushopt/tpchgen-rs/blob/c651da1fc309f9cb3872cbdf71e4796904dc62c6/tpchgen/src/text.rs#L72";>skip</a>
 UTF8 validation on known ASCII
+strings.</p>
 <p><img alt="Lamb Theory on Evolution of Systems Languages" 
class="img-responsive" 
src="/blog/images/fastest-tpch-generator/lamb-theory.png" width="80%"/></p>
 <p><strong>Figure 4</strong>: Lamb Theory of System Language Evolution from <a 
href="https://midas.bu.edu/assets/slides/andrew_lamb_slides.pdf";>Boston 
University
 MiDAS Fall 2024 (Data Systems Seminar)</a>, <a 
href="https://www.youtube.com/watch?v=CpnxuBwHbUc";>recording</a>. Special
 thanks to <a href="https://x.com/KurtFehlhauer";>@KurtFehlhauer</a></p>
 <h1>How: The Journey</h1>
 <p>We did it together as a team in the open over the course of a few weeks 
weeks.
-<a href="https://github.com/lmwnshn";>Wan Shen Lim</a> inspired the project by 
pointing out
-the benefits of <a 
href="https://github.com/apache/datafusion/issues/14373";>easy TPCH dataset
-creation</a>  and <a 
href="https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600";>suggesting
 we
-check out a Java port on February 11,
-2025</a>.
-Achraf made <a 
href="https://github.com/clflushopt/tpchgen-rs/commit/53d3402680422a15349ece0a7ea3c3f001018ba0";>first
 commit a few days
-later</a>
-on February 16, and <a 
href="https://github.com/clflushopt/tpchgen-rs/commit/9bb386a4c55b8cf93ffac1b98f29b5da990ee79e";>Andrew
 and Sean started helping on March 8,
-2025</a>
-and we <a href="https://crates.io/crates/tpchgen/0.1.0";>released version 
0.1</a> on March
-30, 2025.</p>
+<a href="https://github.com/lmwnshn";>Wan Shen Lim</a> inspired the project by 
pointing out the benefits of <a 
href="https://github.com/apache/datafusion/issues/14373";>easy TPCH
+dataset creation</a>  and <a 
href="https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600";>suggesting
 we check out a Java port on February 11,
+2025</a>. Achraf made <a 
href="https://github.com/clflushopt/tpchgen-rs/commit/53d3402680422a15349ece0a7ea3c3f001018ba0";>first
 commit a few days later</a> on February 16, and <a 
href="https://github.com/clflushopt/tpchgen-rs/commit/9bb386a4c55b8cf93ffac1b98f29b5da990ee79e";>Andrew
+and Sean started helping on March 8, 2025</a> and we <a 
href="https://crates.io/crates/tpchgen/0.1.0";>released version 0.1</a> on
+March 30, 2025.</p>
 <h2>Optimizing Single Threaded Performance</h2>
-<p>Archaf <a href="https://github.com/clflushopt/tpchgen-rs/pull/16";>completed 
the end to end conformance
-tests</a>, to ensure correctness,
-and an initial <a href="https://github.com/clflushopt/tpchgen-rs/pull/12";>cli 
check in</a>
-on March 15, 2025.</p>
+<p>Archaf <a href="https://github.com/clflushopt/tpchgen-rs/pull/16";>completed 
the end to end conformance tests</a>, to ensure correctness, and
+an initial <a href="https://github.com/clflushopt/tpchgen-rs/pull/12";>cli 
check in</a> on March 15, 2025.</p>
 <p>On a Macbook Pro M3 (Nov 2023), the initial performance numbers were 
actually
 slower than the original Java implementation which was ported 😭. This 
wasn&rsquo;t
 surprising since the focus of the first version was to get a byte of byte
@@ -284,8 +277,7 @@ approach them.</p>
    </td>
 </tr>
 </table>
-<p><strong>Table 2</strong>: Performance of running <a 
href="https://github.com/clflushopt/tpchgen-rs/pull/12";>the initial
-tpchgen-cli</a>, measured with
+<p><strong>Table 2</strong>: Performance of running <a 
href="https://github.com/clflushopt/tpchgen-rs/pull/12";>the initial 
tpchgen-cli</a>, measured with
 <code>time target/release/tpchgen-cli -s $SCALE_FACTOR</code></p>
 <p>With this strong foundation we began optimizing the code using Rust&rsquo;s 
low level
 memory management to improve performance while retaining memory safely. We 
spent
@@ -334,15 +326,11 @@ list of optimizations:</p>
    </td>
 </tr>
 </table>
-<p><strong>Table 3</strong>: Current single threaded  tpchgen-cli, measured 
with <code>time target/release/tpchgen-cli -s $SCALE_FACTOR 
--num-threads=1</code></p>
+<p><strong>Table 3</strong>: Single threaded <code>tpchgen-cli</code> 
performance, measured with <code>time target/release/tpchgen-cli -s 
$SCALE_FACTOR --num-threads=1</code></p>
 <h2>Multi-threading</h2>
-<p>Then we applied <a 
href="https://doc.rust-lang.org/book/ch16-00-concurrency.html";>Rust&rsquo;s 
fearless
-concurrency</a> &ndash; with a
-single, <a 
href="https://github.com/clflushopt/tpchgen-rs/commit/ab720a70cdc80a711f4a3dda6bac05445106f499";>small
 PR (272 net new lines)
-</a>we
-updated the same memory safe code to run with multiple threads and consume
-bounded memory using <a 
href="https://thenewstack.io/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/";>tokio
 for the thread
-scheduler</a>[^4].</p>
+<p>Then we applied <a 
href="https://doc.rust-lang.org/book/ch16-00-concurrency.html";>Rust&rsquo;s 
fearless concurrency</a> &ndash; with a single, <a 
href="https://github.com/clflushopt/tpchgen-rs/commit/ab720a70cdc80a711f4a3dda6bac05445106f499";>small
 PR</a> (272
+net new lines) we updated the same memory safe code to run with multiple 
threads
+and consume bounded memory using <a 
href="https://thenewstack.io/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/";>tokio
 for the thread scheduler</a>)[^4].</p>
 <p>As shown in Table 4, with this change, tpchgen-cli generates the full SF=100
 dataset in 32 seconds (which is 3.3 GB/sec 🤯). Further investigation reveals
 that at SF=100 our generator is actually IO bound (which is not the case for
@@ -402,18 +390,36 @@ When writing to <code>/dev/null</code> tpchgen  generates 
the entire dataset in
 </tr>
 </table>
 <p><strong>Table 4</strong>: tpchgen-cli (multithreaded) performance measured 
with <code>time target/release/tpchgen-cli -s $SCALE_FACTOR</code></p>
-<p>Using Rust and async streams, the data generator is also fully streaming: 
memory use does not increase with increasing data size / scale factors[^5]. The 
DuckDB generator seems to <a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator";>require
 far more memory</a> than is commonly available on developer laptops and memory 
use increases with scale factor. With <code>tpchgen-cli</code> it is perfectly 
possible to create data for SF=10000 or larg [...]
+<p>Using Rust and async streams, the data generator is also fully streaming: 
memory
+use does not increase with increasing data size / scale factors[^5]. The DuckDB
+generator seems to <a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator";>require
 far more memory</a> than is commonly available on
+developer laptops and memory use increases with scale factor. With 
<code>tpchgen-cli</code>
+it is perfectly possible to create data for SF=10000 or larger on a machine 
with
+16GB of memory (assuming sufficient storage capacity).</p>
 <h2>Direct to parquet</h2>
-<p>At this point, tpchgen-cli could very quickly generate the TBL format. 
However, as described above, the TBL is annoying to work with, because</p>
+<p>At this point, <code>tpchgen-cli</code> could very quickly generate the TBL 
format.
+However, as described above, the TBL is annoying to work with, because</p>
 <ol>
 <li>It has no header</li>
 <li>It is like a CSV but the delimiter is &lsquo;|`</li>
 <li>Each line ends with an extra <code>|</code> delimiter before the newline 
🙄</li>
 <li>No system that we know can read them without additional configuration.</li>
 </ol>
-<p>We next <a href="https://github.com/clflushopt/tpchgen-rs/pull/54";>added 
support for CSV</a> generation (special thanks <a 
href="https://github.com/niebayes";>niebayes</a> from Datalayers for finding and 
<a href="https://github.com/clflushopt/tpchgen-rs/pull/66";>fixing</a> <a 
href="https://github.com/clflushopt/tpchgen-rs/issues/73";>several</a> <a 
href="https://github.com/clflushopt/tpchgen-rs/issues/65";>bugs</a>) which 
performs at the same speed as TBL. While CSV files are far more st [...]
-<p>What would be far more useful is a typed, efficient columnar format such as 
Apache Parquet which is supported by all modern query engines. So we <a 
href="https://github.com/clflushopt/tpchgen-rs/pull/71";>made</a> a <a 
href="https://crates.io/crates/tpchgen-arrow";>tpchgen-arrow</a> crate to 
create<a href="https://arrow.apache.org/";> Apache Arrow</a> arrays directly and 
then <a href="https://github.com/clflushopt/tpchgen-rs/pull/61";>a small 300 
line PR</a> to feed those arrays to the <a [...]
-<p>This approach was simple, fast and scalable, as shown in Table 5. Even 
though creating Parquet files is significantly more computationally expensive 
than TBL or CSV, tpchgen-cli creates the full SF=100 parquet format dataset in 
less than 45 seconds.</p>
+<p>We next <a href="https://github.com/clflushopt/tpchgen-rs/pull/54";>added 
support for CSV</a> generation (special thanks <a 
href="https://github.com/niebayes";>@niebayes</a> from
+Datalayers for finding and <a 
href="https://github.com/clflushopt/tpchgen-rs/issues/73";>fixing</a> <a 
href="https://github.com/clflushopt/tpchgen-rs/issues/65";>bugs</a>) which 
performs at the same
+speed as TBL. While CSV files are far more standard than TBL, they must still 
be
+parsed prior to load and automatic type inference may not deduce the types
+needed for the TPCH benchmarks (e.g. floating point vs Decimal).</p>
+<p>What would be far more useful is a typed, efficient columnar format such as
+Apache Parquet which is supported by all modern query engines. So we <a 
href="https://github.com/clflushopt/tpchgen-rs/pull/71";>made</a> a
+<a href="https://crates.io/crates/tpchgen-arrow";>tpchgen-arrow</a> crate to 
create <a href="https://arrow.apache.org/";>Apache Arrow</a> arrays directly and 
then <a href="https://github.com/clflushopt/tpchgen-rs/pull/61";>a small
+300 line PR</a> to feed those arrays to the <a 
href="https://crates.io/crates/parquet";>Rust Parquet writer</a>, again using
+tokio for parallelized but memory bound work.</p>
+<p>This approach was simple, fast and scalable, as shown in Table 5. Even 
though
+creating Parquet files is significantly more computationally expensive than TBL
+or CSV, tpchgen-cli creates the full SF=100 parquet format dataset in less than
+45 seconds.</p>
+<p><a href="https://arrow.apache.org/";>Apache 
Arrow</a>(https://arrow.apache.org/</p>
 <table>
 <tr>
 <td>Scale Factor
@@ -456,10 +462,19 @@ When writing to <code>/dev/null</code> tpchgen  generates 
the entire dataset in
    </td>
 </tr>
 </table>
-<p><strong>Table 5</strong>: tpchgen-cli parquet generation performance 
measured with  <code>time target/release/tpchgen-cli -s $SCALE_FACTOR 
--format=parquet</code></p>
+<p><strong>Table 5</strong>: <code>tpchgen-cli</code> Parquet generation 
performance measured with  <code>time
+target/release/tpchgen-cli -s $SCALE_FACTOR --format=parquet</code></p>
 <h1>Conclusion 👊🎤</h1>
-<p>With a few days, some fellow database nerds, and the power of Rust we made 
something 10x better than currently exists. We hope it inspires more research 
into analytical systems using the TPCH dataset and that people build awesome 
things with it. For example, Sean has already added <a 
href="https://github.com/GlareDB/glaredb/pull/3549";>on-demand generation of 
tables to GlareDB</a>. Please consider joining us and helping out at <a 
href="https://github.com/clflushopt/tpchgen-rs";>https:// [...]
-<p>We met while working together on Apache DataFusion in various capacities. 
If you are looking for a community of like minded people hacking on databases, 
we welcome you to <a 
href="https://datafusion.apache.org/contributor-guide/communication.html";>come 
join us</a>. We are in the process of integrating this into DataFusion (see <a 
href="https://github.com/apache/datafusion/issues/14608";>apache/datafusion#14608</a>)
 if you are interested in helping 🎣</p>
+<p>With a few days, some fellow database nerds, and the power of Rust we made
+something 10x better than currently exists. We hope it inspires more research
+into analytical systems using the TPCH dataset and that people build awesome
+things with it. For example, Sean has already added <a 
href="https://github.com/GlareDB/glaredb/pull/3549";>on-demand generation of
+tables to GlareDB</a>. Please consider joining us and helping out at
+<a 
href="https://github.com/clflushopt/tpchgen-rs";>https://github.com/clflushopt/tpchgen-rs</a>.</p>
+<p>We met while working together on Apache DataFusion in various capacities. 
If you
+are looking for a community of like minded people hacking on databases, we
+welcome you to <a 
href="https://datafusion.apache.org/contributor-guide/communication.html";>come 
join us</a>. We are in the process of integrating this into
+DataFusion (see <a 
href="https://github.com/apache/datafusion/issues/14608";>apache/datafusion#14608</a>)
 if you are interested in helping 🎣</p>
 <!-- Footnotes themselves at the bottom. -->
 <h2>Notes</h2>
 <p>[^2]: It is possible to embed the dbgen code, which appears to be the 
approach taken by DuckDB. This approach was tried in GlareDB (<a 
href="https://github.com/GlareDB/glaredb/pull/3313";>GlareDB/glaredb#3313</a>), 
but ultimately shelved given the amount of effort needed to adapt and isolate 
the dbgen code.</p>
diff --git a/blog/author/andrew-lamb-achraf-b-and-sean-smith.html 
b/blog/author/andrew-lamb-achraf-b-and-sean-smith.html
index fda91e3..470f3f1 100644
--- a/blog/author/andrew-lamb-achraf-b-and-sean-smith.html
+++ b/blog/author/andrew-lamb-achraf-b-and-sean-smith.html
@@ -53,7 +53,7 @@
             <article class="post">
                 <header>
                     <div class="title">
-                        <h1><a 
href="/blog/2025/04/10/fastest-tpch-generator">tpchgen-rs: World’s fastest open 
source TPCH data generator, written in Rust</a></h1>
+                        <h1><a 
href="/blog/2025/04/10/fastest-tpch-generator">`tpchgen-rs` World’s fastest 
open source TPCH data generator, written in Rust</a></h1>
                         <p>Posted on: Thu 10 April 2025 by Andrew Lamb, Achraf 
B, and Sean Smith</p>
                         <p><!--
 {% comment %}
diff --git a/blog/category/blog.html b/blog/category/blog.html
index 42da471..d940f4d 100644
--- a/blog/category/blog.html
+++ b/blog/category/blog.html
@@ -53,7 +53,7 @@
             <article class="post">
                 <header>
                     <div class="title">
-                        <h1><a 
href="/blog/2025/04/10/fastest-tpch-generator">tpchgen-rs: World’s fastest open 
source TPCH data generator, written in Rust</a></h1>
+                        <h1><a 
href="/blog/2025/04/10/fastest-tpch-generator">`tpchgen-rs` World’s fastest 
open source TPCH data generator, written in Rust</a></h1>
                         <p>Posted on: Thu 10 April 2025 by Andrew Lamb, Achraf 
B, and Sean Smith</p>
                         <p><!--
 {% comment %}
diff --git a/blog/feed.xml b/blog/feed.xml
index 2548b71..0818ba8 100644
--- a/blog/feed.xml
+++ b/blog/feed.xml
@@ -1,5 +1,5 @@
 <?xml version="1.0" encoding="utf-8"?>
-<rss version="2.0"><channel><title>Apache DataFusion 
Blog</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Thu,
 10 Apr 2025 00:00:00 +0000</lastBuildDate><item><title>tpchgen-rs: World’s 
fastest open source TPCH data generator, written in 
Rust</title><link>https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-generator</link><description>&lt;!--
+<rss version="2.0"><channel><title>Apache DataFusion 
Blog</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Thu,
 10 Apr 2025 00:00:00 +0000</lastBuildDate><item><title>`tpchgen-rs` World’s 
fastest open source TPCH data generator, written in 
Rust</title><link>https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-generator</link><description>&lt;!--
 {% comment %}
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
diff --git a/blog/feeds/all-en.atom.xml b/blog/feeds/all-en.atom.xml
index aeb7b05..7c96636 100644
--- a/blog/feeds/all-en.atom.xml
+++ b/blog/feeds/all-en.atom.xml
@@ -1,5 +1,5 @@
 <?xml version="1.0" encoding="utf-8"?>
-<feed xmlns="http://www.w3.org/2005/Atom";><title>Apache DataFusion 
Blog</title><link href="https://datafusion.apache.org/blog/"; 
rel="alternate"></link><link 
href="https://datafusion.apache.org/blog/feeds/all-en.atom.xml"; 
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-04-10T00:00:00+00:00</updated><subtitle></subtitle><entry><title>tpchgen-rs:
 World’s fastest open source TPCH data generator, written in Rust</title><link 
href="https://datafusion.apache.org/blog [...]
+<feed xmlns="http://www.w3.org/2005/Atom";><title>Apache DataFusion 
Blog</title><link href="https://datafusion.apache.org/blog/"; 
rel="alternate"></link><link 
href="https://datafusion.apache.org/blog/feeds/all-en.atom.xml"; 
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-04-10T00:00:00+00:00</updated><subtitle></subtitle><entry><title>`tpchgen-rs`
 World’s fastest open source TPCH data generator, written in Rust</title><link 
href="https://datafusion.apache.org/blo [...]
 {% comment %}
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
@@ -57,7 +57,7 @@ th, td {
 &lt;p&gt;We used Rust and open source development to build &lt;a 
href="https://github.com/alamb/tpchgen-rs"&gt;tpchgen-rs&lt;/a&gt;, a fully open
 TPCH data generator over 10x faster than any other implementation  we know 
of.&lt;/p&gt;
 &lt;p&gt;About the Authors:
-- &lt;a href="https://www.linkedin.com/in/andrewalamb/"&gt;Andrew 
Lamb&lt;/a&gt; (&lt;a href="https://github.com/alamb"&gt;@alamb&lt;/a&gt;) is a 
Staff Engineer at &lt;a 
href="https://www.influxdata.com/"&gt;InfluxData&lt;/a&gt;) and a PMC member of 
&lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; and 
&lt;a href="https://arrow.apache.org/"&gt;Apache Arrow&lt;/a&gt;.
+- &lt;a href="https://www.linkedin.com/in/andrewalamb/"&gt;Andrew 
Lamb&lt;/a&gt; (&lt;a href="https://github.com/alamb"&gt;@alamb&lt;/a&gt;) is a 
Staff Engineer at &lt;a 
href="https://www.influxdata.com/"&gt;InfluxData&lt;/a&gt; and a PMC member of 
&lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; and 
&lt;a href="https://arrow.apache.org/"&gt;Apache Arrow&lt;/a&gt;.
 - Achraf B (&lt;a 
href="https://github.com/clflushopt"&gt;@clflushopt&lt;/a&gt;) is a Software 
Engineer at &lt;a href="https://optable.co/"&gt;Optable&lt;/a&gt; where he 
works on data infrastructure.
 - &lt;a href="https://www.linkedin.com/in/scsmithr/"&gt;Sean Smith&lt;/a&gt; 
(&lt;a href="https://github.com/scsmithr"&gt;@scsmithr&lt;/a&gt;) is the 
founder of  focused on building a fast analytics database.&lt;/p&gt;
 &lt;p&gt;It is now possible to create the TPCH SF=100 dataset in 72.23 seconds 
(1.4 GB/s
@@ -70,16 +70,14 @@ analytical engines such as DataFusion.&lt;/p&gt;
 &lt;p&gt;&lt;strong&gt;Figure 1&lt;/strong&gt;: Time to create TPCH dataset 
for Scale Factor (see below) 1, 10,
 100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core 
GCP
 VM. For Scale Factor(SF) 100 &lt;code&gt;tpchgen&lt;/code&gt; takes 1 minute 
and 14 seconds and
-&lt;a href="https://duckdb.org/"&gt;DuckDB&lt;/a&gt; takes 17 minutes and 48 
seconds. For SF=1000,
-&lt;code&gt;tpchgen&lt;/code&gt; takes 10 minutes and 26 and uses about 5 GB 
of RAM at peak, and we
-could not measure DuckDB&amp;rsquo;s time as it &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator"&gt;requires
 647 GB of
-RAM&lt;/a&gt;,
-more than the 88 GB that was available on our test machine. The testing
-methodology is in the
+&lt;a href="https://duckdb.org"&gt;DuckDB&lt;/a&gt; takes 17 minutes and 48 
seconds. For SF=1000, &lt;code&gt;tpchgen&lt;/code&gt; takes 10
+minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure
+DuckDB&amp;rsquo;s time as it &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator"&gt;requires
 647 GB of RAM&lt;/a&gt;, more than the 88 GB that was
+available on our test machine. The testing methodology is in the
 &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
 &lt;p&gt;This blog explains what TPCH is, how we ported the vintage C data 
generator to
 Rust (yes, &lt;a 
href="https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/"&gt;RWIR&lt;/a&gt;)
 and optimized its performance over the course of a few weeks
-of part time work. We began this project so we can easily generate TPCH data in
+of part-time work. We began this project so we can easily generate TPCH data in
 &lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; and 
&lt;a href="https://glaredb.com/"&gt;GlareDB&lt;/a&gt;.&lt;/p&gt;
 &lt;h1&gt;Try if for yourself&lt;/h1&gt;
 &lt;p&gt;The tool is entirely open source under the &lt;a 
href="https://www.apache.org/licenses/LICENSE-2.0"&gt;Apache 2.0 
license&lt;/a&gt;. Visit the &lt;a 
href="https://github.com/clflushopt/tpchgen-rs"&gt;tpchgen-rs 
repository&lt;/a&gt; or try it for yourself by run the following commands after 
&lt;a href="https://www.rust-lang.org/tools/install"&gt;installing 
Rust&lt;/a&gt;:&lt;/p&gt;
@@ -92,33 +90,31 @@ $ tpchgen-cli -s &lt;span class="m"&gt;1&lt;/span&gt;
 $ tpchgen-cli -s &lt;span class="m"&gt;10&lt;/span&gt; --format&lt;span 
class="o"&gt;=&lt;/span&gt;parquet
 &lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
 &lt;h1&gt;What is TPCH / dbgen?&lt;/h1&gt;
-&lt;p&gt;The popular &lt;a href="https://www.tpc.org/tpch/"&gt;TPC-H&lt;/a&gt; 
benchmark (commonly referred to
-as TPCH) helps evaluate the performance of database systems on
-&lt;a 
href="https://en.wikipedia.org/wiki/Online_analytical_processing"&gt;OLAP&lt;/a&gt;
 queries&lt;em&gt;,
-&lt;/em&gt;the kind used to build BI dashboards.&lt;/p&gt;
+&lt;p&gt;The popular &lt;a href="https://www.tpc.org/tpch/"&gt;TPC-H&lt;/a&gt; 
benchmark (commonly referred to as TPCH) helps evaluate the
+performance of database systems on &lt;a 
href="https://en.wikipedia.org/wiki/Online_analytical_processing"&gt;OLAP&lt;/a&gt;
 queries&lt;em&gt;, &lt;/em&gt;the kind used to build BI
+dashboards.&lt;/p&gt;
 &lt;p&gt;TPCH has become a de facto standard for analytic systems. While there 
are &lt;a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf"&gt;well
-known&lt;/a&gt; limitations as the data
-and queries do not well represent many real world use cases, the majority of
-analytic database papers and industrial systems still use TPCH query 
performance
-benchmarks as a baseline. You will inevitably find multiple results for  
&amp;ldquo;&lt;code&gt;TPCH
-Performance &amp;amp;lt;your favorite database&amp;gt;&lt;/code&gt;&amp;rdquo; 
in any search engine.&lt;/p&gt;
+known&lt;/a&gt; limitations as the data and queries do not well represent many 
real world
+use cases, the majority of analytic database papers and industrial systems 
still
+use TPCH query performance benchmarks as a baseline. You will inevitably find
+multiple results for  &amp;ldquo;&lt;code&gt;TPCH Performance &amp;amp;lt;your 
favorite database&amp;gt;&lt;/code&gt;&amp;rdquo; in any
+search engine.&lt;/p&gt;
 &lt;p&gt;The benchmark was created at a time when access to high performance 
analytical
-systems was not widespread, so the &lt;a 
href="https://www.tpc.org/"&gt;Transaction Processing Performance
-Council&lt;/a&gt; defined a process of formal result verification.
-More recently, given the broad availability of free and open source database
-systems, it is common for users to run and verify TPCH performance 
themselves.&lt;/p&gt;
+systems was not widespread, so the &lt;a 
href="https://www.tpc.org/"&gt;Transaction Processing Performance 
Council&lt;/a&gt;
+defined a process of formal result verification. More recently, given the broad
+availability of free and open source database systems, it is common for users 
to
+run and verify TPCH performance themselves.&lt;/p&gt;
 &lt;p&gt;TPCH simulates a business environment with eight tables: 
&lt;code&gt;REGION&lt;/code&gt;, &lt;code&gt;NATION&lt;/code&gt;,
 &lt;code&gt;SUPPLIER&lt;/code&gt;, &lt;code&gt;CUSTOMER&lt;/code&gt;, 
&lt;code&gt;PART&lt;/code&gt;, &lt;code&gt;PARTSUPP&lt;/code&gt;, 
&lt;code&gt;ORDERS&lt;/code&gt;, and &lt;code&gt;LINEITEM&lt;/code&gt;. These
 tables are linked by foreign keys in a normalized schema representing a supply
 chain with parts, suppliers, customers and orders. The benchmark itself is 22
 SQL queries containing joins, aggregations, and sorting operations.&lt;/p&gt;
-&lt;p&gt;The queries run against data created with
-&lt;code&gt;&lt;a 
href="https://github.com/electrum/tpch-dbgen"&gt;dbgen&lt;/a&gt;&lt;/code&gt;, 
a program written
-in a pre &lt;a href="https://en.wikipedia.org/wiki/C99"&gt;C-99&lt;/a&gt; 
dialect, which generates data
-in a format called &lt;em&gt;TBL&lt;/em&gt; (example in Figure 2). 
&lt;code&gt;dbgen&lt;/code&gt; creates data for each of
-the 8 tables for a certain &lt;em&gt;Scale Factor&lt;/em&gt;, commonly 
abbreviated as SF. Example
-Scale Factors and corresponding dataset sizes are shown in Table 1. There is no
-theoretical upper bound on the Scale Factor.&lt;/p&gt;
+&lt;p&gt;The queries run against data created with &lt;code&gt;&lt;a 
href="https://github.com/electrum/tpch-dbgen)"&gt;dbgen&lt;/a&gt;&lt;/code&gt;, 
a program
+written in a pre &lt;a 
href="https://en.wikipedia.org/wiki/C99"&gt;C-99&lt;/a&gt; dialect, which 
generates data in a format called &lt;em&gt;TBL&lt;/em&gt;
+(example in Figure 2). &lt;code&gt;dbgen&lt;/code&gt; creates data for each of 
the 8 tables for a
+certain &lt;em&gt;Scale Factor&lt;/em&gt;, commonly abbreviated as SF. Example 
Scale Factors and
+corresponding dataset sizes are shown in Table 1. There is no theoretical upper
+bound on the Scale Factor.&lt;/p&gt;
 &lt;div 
class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;103|2844|845|3|23|40177.32|0.01|0.04|N|O|1996-09-11|1996-09-18|1996-09-26|NONE|FOB|ironic
 accou|
 
229|10540|801|6|29|42065.66|0.04|0.00|R|F|1994-01-14|1994-02-16|1994-01-22|NONE|FOB|uriously
 pending |
 
263|2396|649|1|22|28564.58|0.06|0.08|R|F|1994-08-24|1994-06-20|1994-09-09|NONE|FOB|efully
 express fo|
@@ -178,7 +174,9 @@ theoretical upper bound on the Scale Factor.&lt;/p&gt;
 &lt;/table&gt;
 &lt;p&gt;&lt;strong&gt;Table 1&lt;/strong&gt;: TPCH data set sizes at 
different scale factors for both TBL and &lt;a 
href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt;.&lt;/p&gt;
 &lt;h1&gt;Why do we need a new TPCH Data generator?&lt;/h1&gt;
-&lt;p&gt;Despite the known limitations of the TPCH benchmark, it is so well 
known that it is used frequently in database performance analysis. To run TPCH, 
you must first load the data, using &lt;code&gt;dbgen&lt;/code&gt;, which is 
not ideal for several reasons:&lt;/p&gt;
+&lt;p&gt;Despite the known limitations of the TPCH benchmark, it is so well 
known that it
+is used frequently in database performance analysis. To run TPCH, you must 
first
+load the data, using &lt;code&gt;dbgen&lt;/code&gt;, which is not ideal for 
several reasons:&lt;/p&gt;
 &lt;ol&gt;
 &lt;li&gt;You must find and compile a copy of the 15+ year old C program (for 
example &lt;a 
href="https://github.com/electrum/tpch-dbgen"&gt;electrum/tpch-dbgen&lt;/a&gt;)&lt;/li&gt;
 &lt;li&gt;&lt;code&gt;dbgen&lt;/code&gt; requires substantial time (Figure 3) 
and is not able to use more than one core.&lt;/li&gt;
@@ -191,53 +189,48 @@ shown in blue. &lt;code&gt;tpchgen&lt;/code&gt; 
restricted to a single core is s
 &lt;code&gt;dbgen&lt;/code&gt; is shown in green and 
&lt;code&gt;dbgen&lt;/code&gt; modified to use &lt;code&gt;-O3&lt;/code&gt; 
optimization level
 is shown in yellow.&lt;/p&gt;
 &lt;p&gt;&lt;code&gt;dbgen&lt;/code&gt; is so inconvenient and takes so long 
that vendors often provide
-pre-loaded TPCH data, for example &lt;a 
href="https://docs.snowflake.com/en/user-guide/sample-data-tpch"&gt;Snowflake 
Sample
-Data&lt;/a&gt;, &lt;a 
href="https://docs.databricks.com/aws/en/discover/databricks-datasets"&gt;DataBricks
-Sample
-datasets&lt;/a&gt; and
-&lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets"&gt;DuckDB
 Pre-Generated Data
-Sets&lt;/a&gt;.&lt;/p&gt;
-&lt;p&gt;In addition to pre-generated datasets, DuckDB also provides a
-&lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html"&gt;TPCH&lt;/a&gt; 
extension for
-generating TPCH datasets within DuckDB. This is so much easier to use than the
-current alternatives that it leads many researchers and other thought leaders 
to
-use DuckDB to evaluate new ideas. For example, &lt;a 
href="https://github.com/lmwnshn"&gt;Wan Shen
-Lim&lt;/a&gt; explicitly &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;mentioned the ease 
of creating the
-TPCH dataset&lt;/a&gt; as one reason
-the first student project of &lt;a 
href="https://15799.courses.cs.cmu.edu/spring2025/"&gt;CMU-799 Spring
-2025&lt;/a&gt; used DuckDB.&lt;/p&gt;
-&lt;p&gt;As beneficial as the DuckDB TPCH extension is, it is non ideal for 
several reasons:&lt;/p&gt;
+pre-loaded TPCH data, for example [Snowflake Sample Data], &lt;a 
href="https://docs.databricks.com/aws/en/discover/databricks-datasets"&gt;DataBricks
 Sample
+datasets&lt;/a&gt; and &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets"&gt;DuckDB
 Pre-Generated Data Sets&lt;/a&gt;.&lt;/p&gt;
+&lt;p&gt;In addition to pre-generated datasets, DuckDB also provides a &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html"&gt;TPCH 
extension&lt;/a&gt; 
+for generating TPCH datasets within DuckDB. This is so much easier to use than
+the current alternatives that it leads many researchers and other thought
+leaders to use DuckDB to evaluate new ideas. For example, &lt;a 
href="https://github.com/lmwnshn"&gt;Wan Shen
+Lim&lt;/a&gt;explicitly &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;mentioned the ease 
of creating the TPCH dataset&lt;/a&gt; as one reason
+the first student project of &lt;a 
href="https://15799.courses.cs.cmu.edu/spring2025/"&gt;CMU-799 Spring 
2025&lt;/a&gt; used DuckDB.&lt;/p&gt;
+&lt;p&gt;As beneficial as the DuckDB TPCH extension is, it is non-ideal for 
several reasons:&lt;/p&gt;
 &lt;ol&gt;
 &lt;li&gt;Creates data in a proprietary format, which requires export to use 
in other systems.&lt;/li&gt;
 &lt;li&gt;Requires significant time (e.g. 17 minutes for Scale Factor 
10).&lt;/li&gt;
 &lt;li&gt;Requires unnecessarily large amounts of memory (e.g. 71 GB for Scale 
Factor 10)&lt;/li&gt;
 &lt;/ol&gt;
-&lt;p&gt;The above limitations makes it impractical to generate Scale Factor 
100 and above on laptops or standard workstations, though DuckDB offers&lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets"&gt;
 pre-computed files&lt;/a&gt; for larger factors[^3].&lt;/p&gt;
+&lt;p&gt;The above limitations makes it impractical to generate Scale Factor 
100 and
+above on laptops or standard workstations, though DuckDB offers &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets"&gt;pre-computed
+files&lt;/a&gt; for larger factors[^3].&lt;/p&gt;
 &lt;h1&gt;Why Rust?&lt;/h1&gt;
-&lt;p&gt;Realistically we used Rust because we wanted to integrate the data 
generator into &lt;a href="https://datafusion.apache.org/"&gt;Apache 
DataFusion&lt;/a&gt; and &lt;a 
href="https://glaredb.com/"&gt;GlareDB&lt;/a&gt;. However, we also believe Rust 
is superior to C/C++ due to its comparable performance, but much higher 
programmer productivity (Figure 4). Productivity in this case refers to the 
ease of optimizing and adding multithreading without introducing hard to debug 
memory sa [...]
-&lt;p&gt;While Rust does allow unsafe access to memory (eliding bounds 
checking, for example), when required for performance, our implementation is 
entirely memory safe. The only &lt;a 
href="https://github.com/search?q=repo%3Aclflushopt%2Ftpchgen-rs%20unsafe&amp;amp;type=code"&gt;unsafe&lt;/a&gt;
 code is used to &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/blob/c651da1fc309f9cb3872cbdf71e4796904dc62c6/tpchgen/src/text.rs#L72"&gt;skip&lt;/a&gt;
 UTF8 validation on known ASCII strin [...]
+&lt;p&gt;Realistically we used Rust because we wanted to integrate the data 
generator
+into &lt;a href="https://datafusion.apache.org/"&gt;Apache 
DataFusion&lt;/a&gt; and &lt;a 
href="https://glaredb.com/"&gt;GlareDB&lt;/a&gt;. However, we also believe Rust 
is
+superior to C/C++ due to its comparable performance, but much higher programmer
+productivity (Figure 4). Productivity in this case refers to the ease of
+optimizing and adding multithreading without introducing hard to debug memory
+safety or concurrency issues.&lt;/p&gt;
+&lt;p&gt;While Rust does allow unsafe access to memory (eliding bounds 
checking, for
+example), when required for performance, our implementation is entirely memory
+safe. The only &lt;a 
href="https://github.com/search?q=repo%3Aclflushopt%2Ftpchgen-rs%20unsafe&amp;amp;type=code"&gt;unsafe&lt;/a&gt;
 code is used to &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/blob/c651da1fc309f9cb3872cbdf71e4796904dc62c6/tpchgen/src/text.rs#L72"&gt;skip&lt;/a&gt;
 UTF8 validation on known ASCII
+strings.&lt;/p&gt;
 &lt;p&gt;&lt;img alt="Lamb Theory on Evolution of Systems Languages" 
class="img-responsive" 
src="/blog/images/fastest-tpch-generator/lamb-theory.png" 
width="80%"/&gt;&lt;/p&gt;
 &lt;p&gt;&lt;strong&gt;Figure 4&lt;/strong&gt;: Lamb Theory of System Language 
Evolution from &lt;a 
href="https://midas.bu.edu/assets/slides/andrew_lamb_slides.pdf"&gt;Boston 
University
 MiDAS Fall 2024 (Data Systems Seminar)&lt;/a&gt;, &lt;a 
href="https://www.youtube.com/watch?v=CpnxuBwHbUc"&gt;recording&lt;/a&gt;. 
Special
 thanks to &lt;a 
href="https://x.com/KurtFehlhauer"&gt;@KurtFehlhauer&lt;/a&gt;&lt;/p&gt;
 &lt;h1&gt;How: The Journey&lt;/h1&gt;
 &lt;p&gt;We did it together as a team in the open over the course of a few 
weeks weeks.
-&lt;a href="https://github.com/lmwnshn"&gt;Wan Shen Lim&lt;/a&gt; inspired the 
project by pointing out
-the benefits of &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;easy TPCH dataset
-creation&lt;/a&gt;  and &lt;a 
href="https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600"&gt;suggesting
 we
-check out a Java port on February 11,
-2025&lt;/a&gt;.
-Achraf made &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/53d3402680422a15349ece0a7ea3c3f001018ba0"&gt;first
 commit a few days
-later&lt;/a&gt;
-on February 16, and &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/9bb386a4c55b8cf93ffac1b98f29b5da990ee79e"&gt;Andrew
 and Sean started helping on March 8,
-2025&lt;/a&gt;
-and we &lt;a href="https://crates.io/crates/tpchgen/0.1.0"&gt;released version 
0.1&lt;/a&gt; on March
-30, 2025.&lt;/p&gt;
+&lt;a href="https://github.com/lmwnshn"&gt;Wan Shen Lim&lt;/a&gt; inspired the 
project by pointing out the benefits of &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;easy TPCH
+dataset creation&lt;/a&gt;  and &lt;a 
href="https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600"&gt;suggesting
 we check out a Java port on February 11,
+2025&lt;/a&gt;. Achraf made &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/53d3402680422a15349ece0a7ea3c3f001018ba0"&gt;first
 commit a few days later&lt;/a&gt; on February 16, and &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/9bb386a4c55b8cf93ffac1b98f29b5da990ee79e"&gt;Andrew
+and Sean started helping on March 8, 2025&lt;/a&gt; and we &lt;a 
href="https://crates.io/crates/tpchgen/0.1.0"&gt;released version 0.1&lt;/a&gt; 
on
+March 30, 2025.&lt;/p&gt;
 &lt;h2&gt;Optimizing Single Threaded Performance&lt;/h2&gt;
-&lt;p&gt;Archaf &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/16"&gt;completed the end to 
end conformance
-tests&lt;/a&gt;, to ensure correctness,
-and an initial &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/12"&gt;cli check 
in&lt;/a&gt;
-on March 15, 2025.&lt;/p&gt;
+&lt;p&gt;Archaf &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/16"&gt;completed the end to 
end conformance tests&lt;/a&gt;, to ensure correctness, and
+an initial &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/12"&gt;cli check 
in&lt;/a&gt; on March 15, 2025.&lt;/p&gt;
 &lt;p&gt;On a Macbook Pro M3 (Nov 2023), the initial performance numbers were 
actually
 slower than the original Java implementation which was ported 😭. This 
wasn&amp;rsquo;t
 surprising since the focus of the first version was to get a byte of byte
@@ -269,8 +262,7 @@ approach them.&lt;/p&gt;
    &lt;/td&gt;
 &lt;/tr&gt;
 &lt;/table&gt;
-&lt;p&gt;&lt;strong&gt;Table 2&lt;/strong&gt;: Performance of running &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/12"&gt;the initial
-tpchgen-cli&lt;/a&gt;, measured with
+&lt;p&gt;&lt;strong&gt;Table 2&lt;/strong&gt;: Performance of running &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/12"&gt;the initial 
tpchgen-cli&lt;/a&gt;, measured with
 &lt;code&gt;time target/release/tpchgen-cli -s 
$SCALE_FACTOR&lt;/code&gt;&lt;/p&gt;
 &lt;p&gt;With this strong foundation we began optimizing the code using 
Rust&amp;rsquo;s low level
 memory management to improve performance while retaining memory safely. We 
spent
@@ -319,15 +311,11 @@ list of optimizations:&lt;/p&gt;
    &lt;/td&gt;
 &lt;/tr&gt;
 &lt;/table&gt;
-&lt;p&gt;&lt;strong&gt;Table 3&lt;/strong&gt;: Current single threaded  
tpchgen-cli, measured with &lt;code&gt;time target/release/tpchgen-cli -s 
$SCALE_FACTOR --num-threads=1&lt;/code&gt;&lt;/p&gt;
+&lt;p&gt;&lt;strong&gt;Table 3&lt;/strong&gt;: Single threaded 
&lt;code&gt;tpchgen-cli&lt;/code&gt; performance, measured with 
&lt;code&gt;time target/release/tpchgen-cli -s $SCALE_FACTOR 
--num-threads=1&lt;/code&gt;&lt;/p&gt;
 &lt;h2&gt;Multi-threading&lt;/h2&gt;
-&lt;p&gt;Then we applied &lt;a 
href="https://doc.rust-lang.org/book/ch16-00-concurrency.html"&gt;Rust&amp;rsquo;s
 fearless
-concurrency&lt;/a&gt; &amp;ndash; with a
-single, &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/ab720a70cdc80a711f4a3dda6bac05445106f499"&gt;small
 PR (272 net new lines)
-&lt;/a&gt;we
-updated the same memory safe code to run with multiple threads and consume
-bounded memory using &lt;a 
href="https://thenewstack.io/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/"&gt;tokio
 for the thread
-scheduler&lt;/a&gt;[^4].&lt;/p&gt;
+&lt;p&gt;Then we applied &lt;a 
href="https://doc.rust-lang.org/book/ch16-00-concurrency.html"&gt;Rust&amp;rsquo;s
 fearless concurrency&lt;/a&gt; &amp;ndash; with a single, &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/ab720a70cdc80a711f4a3dda6bac05445106f499"&gt;small
 PR&lt;/a&gt; (272
+net new lines) we updated the same memory safe code to run with multiple 
threads
+and consume bounded memory using &lt;a 
href="https://thenewstack.io/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/"&gt;tokio
 for the thread scheduler&lt;/a&gt;)[^4].&lt;/p&gt;
 &lt;p&gt;As shown in Table 4, with this change, tpchgen-cli generates the full 
SF=100
 dataset in 32 seconds (which is 3.3 GB/sec 🤯). Further investigation reveals
 that at SF=100 our generator is actually IO bound (which is not the case for
@@ -387,18 +375,36 @@ When writing to &lt;code&gt;/dev/null&lt;/code&gt; 
tpchgen  generates the entire
 &lt;/tr&gt;
 &lt;/table&gt;
 &lt;p&gt;&lt;strong&gt;Table 4&lt;/strong&gt;: tpchgen-cli (multithreaded) 
performance measured with &lt;code&gt;time target/release/tpchgen-cli -s 
$SCALE_FACTOR&lt;/code&gt;&lt;/p&gt;
-&lt;p&gt;Using Rust and async streams, the data generator is also fully 
streaming: memory use does not increase with increasing data size / scale 
factors[^5]. The DuckDB generator seems to &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator"&gt;require
 far more memory&lt;/a&gt; than is commonly available on developer laptops and 
memory use increases with scale factor. With 
&lt;code&gt;tpchgen-cli&lt;/code&gt; it is perfectly possible to cr [...]
+&lt;p&gt;Using Rust and async streams, the data generator is also fully 
streaming: memory
+use does not increase with increasing data size / scale factors[^5]. The DuckDB
+generator seems to &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator"&gt;require
 far more memory&lt;/a&gt; than is commonly available on
+developer laptops and memory use increases with scale factor. With 
&lt;code&gt;tpchgen-cli&lt;/code&gt;
+it is perfectly possible to create data for SF=10000 or larger on a machine 
with
+16GB of memory (assuming sufficient storage capacity).&lt;/p&gt;
 &lt;h2&gt;Direct to parquet&lt;/h2&gt;
-&lt;p&gt;At this point, tpchgen-cli could very quickly generate the TBL 
format. However, as described above, the TBL is annoying to work with, 
because&lt;/p&gt;
+&lt;p&gt;At this point, &lt;code&gt;tpchgen-cli&lt;/code&gt; could very 
quickly generate the TBL format.
+However, as described above, the TBL is annoying to work with, 
because&lt;/p&gt;
 &lt;ol&gt;
 &lt;li&gt;It has no header&lt;/li&gt;
 &lt;li&gt;It is like a CSV but the delimiter is &amp;lsquo;|`&lt;/li&gt;
 &lt;li&gt;Each line ends with an extra &lt;code&gt;|&lt;/code&gt; delimiter 
before the newline 🙄&lt;/li&gt;
 &lt;li&gt;No system that we know can read them without additional 
configuration.&lt;/li&gt;
 &lt;/ol&gt;
-&lt;p&gt;We next &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/54"&gt;added support for 
CSV&lt;/a&gt; generation (special thanks &lt;a 
href="https://github.com/niebayes"&gt;niebayes&lt;/a&gt; from Datalayers for 
finding and &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/66"&gt;fixing&lt;/a&gt; 
&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/issues/73"&gt;several&lt;/a&gt; 
&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/issues/65"&gt;bugs&lt;/a&gt;) 
which  [...]
-&lt;p&gt;What would be far more useful is a typed, efficient columnar format 
such as Apache Parquet which is supported by all modern query engines. So we 
&lt;a href="https://github.com/clflushopt/tpchgen-rs/pull/71"&gt;made&lt;/a&gt; 
a &lt;a 
href="https://crates.io/crates/tpchgen-arrow"&gt;tpchgen-arrow&lt;/a&gt; crate 
to create&lt;a href="https://arrow.apache.org/"&gt; Apache Arrow&lt;/a&gt; 
arrays directly and then &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/61"&gt;a smal [...]
-&lt;p&gt;This approach was simple, fast and scalable, as shown in Table 5. 
Even though creating Parquet files is significantly more computationally 
expensive than TBL or CSV, tpchgen-cli creates the full SF=100 parquet format 
dataset in less than 45 seconds.&lt;/p&gt;
+&lt;p&gt;We next &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/54"&gt;added support for 
CSV&lt;/a&gt; generation (special thanks &lt;a 
href="https://github.com/niebayes"&gt;@niebayes&lt;/a&gt; from
+Datalayers for finding and &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/issues/73"&gt;fixing&lt;/a&gt; 
&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/issues/65"&gt;bugs&lt;/a&gt;) 
which performs at the same
+speed as TBL. While CSV files are far more standard than TBL, they must still 
be
+parsed prior to load and automatic type inference may not deduce the types
+needed for the TPCH benchmarks (e.g. floating point vs Decimal).&lt;/p&gt;
+&lt;p&gt;What would be far more useful is a typed, efficient columnar format 
such as
+Apache Parquet which is supported by all modern query engines. So we &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/71"&gt;made&lt;/a&gt; a
+&lt;a href="https://crates.io/crates/tpchgen-arrow"&gt;tpchgen-arrow&lt;/a&gt; 
crate to create &lt;a href="https://arrow.apache.org/"&gt;Apache 
Arrow&lt;/a&gt; arrays directly and then &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/61"&gt;a small
+300 line PR&lt;/a&gt; to feed those arrays to the &lt;a 
href="https://crates.io/crates/parquet"&gt;Rust Parquet writer&lt;/a&gt;, again 
using
+tokio for parallelized but memory bound work.&lt;/p&gt;
+&lt;p&gt;This approach was simple, fast and scalable, as shown in Table 5. 
Even though
+creating Parquet files is significantly more computationally expensive than TBL
+or CSV, tpchgen-cli creates the full SF=100 parquet format dataset in less than
+45 seconds.&lt;/p&gt;
+&lt;p&gt;&lt;a href="https://arrow.apache.org/"&gt;Apache 
Arrow&lt;/a&gt;(https://arrow.apache.org/&lt;/p&gt;
 &lt;table&gt;
 &lt;tr&gt;
 &lt;td&gt;Scale Factor
@@ -441,10 +447,19 @@ When writing to &lt;code&gt;/dev/null&lt;/code&gt; 
tpchgen  generates the entire
    &lt;/td&gt;
 &lt;/tr&gt;
 &lt;/table&gt;
-&lt;p&gt;&lt;strong&gt;Table 5&lt;/strong&gt;: tpchgen-cli parquet generation 
performance measured with  &lt;code&gt;time target/release/tpchgen-cli -s 
$SCALE_FACTOR --format=parquet&lt;/code&gt;&lt;/p&gt;
+&lt;p&gt;&lt;strong&gt;Table 5&lt;/strong&gt;: 
&lt;code&gt;tpchgen-cli&lt;/code&gt; Parquet generation performance measured 
with  &lt;code&gt;time
+target/release/tpchgen-cli -s $SCALE_FACTOR 
--format=parquet&lt;/code&gt;&lt;/p&gt;
 &lt;h1&gt;Conclusion 👊🎤&lt;/h1&gt;
-&lt;p&gt;With a few days, some fellow database nerds, and the power of Rust we 
made something 10x better than currently exists. We hope it inspires more 
research into analytical systems using the TPCH dataset and that people build 
awesome things with it. For example, Sean has already added &lt;a 
href="https://github.com/GlareDB/glaredb/pull/3549"&gt;on-demand generation of 
tables to GlareDB&lt;/a&gt;. Please consider joining us and helping out at 
&lt;a href="https://github.com/clflushopt [...]
-&lt;p&gt;We met while working together on Apache DataFusion in various 
capacities. If you are looking for a community of like minded people hacking on 
databases, we welcome you to &lt;a 
href="https://datafusion.apache.org/contributor-guide/communication.html"&gt;come
 join us&lt;/a&gt;. We are in the process of integrating this into DataFusion 
(see &lt;a 
href="https://github.com/apache/datafusion/issues/14608"&gt;apache/datafusion#14608&lt;/a&gt;)
 if you are interested in helping 🎣&lt;/p&gt;
+&lt;p&gt;With a few days, some fellow database nerds, and the power of Rust we 
made
+something 10x better than currently exists. We hope it inspires more research
+into analytical systems using the TPCH dataset and that people build awesome
+things with it. For example, Sean has already added &lt;a 
href="https://github.com/GlareDB/glaredb/pull/3549"&gt;on-demand generation of
+tables to GlareDB&lt;/a&gt;. Please consider joining us and helping out at
+&lt;a 
href="https://github.com/clflushopt/tpchgen-rs"&gt;https://github.com/clflushopt/tpchgen-rs&lt;/a&gt;.&lt;/p&gt;
+&lt;p&gt;We met while working together on Apache DataFusion in various 
capacities. If you
+are looking for a community of like minded people hacking on databases, we
+welcome you to &lt;a 
href="https://datafusion.apache.org/contributor-guide/communication.html"&gt;come
 join us&lt;/a&gt;. We are in the process of integrating this into
+DataFusion (see &lt;a 
href="https://github.com/apache/datafusion/issues/14608"&gt;apache/datafusion#14608&lt;/a&gt;)
 if you are interested in helping 🎣&lt;/p&gt;
 &lt;!-- Footnotes themselves at the bottom. --&gt;
 &lt;h2&gt;Notes&lt;/h2&gt;
 &lt;p&gt;[^2]: It is possible to embed the dbgen code, which appears to be the 
approach taken by DuckDB. This approach was tried in GlareDB (&lt;a 
href="https://github.com/GlareDB/glaredb/pull/3313"&gt;GlareDB/glaredb#3313&lt;/a&gt;),
 but ultimately shelved given the amount of effort needed to adapt and isolate 
the dbgen code.&lt;/p&gt;
diff --git a/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.atom.xml 
b/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.atom.xml
index 8b065ef..598f20d 100644
--- a/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.atom.xml
+++ b/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.atom.xml
@@ -1,5 +1,5 @@
 <?xml version="1.0" encoding="utf-8"?>
-<feed xmlns="http://www.w3.org/2005/Atom";><title>Apache DataFusion Blog - 
Andrew Lamb, Achraf B, and Sean Smith</title><link 
href="https://datafusion.apache.org/blog/"; rel="alternate"></link><link 
href="https://datafusion.apache.org/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.atom.xml";
 
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-04-10T00:00:00+00:00</updated><subtitle></subtitle><entry><title>tpchgen-rs:
 World’s fastest open source TPCH data generator,  [...]
+<feed xmlns="http://www.w3.org/2005/Atom";><title>Apache DataFusion Blog - 
Andrew Lamb, Achraf B, and Sean Smith</title><link 
href="https://datafusion.apache.org/blog/"; rel="alternate"></link><link 
href="https://datafusion.apache.org/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.atom.xml";
 
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-04-10T00:00:00+00:00</updated><subtitle></subtitle><entry><title>`tpchgen-rs`
 World’s fastest open source TPCH data generator, [...]
 {% comment %}
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
@@ -57,7 +57,7 @@ th, td {
 &lt;p&gt;We used Rust and open source development to build &lt;a 
href="https://github.com/alamb/tpchgen-rs"&gt;tpchgen-rs&lt;/a&gt;, a fully open
 TPCH data generator over 10x faster than any other implementation  we know 
of.&lt;/p&gt;
 &lt;p&gt;About the Authors:
-- &lt;a href="https://www.linkedin.com/in/andrewalamb/"&gt;Andrew 
Lamb&lt;/a&gt; (&lt;a href="https://github.com/alamb"&gt;@alamb&lt;/a&gt;) is a 
Staff Engineer at &lt;a 
href="https://www.influxdata.com/"&gt;InfluxData&lt;/a&gt;) and a PMC member of 
&lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; and 
&lt;a href="https://arrow.apache.org/"&gt;Apache Arrow&lt;/a&gt;.
+- &lt;a href="https://www.linkedin.com/in/andrewalamb/"&gt;Andrew 
Lamb&lt;/a&gt; (&lt;a href="https://github.com/alamb"&gt;@alamb&lt;/a&gt;) is a 
Staff Engineer at &lt;a 
href="https://www.influxdata.com/"&gt;InfluxData&lt;/a&gt; and a PMC member of 
&lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; and 
&lt;a href="https://arrow.apache.org/"&gt;Apache Arrow&lt;/a&gt;.
 - Achraf B (&lt;a 
href="https://github.com/clflushopt"&gt;@clflushopt&lt;/a&gt;) is a Software 
Engineer at &lt;a href="https://optable.co/"&gt;Optable&lt;/a&gt; where he 
works on data infrastructure.
 - &lt;a href="https://www.linkedin.com/in/scsmithr/"&gt;Sean Smith&lt;/a&gt; 
(&lt;a href="https://github.com/scsmithr"&gt;@scsmithr&lt;/a&gt;) is the 
founder of  focused on building a fast analytics database.&lt;/p&gt;
 &lt;p&gt;It is now possible to create the TPCH SF=100 dataset in 72.23 seconds 
(1.4 GB/s
@@ -70,16 +70,14 @@ analytical engines such as DataFusion.&lt;/p&gt;
 &lt;p&gt;&lt;strong&gt;Figure 1&lt;/strong&gt;: Time to create TPCH dataset 
for Scale Factor (see below) 1, 10,
 100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core 
GCP
 VM. For Scale Factor(SF) 100 &lt;code&gt;tpchgen&lt;/code&gt; takes 1 minute 
and 14 seconds and
-&lt;a href="https://duckdb.org/"&gt;DuckDB&lt;/a&gt; takes 17 minutes and 48 
seconds. For SF=1000,
-&lt;code&gt;tpchgen&lt;/code&gt; takes 10 minutes and 26 and uses about 5 GB 
of RAM at peak, and we
-could not measure DuckDB&amp;rsquo;s time as it &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator"&gt;requires
 647 GB of
-RAM&lt;/a&gt;,
-more than the 88 GB that was available on our test machine. The testing
-methodology is in the
+&lt;a href="https://duckdb.org"&gt;DuckDB&lt;/a&gt; takes 17 minutes and 48 
seconds. For SF=1000, &lt;code&gt;tpchgen&lt;/code&gt; takes 10
+minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure
+DuckDB&amp;rsquo;s time as it &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator"&gt;requires
 647 GB of RAM&lt;/a&gt;, more than the 88 GB that was
+available on our test machine. The testing methodology is in the
 &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
 &lt;p&gt;This blog explains what TPCH is, how we ported the vintage C data 
generator to
 Rust (yes, &lt;a 
href="https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/"&gt;RWIR&lt;/a&gt;)
 and optimized its performance over the course of a few weeks
-of part time work. We began this project so we can easily generate TPCH data in
+of part-time work. We began this project so we can easily generate TPCH data in
 &lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; and 
&lt;a href="https://glaredb.com/"&gt;GlareDB&lt;/a&gt;.&lt;/p&gt;
 &lt;h1&gt;Try if for yourself&lt;/h1&gt;
 &lt;p&gt;The tool is entirely open source under the &lt;a 
href="https://www.apache.org/licenses/LICENSE-2.0"&gt;Apache 2.0 
license&lt;/a&gt;. Visit the &lt;a 
href="https://github.com/clflushopt/tpchgen-rs"&gt;tpchgen-rs 
repository&lt;/a&gt; or try it for yourself by run the following commands after 
&lt;a href="https://www.rust-lang.org/tools/install"&gt;installing 
Rust&lt;/a&gt;:&lt;/p&gt;
@@ -92,33 +90,31 @@ $ tpchgen-cli -s &lt;span class="m"&gt;1&lt;/span&gt;
 $ tpchgen-cli -s &lt;span class="m"&gt;10&lt;/span&gt; --format&lt;span 
class="o"&gt;=&lt;/span&gt;parquet
 &lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
 &lt;h1&gt;What is TPCH / dbgen?&lt;/h1&gt;
-&lt;p&gt;The popular &lt;a href="https://www.tpc.org/tpch/"&gt;TPC-H&lt;/a&gt; 
benchmark (commonly referred to
-as TPCH) helps evaluate the performance of database systems on
-&lt;a 
href="https://en.wikipedia.org/wiki/Online_analytical_processing"&gt;OLAP&lt;/a&gt;
 queries&lt;em&gt;,
-&lt;/em&gt;the kind used to build BI dashboards.&lt;/p&gt;
+&lt;p&gt;The popular &lt;a href="https://www.tpc.org/tpch/"&gt;TPC-H&lt;/a&gt; 
benchmark (commonly referred to as TPCH) helps evaluate the
+performance of database systems on &lt;a 
href="https://en.wikipedia.org/wiki/Online_analytical_processing"&gt;OLAP&lt;/a&gt;
 queries&lt;em&gt;, &lt;/em&gt;the kind used to build BI
+dashboards.&lt;/p&gt;
 &lt;p&gt;TPCH has become a de facto standard for analytic systems. While there 
are &lt;a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf"&gt;well
-known&lt;/a&gt; limitations as the data
-and queries do not well represent many real world use cases, the majority of
-analytic database papers and industrial systems still use TPCH query 
performance
-benchmarks as a baseline. You will inevitably find multiple results for  
&amp;ldquo;&lt;code&gt;TPCH
-Performance &amp;amp;lt;your favorite database&amp;gt;&lt;/code&gt;&amp;rdquo; 
in any search engine.&lt;/p&gt;
+known&lt;/a&gt; limitations as the data and queries do not well represent many 
real world
+use cases, the majority of analytic database papers and industrial systems 
still
+use TPCH query performance benchmarks as a baseline. You will inevitably find
+multiple results for  &amp;ldquo;&lt;code&gt;TPCH Performance &amp;amp;lt;your 
favorite database&amp;gt;&lt;/code&gt;&amp;rdquo; in any
+search engine.&lt;/p&gt;
 &lt;p&gt;The benchmark was created at a time when access to high performance 
analytical
-systems was not widespread, so the &lt;a 
href="https://www.tpc.org/"&gt;Transaction Processing Performance
-Council&lt;/a&gt; defined a process of formal result verification.
-More recently, given the broad availability of free and open source database
-systems, it is common for users to run and verify TPCH performance 
themselves.&lt;/p&gt;
+systems was not widespread, so the &lt;a 
href="https://www.tpc.org/"&gt;Transaction Processing Performance 
Council&lt;/a&gt;
+defined a process of formal result verification. More recently, given the broad
+availability of free and open source database systems, it is common for users 
to
+run and verify TPCH performance themselves.&lt;/p&gt;
 &lt;p&gt;TPCH simulates a business environment with eight tables: 
&lt;code&gt;REGION&lt;/code&gt;, &lt;code&gt;NATION&lt;/code&gt;,
 &lt;code&gt;SUPPLIER&lt;/code&gt;, &lt;code&gt;CUSTOMER&lt;/code&gt;, 
&lt;code&gt;PART&lt;/code&gt;, &lt;code&gt;PARTSUPP&lt;/code&gt;, 
&lt;code&gt;ORDERS&lt;/code&gt;, and &lt;code&gt;LINEITEM&lt;/code&gt;. These
 tables are linked by foreign keys in a normalized schema representing a supply
 chain with parts, suppliers, customers and orders. The benchmark itself is 22
 SQL queries containing joins, aggregations, and sorting operations.&lt;/p&gt;
-&lt;p&gt;The queries run against data created with
-&lt;code&gt;&lt;a 
href="https://github.com/electrum/tpch-dbgen"&gt;dbgen&lt;/a&gt;&lt;/code&gt;, 
a program written
-in a pre &lt;a href="https://en.wikipedia.org/wiki/C99"&gt;C-99&lt;/a&gt; 
dialect, which generates data
-in a format called &lt;em&gt;TBL&lt;/em&gt; (example in Figure 2). 
&lt;code&gt;dbgen&lt;/code&gt; creates data for each of
-the 8 tables for a certain &lt;em&gt;Scale Factor&lt;/em&gt;, commonly 
abbreviated as SF. Example
-Scale Factors and corresponding dataset sizes are shown in Table 1. There is no
-theoretical upper bound on the Scale Factor.&lt;/p&gt;
+&lt;p&gt;The queries run against data created with &lt;code&gt;&lt;a 
href="https://github.com/electrum/tpch-dbgen)"&gt;dbgen&lt;/a&gt;&lt;/code&gt;, 
a program
+written in a pre &lt;a 
href="https://en.wikipedia.org/wiki/C99"&gt;C-99&lt;/a&gt; dialect, which 
generates data in a format called &lt;em&gt;TBL&lt;/em&gt;
+(example in Figure 2). &lt;code&gt;dbgen&lt;/code&gt; creates data for each of 
the 8 tables for a
+certain &lt;em&gt;Scale Factor&lt;/em&gt;, commonly abbreviated as SF. Example 
Scale Factors and
+corresponding dataset sizes are shown in Table 1. There is no theoretical upper
+bound on the Scale Factor.&lt;/p&gt;
 &lt;div 
class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;103|2844|845|3|23|40177.32|0.01|0.04|N|O|1996-09-11|1996-09-18|1996-09-26|NONE|FOB|ironic
 accou|
 
229|10540|801|6|29|42065.66|0.04|0.00|R|F|1994-01-14|1994-02-16|1994-01-22|NONE|FOB|uriously
 pending |
 
263|2396|649|1|22|28564.58|0.06|0.08|R|F|1994-08-24|1994-06-20|1994-09-09|NONE|FOB|efully
 express fo|
@@ -178,7 +174,9 @@ theoretical upper bound on the Scale Factor.&lt;/p&gt;
 &lt;/table&gt;
 &lt;p&gt;&lt;strong&gt;Table 1&lt;/strong&gt;: TPCH data set sizes at 
different scale factors for both TBL and &lt;a 
href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt;.&lt;/p&gt;
 &lt;h1&gt;Why do we need a new TPCH Data generator?&lt;/h1&gt;
-&lt;p&gt;Despite the known limitations of the TPCH benchmark, it is so well 
known that it is used frequently in database performance analysis. To run TPCH, 
you must first load the data, using &lt;code&gt;dbgen&lt;/code&gt;, which is 
not ideal for several reasons:&lt;/p&gt;
+&lt;p&gt;Despite the known limitations of the TPCH benchmark, it is so well 
known that it
+is used frequently in database performance analysis. To run TPCH, you must 
first
+load the data, using &lt;code&gt;dbgen&lt;/code&gt;, which is not ideal for 
several reasons:&lt;/p&gt;
 &lt;ol&gt;
 &lt;li&gt;You must find and compile a copy of the 15+ year old C program (for 
example &lt;a 
href="https://github.com/electrum/tpch-dbgen"&gt;electrum/tpch-dbgen&lt;/a&gt;)&lt;/li&gt;
 &lt;li&gt;&lt;code&gt;dbgen&lt;/code&gt; requires substantial time (Figure 3) 
and is not able to use more than one core.&lt;/li&gt;
@@ -191,53 +189,48 @@ shown in blue. &lt;code&gt;tpchgen&lt;/code&gt; 
restricted to a single core is s
 &lt;code&gt;dbgen&lt;/code&gt; is shown in green and 
&lt;code&gt;dbgen&lt;/code&gt; modified to use &lt;code&gt;-O3&lt;/code&gt; 
optimization level
 is shown in yellow.&lt;/p&gt;
 &lt;p&gt;&lt;code&gt;dbgen&lt;/code&gt; is so inconvenient and takes so long 
that vendors often provide
-pre-loaded TPCH data, for example &lt;a 
href="https://docs.snowflake.com/en/user-guide/sample-data-tpch"&gt;Snowflake 
Sample
-Data&lt;/a&gt;, &lt;a 
href="https://docs.databricks.com/aws/en/discover/databricks-datasets"&gt;DataBricks
-Sample
-datasets&lt;/a&gt; and
-&lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets"&gt;DuckDB
 Pre-Generated Data
-Sets&lt;/a&gt;.&lt;/p&gt;
-&lt;p&gt;In addition to pre-generated datasets, DuckDB also provides a
-&lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html"&gt;TPCH&lt;/a&gt; 
extension for
-generating TPCH datasets within DuckDB. This is so much easier to use than the
-current alternatives that it leads many researchers and other thought leaders 
to
-use DuckDB to evaluate new ideas. For example, &lt;a 
href="https://github.com/lmwnshn"&gt;Wan Shen
-Lim&lt;/a&gt; explicitly &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;mentioned the ease 
of creating the
-TPCH dataset&lt;/a&gt; as one reason
-the first student project of &lt;a 
href="https://15799.courses.cs.cmu.edu/spring2025/"&gt;CMU-799 Spring
-2025&lt;/a&gt; used DuckDB.&lt;/p&gt;
-&lt;p&gt;As beneficial as the DuckDB TPCH extension is, it is non ideal for 
several reasons:&lt;/p&gt;
+pre-loaded TPCH data, for example [Snowflake Sample Data], &lt;a 
href="https://docs.databricks.com/aws/en/discover/databricks-datasets"&gt;DataBricks
 Sample
+datasets&lt;/a&gt; and &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets"&gt;DuckDB
 Pre-Generated Data Sets&lt;/a&gt;.&lt;/p&gt;
+&lt;p&gt;In addition to pre-generated datasets, DuckDB also provides a &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html"&gt;TPCH 
extension&lt;/a&gt; 
+for generating TPCH datasets within DuckDB. This is so much easier to use than
+the current alternatives that it leads many researchers and other thought
+leaders to use DuckDB to evaluate new ideas. For example, &lt;a 
href="https://github.com/lmwnshn"&gt;Wan Shen
+Lim&lt;/a&gt;explicitly &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;mentioned the ease 
of creating the TPCH dataset&lt;/a&gt; as one reason
+the first student project of &lt;a 
href="https://15799.courses.cs.cmu.edu/spring2025/"&gt;CMU-799 Spring 
2025&lt;/a&gt; used DuckDB.&lt;/p&gt;
+&lt;p&gt;As beneficial as the DuckDB TPCH extension is, it is non-ideal for 
several reasons:&lt;/p&gt;
 &lt;ol&gt;
 &lt;li&gt;Creates data in a proprietary format, which requires export to use 
in other systems.&lt;/li&gt;
 &lt;li&gt;Requires significant time (e.g. 17 minutes for Scale Factor 
10).&lt;/li&gt;
 &lt;li&gt;Requires unnecessarily large amounts of memory (e.g. 71 GB for Scale 
Factor 10)&lt;/li&gt;
 &lt;/ol&gt;
-&lt;p&gt;The above limitations makes it impractical to generate Scale Factor 
100 and above on laptops or standard workstations, though DuckDB offers&lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets"&gt;
 pre-computed files&lt;/a&gt; for larger factors[^3].&lt;/p&gt;
+&lt;p&gt;The above limitations makes it impractical to generate Scale Factor 
100 and
+above on laptops or standard workstations, though DuckDB offers &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets"&gt;pre-computed
+files&lt;/a&gt; for larger factors[^3].&lt;/p&gt;
 &lt;h1&gt;Why Rust?&lt;/h1&gt;
-&lt;p&gt;Realistically we used Rust because we wanted to integrate the data 
generator into &lt;a href="https://datafusion.apache.org/"&gt;Apache 
DataFusion&lt;/a&gt; and &lt;a 
href="https://glaredb.com/"&gt;GlareDB&lt;/a&gt;. However, we also believe Rust 
is superior to C/C++ due to its comparable performance, but much higher 
programmer productivity (Figure 4). Productivity in this case refers to the 
ease of optimizing and adding multithreading without introducing hard to debug 
memory sa [...]
-&lt;p&gt;While Rust does allow unsafe access to memory (eliding bounds 
checking, for example), when required for performance, our implementation is 
entirely memory safe. The only &lt;a 
href="https://github.com/search?q=repo%3Aclflushopt%2Ftpchgen-rs%20unsafe&amp;amp;type=code"&gt;unsafe&lt;/a&gt;
 code is used to &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/blob/c651da1fc309f9cb3872cbdf71e4796904dc62c6/tpchgen/src/text.rs#L72"&gt;skip&lt;/a&gt;
 UTF8 validation on known ASCII strin [...]
+&lt;p&gt;Realistically we used Rust because we wanted to integrate the data 
generator
+into &lt;a href="https://datafusion.apache.org/"&gt;Apache 
DataFusion&lt;/a&gt; and &lt;a 
href="https://glaredb.com/"&gt;GlareDB&lt;/a&gt;. However, we also believe Rust 
is
+superior to C/C++ due to its comparable performance, but much higher programmer
+productivity (Figure 4). Productivity in this case refers to the ease of
+optimizing and adding multithreading without introducing hard to debug memory
+safety or concurrency issues.&lt;/p&gt;
+&lt;p&gt;While Rust does allow unsafe access to memory (eliding bounds 
checking, for
+example), when required for performance, our implementation is entirely memory
+safe. The only &lt;a 
href="https://github.com/search?q=repo%3Aclflushopt%2Ftpchgen-rs%20unsafe&amp;amp;type=code"&gt;unsafe&lt;/a&gt;
 code is used to &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/blob/c651da1fc309f9cb3872cbdf71e4796904dc62c6/tpchgen/src/text.rs#L72"&gt;skip&lt;/a&gt;
 UTF8 validation on known ASCII
+strings.&lt;/p&gt;
 &lt;p&gt;&lt;img alt="Lamb Theory on Evolution of Systems Languages" 
class="img-responsive" 
src="/blog/images/fastest-tpch-generator/lamb-theory.png" 
width="80%"/&gt;&lt;/p&gt;
 &lt;p&gt;&lt;strong&gt;Figure 4&lt;/strong&gt;: Lamb Theory of System Language 
Evolution from &lt;a 
href="https://midas.bu.edu/assets/slides/andrew_lamb_slides.pdf"&gt;Boston 
University
 MiDAS Fall 2024 (Data Systems Seminar)&lt;/a&gt;, &lt;a 
href="https://www.youtube.com/watch?v=CpnxuBwHbUc"&gt;recording&lt;/a&gt;. 
Special
 thanks to &lt;a 
href="https://x.com/KurtFehlhauer"&gt;@KurtFehlhauer&lt;/a&gt;&lt;/p&gt;
 &lt;h1&gt;How: The Journey&lt;/h1&gt;
 &lt;p&gt;We did it together as a team in the open over the course of a few 
weeks weeks.
-&lt;a href="https://github.com/lmwnshn"&gt;Wan Shen Lim&lt;/a&gt; inspired the 
project by pointing out
-the benefits of &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;easy TPCH dataset
-creation&lt;/a&gt;  and &lt;a 
href="https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600"&gt;suggesting
 we
-check out a Java port on February 11,
-2025&lt;/a&gt;.
-Achraf made &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/53d3402680422a15349ece0a7ea3c3f001018ba0"&gt;first
 commit a few days
-later&lt;/a&gt;
-on February 16, and &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/9bb386a4c55b8cf93ffac1b98f29b5da990ee79e"&gt;Andrew
 and Sean started helping on March 8,
-2025&lt;/a&gt;
-and we &lt;a href="https://crates.io/crates/tpchgen/0.1.0"&gt;released version 
0.1&lt;/a&gt; on March
-30, 2025.&lt;/p&gt;
+&lt;a href="https://github.com/lmwnshn"&gt;Wan Shen Lim&lt;/a&gt; inspired the 
project by pointing out the benefits of &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;easy TPCH
+dataset creation&lt;/a&gt;  and &lt;a 
href="https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600"&gt;suggesting
 we check out a Java port on February 11,
+2025&lt;/a&gt;. Achraf made &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/53d3402680422a15349ece0a7ea3c3f001018ba0"&gt;first
 commit a few days later&lt;/a&gt; on February 16, and &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/9bb386a4c55b8cf93ffac1b98f29b5da990ee79e"&gt;Andrew
+and Sean started helping on March 8, 2025&lt;/a&gt; and we &lt;a 
href="https://crates.io/crates/tpchgen/0.1.0"&gt;released version 0.1&lt;/a&gt; 
on
+March 30, 2025.&lt;/p&gt;
 &lt;h2&gt;Optimizing Single Threaded Performance&lt;/h2&gt;
-&lt;p&gt;Archaf &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/16"&gt;completed the end to 
end conformance
-tests&lt;/a&gt;, to ensure correctness,
-and an initial &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/12"&gt;cli check 
in&lt;/a&gt;
-on March 15, 2025.&lt;/p&gt;
+&lt;p&gt;Archaf &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/16"&gt;completed the end to 
end conformance tests&lt;/a&gt;, to ensure correctness, and
+an initial &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/12"&gt;cli check 
in&lt;/a&gt; on March 15, 2025.&lt;/p&gt;
 &lt;p&gt;On a Macbook Pro M3 (Nov 2023), the initial performance numbers were 
actually
 slower than the original Java implementation which was ported 😭. This 
wasn&amp;rsquo;t
 surprising since the focus of the first version was to get a byte of byte
@@ -269,8 +262,7 @@ approach them.&lt;/p&gt;
    &lt;/td&gt;
 &lt;/tr&gt;
 &lt;/table&gt;
-&lt;p&gt;&lt;strong&gt;Table 2&lt;/strong&gt;: Performance of running &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/12"&gt;the initial
-tpchgen-cli&lt;/a&gt;, measured with
+&lt;p&gt;&lt;strong&gt;Table 2&lt;/strong&gt;: Performance of running &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/12"&gt;the initial 
tpchgen-cli&lt;/a&gt;, measured with
 &lt;code&gt;time target/release/tpchgen-cli -s 
$SCALE_FACTOR&lt;/code&gt;&lt;/p&gt;
 &lt;p&gt;With this strong foundation we began optimizing the code using 
Rust&amp;rsquo;s low level
 memory management to improve performance while retaining memory safely. We 
spent
@@ -319,15 +311,11 @@ list of optimizations:&lt;/p&gt;
    &lt;/td&gt;
 &lt;/tr&gt;
 &lt;/table&gt;
-&lt;p&gt;&lt;strong&gt;Table 3&lt;/strong&gt;: Current single threaded  
tpchgen-cli, measured with &lt;code&gt;time target/release/tpchgen-cli -s 
$SCALE_FACTOR --num-threads=1&lt;/code&gt;&lt;/p&gt;
+&lt;p&gt;&lt;strong&gt;Table 3&lt;/strong&gt;: Single threaded 
&lt;code&gt;tpchgen-cli&lt;/code&gt; performance, measured with 
&lt;code&gt;time target/release/tpchgen-cli -s $SCALE_FACTOR 
--num-threads=1&lt;/code&gt;&lt;/p&gt;
 &lt;h2&gt;Multi-threading&lt;/h2&gt;
-&lt;p&gt;Then we applied &lt;a 
href="https://doc.rust-lang.org/book/ch16-00-concurrency.html"&gt;Rust&amp;rsquo;s
 fearless
-concurrency&lt;/a&gt; &amp;ndash; with a
-single, &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/ab720a70cdc80a711f4a3dda6bac05445106f499"&gt;small
 PR (272 net new lines)
-&lt;/a&gt;we
-updated the same memory safe code to run with multiple threads and consume
-bounded memory using &lt;a 
href="https://thenewstack.io/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/"&gt;tokio
 for the thread
-scheduler&lt;/a&gt;[^4].&lt;/p&gt;
+&lt;p&gt;Then we applied &lt;a 
href="https://doc.rust-lang.org/book/ch16-00-concurrency.html"&gt;Rust&amp;rsquo;s
 fearless concurrency&lt;/a&gt; &amp;ndash; with a single, &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/ab720a70cdc80a711f4a3dda6bac05445106f499"&gt;small
 PR&lt;/a&gt; (272
+net new lines) we updated the same memory safe code to run with multiple 
threads
+and consume bounded memory using &lt;a 
href="https://thenewstack.io/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/"&gt;tokio
 for the thread scheduler&lt;/a&gt;)[^4].&lt;/p&gt;
 &lt;p&gt;As shown in Table 4, with this change, tpchgen-cli generates the full 
SF=100
 dataset in 32 seconds (which is 3.3 GB/sec 🤯). Further investigation reveals
 that at SF=100 our generator is actually IO bound (which is not the case for
@@ -387,18 +375,36 @@ When writing to &lt;code&gt;/dev/null&lt;/code&gt; 
tpchgen  generates the entire
 &lt;/tr&gt;
 &lt;/table&gt;
 &lt;p&gt;&lt;strong&gt;Table 4&lt;/strong&gt;: tpchgen-cli (multithreaded) 
performance measured with &lt;code&gt;time target/release/tpchgen-cli -s 
$SCALE_FACTOR&lt;/code&gt;&lt;/p&gt;
-&lt;p&gt;Using Rust and async streams, the data generator is also fully 
streaming: memory use does not increase with increasing data size / scale 
factors[^5]. The DuckDB generator seems to &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator"&gt;require
 far more memory&lt;/a&gt; than is commonly available on developer laptops and 
memory use increases with scale factor. With 
&lt;code&gt;tpchgen-cli&lt;/code&gt; it is perfectly possible to cr [...]
+&lt;p&gt;Using Rust and async streams, the data generator is also fully 
streaming: memory
+use does not increase with increasing data size / scale factors[^5]. The DuckDB
+generator seems to &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator"&gt;require
 far more memory&lt;/a&gt; than is commonly available on
+developer laptops and memory use increases with scale factor. With 
&lt;code&gt;tpchgen-cli&lt;/code&gt;
+it is perfectly possible to create data for SF=10000 or larger on a machine 
with
+16GB of memory (assuming sufficient storage capacity).&lt;/p&gt;
 &lt;h2&gt;Direct to parquet&lt;/h2&gt;
-&lt;p&gt;At this point, tpchgen-cli could very quickly generate the TBL 
format. However, as described above, the TBL is annoying to work with, 
because&lt;/p&gt;
+&lt;p&gt;At this point, &lt;code&gt;tpchgen-cli&lt;/code&gt; could very 
quickly generate the TBL format.
+However, as described above, the TBL is annoying to work with, 
because&lt;/p&gt;
 &lt;ol&gt;
 &lt;li&gt;It has no header&lt;/li&gt;
 &lt;li&gt;It is like a CSV but the delimiter is &amp;lsquo;|`&lt;/li&gt;
 &lt;li&gt;Each line ends with an extra &lt;code&gt;|&lt;/code&gt; delimiter 
before the newline 🙄&lt;/li&gt;
 &lt;li&gt;No system that we know can read them without additional 
configuration.&lt;/li&gt;
 &lt;/ol&gt;
-&lt;p&gt;We next &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/54"&gt;added support for 
CSV&lt;/a&gt; generation (special thanks &lt;a 
href="https://github.com/niebayes"&gt;niebayes&lt;/a&gt; from Datalayers for 
finding and &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/66"&gt;fixing&lt;/a&gt; 
&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/issues/73"&gt;several&lt;/a&gt; 
&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/issues/65"&gt;bugs&lt;/a&gt;) 
which  [...]
-&lt;p&gt;What would be far more useful is a typed, efficient columnar format 
such as Apache Parquet which is supported by all modern query engines. So we 
&lt;a href="https://github.com/clflushopt/tpchgen-rs/pull/71"&gt;made&lt;/a&gt; 
a &lt;a 
href="https://crates.io/crates/tpchgen-arrow"&gt;tpchgen-arrow&lt;/a&gt; crate 
to create&lt;a href="https://arrow.apache.org/"&gt; Apache Arrow&lt;/a&gt; 
arrays directly and then &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/61"&gt;a smal [...]
-&lt;p&gt;This approach was simple, fast and scalable, as shown in Table 5. 
Even though creating Parquet files is significantly more computationally 
expensive than TBL or CSV, tpchgen-cli creates the full SF=100 parquet format 
dataset in less than 45 seconds.&lt;/p&gt;
+&lt;p&gt;We next &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/54"&gt;added support for 
CSV&lt;/a&gt; generation (special thanks &lt;a 
href="https://github.com/niebayes"&gt;@niebayes&lt;/a&gt; from
+Datalayers for finding and &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/issues/73"&gt;fixing&lt;/a&gt; 
&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/issues/65"&gt;bugs&lt;/a&gt;) 
which performs at the same
+speed as TBL. While CSV files are far more standard than TBL, they must still 
be
+parsed prior to load and automatic type inference may not deduce the types
+needed for the TPCH benchmarks (e.g. floating point vs Decimal).&lt;/p&gt;
+&lt;p&gt;What would be far more useful is a typed, efficient columnar format 
such as
+Apache Parquet which is supported by all modern query engines. So we &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/71"&gt;made&lt;/a&gt; a
+&lt;a href="https://crates.io/crates/tpchgen-arrow"&gt;tpchgen-arrow&lt;/a&gt; 
crate to create &lt;a href="https://arrow.apache.org/"&gt;Apache 
Arrow&lt;/a&gt; arrays directly and then &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/61"&gt;a small
+300 line PR&lt;/a&gt; to feed those arrays to the &lt;a 
href="https://crates.io/crates/parquet"&gt;Rust Parquet writer&lt;/a&gt;, again 
using
+tokio for parallelized but memory bound work.&lt;/p&gt;
+&lt;p&gt;This approach was simple, fast and scalable, as shown in Table 5. 
Even though
+creating Parquet files is significantly more computationally expensive than TBL
+or CSV, tpchgen-cli creates the full SF=100 parquet format dataset in less than
+45 seconds.&lt;/p&gt;
+&lt;p&gt;&lt;a href="https://arrow.apache.org/"&gt;Apache 
Arrow&lt;/a&gt;(https://arrow.apache.org/&lt;/p&gt;
 &lt;table&gt;
 &lt;tr&gt;
 &lt;td&gt;Scale Factor
@@ -441,10 +447,19 @@ When writing to &lt;code&gt;/dev/null&lt;/code&gt; 
tpchgen  generates the entire
    &lt;/td&gt;
 &lt;/tr&gt;
 &lt;/table&gt;
-&lt;p&gt;&lt;strong&gt;Table 5&lt;/strong&gt;: tpchgen-cli parquet generation 
performance measured with  &lt;code&gt;time target/release/tpchgen-cli -s 
$SCALE_FACTOR --format=parquet&lt;/code&gt;&lt;/p&gt;
+&lt;p&gt;&lt;strong&gt;Table 5&lt;/strong&gt;: 
&lt;code&gt;tpchgen-cli&lt;/code&gt; Parquet generation performance measured 
with  &lt;code&gt;time
+target/release/tpchgen-cli -s $SCALE_FACTOR 
--format=parquet&lt;/code&gt;&lt;/p&gt;
 &lt;h1&gt;Conclusion 👊🎤&lt;/h1&gt;
-&lt;p&gt;With a few days, some fellow database nerds, and the power of Rust we 
made something 10x better than currently exists. We hope it inspires more 
research into analytical systems using the TPCH dataset and that people build 
awesome things with it. For example, Sean has already added &lt;a 
href="https://github.com/GlareDB/glaredb/pull/3549"&gt;on-demand generation of 
tables to GlareDB&lt;/a&gt;. Please consider joining us and helping out at 
&lt;a href="https://github.com/clflushopt [...]
-&lt;p&gt;We met while working together on Apache DataFusion in various 
capacities. If you are looking for a community of like minded people hacking on 
databases, we welcome you to &lt;a 
href="https://datafusion.apache.org/contributor-guide/communication.html"&gt;come
 join us&lt;/a&gt;. We are in the process of integrating this into DataFusion 
(see &lt;a 
href="https://github.com/apache/datafusion/issues/14608"&gt;apache/datafusion#14608&lt;/a&gt;)
 if you are interested in helping 🎣&lt;/p&gt;
+&lt;p&gt;With a few days, some fellow database nerds, and the power of Rust we 
made
+something 10x better than currently exists. We hope it inspires more research
+into analytical systems using the TPCH dataset and that people build awesome
+things with it. For example, Sean has already added &lt;a 
href="https://github.com/GlareDB/glaredb/pull/3549"&gt;on-demand generation of
+tables to GlareDB&lt;/a&gt;. Please consider joining us and helping out at
+&lt;a 
href="https://github.com/clflushopt/tpchgen-rs"&gt;https://github.com/clflushopt/tpchgen-rs&lt;/a&gt;.&lt;/p&gt;
+&lt;p&gt;We met while working together on Apache DataFusion in various 
capacities. If you
+are looking for a community of like minded people hacking on databases, we
+welcome you to &lt;a 
href="https://datafusion.apache.org/contributor-guide/communication.html"&gt;come
 join us&lt;/a&gt;. We are in the process of integrating this into
+DataFusion (see &lt;a 
href="https://github.com/apache/datafusion/issues/14608"&gt;apache/datafusion#14608&lt;/a&gt;)
 if you are interested in helping 🎣&lt;/p&gt;
 &lt;!-- Footnotes themselves at the bottom. --&gt;
 &lt;h2&gt;Notes&lt;/h2&gt;
 &lt;p&gt;[^2]: It is possible to embed the dbgen code, which appears to be the 
approach taken by DuckDB. This approach was tried in GlareDB (&lt;a 
href="https://github.com/GlareDB/glaredb/pull/3313"&gt;GlareDB/glaredb#3313&lt;/a&gt;),
 but ultimately shelved given the amount of effort needed to adapt and isolate 
the dbgen code.&lt;/p&gt;
diff --git a/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.rss.xml 
b/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.rss.xml
index e2f8309..4d9df7a 100644
--- a/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.rss.xml
+++ b/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.rss.xml
@@ -1,5 +1,5 @@
 <?xml version="1.0" encoding="utf-8"?>
-<rss version="2.0"><channel><title>Apache DataFusion Blog - Andrew Lamb, 
Achraf B, and Sean 
Smith</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Thu,
 10 Apr 2025 00:00:00 +0000</lastBuildDate><item><title>tpchgen-rs: World’s 
fastest open source TPCH data generator, written in 
Rust</title><link>https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-generator</link><description>&lt;!--
+<rss version="2.0"><channel><title>Apache DataFusion Blog - Andrew Lamb, 
Achraf B, and Sean 
Smith</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Thu,
 10 Apr 2025 00:00:00 +0000</lastBuildDate><item><title>`tpchgen-rs` World’s 
fastest open source TPCH data generator, written in 
Rust</title><link>https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-generator</link><description>&lt;!--
 {% comment %}
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
diff --git a/blog/feeds/blog.atom.xml b/blog/feeds/blog.atom.xml
index 8b77af1..16fb5a4 100644
--- a/blog/feeds/blog.atom.xml
+++ b/blog/feeds/blog.atom.xml
@@ -1,5 +1,5 @@
 <?xml version="1.0" encoding="utf-8"?>
-<feed xmlns="http://www.w3.org/2005/Atom";><title>Apache DataFusion Blog - 
blog</title><link href="https://datafusion.apache.org/blog/"; 
rel="alternate"></link><link 
href="https://datafusion.apache.org/blog/feeds/blog.atom.xml"; 
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-04-10T00:00:00+00:00</updated><subtitle></subtitle><entry><title>tpchgen-rs:
 World’s fastest open source TPCH data generator, written in Rust</title><link 
href="https://datafusion.apache.org [...]
+<feed xmlns="http://www.w3.org/2005/Atom";><title>Apache DataFusion Blog - 
blog</title><link href="https://datafusion.apache.org/blog/"; 
rel="alternate"></link><link 
href="https://datafusion.apache.org/blog/feeds/blog.atom.xml"; 
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-04-10T00:00:00+00:00</updated><subtitle></subtitle><entry><title>`tpchgen-rs`
 World’s fastest open source TPCH data generator, written in Rust</title><link 
href="https://datafusion.apache.or [...]
 {% comment %}
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
@@ -57,7 +57,7 @@ th, td {
 &lt;p&gt;We used Rust and open source development to build &lt;a 
href="https://github.com/alamb/tpchgen-rs"&gt;tpchgen-rs&lt;/a&gt;, a fully open
 TPCH data generator over 10x faster than any other implementation  we know 
of.&lt;/p&gt;
 &lt;p&gt;About the Authors:
-- &lt;a href="https://www.linkedin.com/in/andrewalamb/"&gt;Andrew 
Lamb&lt;/a&gt; (&lt;a href="https://github.com/alamb"&gt;@alamb&lt;/a&gt;) is a 
Staff Engineer at &lt;a 
href="https://www.influxdata.com/"&gt;InfluxData&lt;/a&gt;) and a PMC member of 
&lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; and 
&lt;a href="https://arrow.apache.org/"&gt;Apache Arrow&lt;/a&gt;.
+- &lt;a href="https://www.linkedin.com/in/andrewalamb/"&gt;Andrew 
Lamb&lt;/a&gt; (&lt;a href="https://github.com/alamb"&gt;@alamb&lt;/a&gt;) is a 
Staff Engineer at &lt;a 
href="https://www.influxdata.com/"&gt;InfluxData&lt;/a&gt; and a PMC member of 
&lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; and 
&lt;a href="https://arrow.apache.org/"&gt;Apache Arrow&lt;/a&gt;.
 - Achraf B (&lt;a 
href="https://github.com/clflushopt"&gt;@clflushopt&lt;/a&gt;) is a Software 
Engineer at &lt;a href="https://optable.co/"&gt;Optable&lt;/a&gt; where he 
works on data infrastructure.
 - &lt;a href="https://www.linkedin.com/in/scsmithr/"&gt;Sean Smith&lt;/a&gt; 
(&lt;a href="https://github.com/scsmithr"&gt;@scsmithr&lt;/a&gt;) is the 
founder of  focused on building a fast analytics database.&lt;/p&gt;
 &lt;p&gt;It is now possible to create the TPCH SF=100 dataset in 72.23 seconds 
(1.4 GB/s
@@ -70,16 +70,14 @@ analytical engines such as DataFusion.&lt;/p&gt;
 &lt;p&gt;&lt;strong&gt;Figure 1&lt;/strong&gt;: Time to create TPCH dataset 
for Scale Factor (see below) 1, 10,
 100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core 
GCP
 VM. For Scale Factor(SF) 100 &lt;code&gt;tpchgen&lt;/code&gt; takes 1 minute 
and 14 seconds and
-&lt;a href="https://duckdb.org/"&gt;DuckDB&lt;/a&gt; takes 17 minutes and 48 
seconds. For SF=1000,
-&lt;code&gt;tpchgen&lt;/code&gt; takes 10 minutes and 26 and uses about 5 GB 
of RAM at peak, and we
-could not measure DuckDB&amp;rsquo;s time as it &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator"&gt;requires
 647 GB of
-RAM&lt;/a&gt;,
-more than the 88 GB that was available on our test machine. The testing
-methodology is in the
+&lt;a href="https://duckdb.org"&gt;DuckDB&lt;/a&gt; takes 17 minutes and 48 
seconds. For SF=1000, &lt;code&gt;tpchgen&lt;/code&gt; takes 10
+minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure
+DuckDB&amp;rsquo;s time as it &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator"&gt;requires
 647 GB of RAM&lt;/a&gt;, more than the 88 GB that was
+available on our test machine. The testing methodology is in the
 &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
 &lt;p&gt;This blog explains what TPCH is, how we ported the vintage C data 
generator to
 Rust (yes, &lt;a 
href="https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/"&gt;RWIR&lt;/a&gt;)
 and optimized its performance over the course of a few weeks
-of part time work. We began this project so we can easily generate TPCH data in
+of part-time work. We began this project so we can easily generate TPCH data in
 &lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; and 
&lt;a href="https://glaredb.com/"&gt;GlareDB&lt;/a&gt;.&lt;/p&gt;
 &lt;h1&gt;Try if for yourself&lt;/h1&gt;
 &lt;p&gt;The tool is entirely open source under the &lt;a 
href="https://www.apache.org/licenses/LICENSE-2.0"&gt;Apache 2.0 
license&lt;/a&gt;. Visit the &lt;a 
href="https://github.com/clflushopt/tpchgen-rs"&gt;tpchgen-rs 
repository&lt;/a&gt; or try it for yourself by run the following commands after 
&lt;a href="https://www.rust-lang.org/tools/install"&gt;installing 
Rust&lt;/a&gt;:&lt;/p&gt;
@@ -92,33 +90,31 @@ $ tpchgen-cli -s &lt;span class="m"&gt;1&lt;/span&gt;
 $ tpchgen-cli -s &lt;span class="m"&gt;10&lt;/span&gt; --format&lt;span 
class="o"&gt;=&lt;/span&gt;parquet
 &lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
 &lt;h1&gt;What is TPCH / dbgen?&lt;/h1&gt;
-&lt;p&gt;The popular &lt;a href="https://www.tpc.org/tpch/"&gt;TPC-H&lt;/a&gt; 
benchmark (commonly referred to
-as TPCH) helps evaluate the performance of database systems on
-&lt;a 
href="https://en.wikipedia.org/wiki/Online_analytical_processing"&gt;OLAP&lt;/a&gt;
 queries&lt;em&gt;,
-&lt;/em&gt;the kind used to build BI dashboards.&lt;/p&gt;
+&lt;p&gt;The popular &lt;a href="https://www.tpc.org/tpch/"&gt;TPC-H&lt;/a&gt; 
benchmark (commonly referred to as TPCH) helps evaluate the
+performance of database systems on &lt;a 
href="https://en.wikipedia.org/wiki/Online_analytical_processing"&gt;OLAP&lt;/a&gt;
 queries&lt;em&gt;, &lt;/em&gt;the kind used to build BI
+dashboards.&lt;/p&gt;
 &lt;p&gt;TPCH has become a de facto standard for analytic systems. While there 
are &lt;a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf"&gt;well
-known&lt;/a&gt; limitations as the data
-and queries do not well represent many real world use cases, the majority of
-analytic database papers and industrial systems still use TPCH query 
performance
-benchmarks as a baseline. You will inevitably find multiple results for  
&amp;ldquo;&lt;code&gt;TPCH
-Performance &amp;amp;lt;your favorite database&amp;gt;&lt;/code&gt;&amp;rdquo; 
in any search engine.&lt;/p&gt;
+known&lt;/a&gt; limitations as the data and queries do not well represent many 
real world
+use cases, the majority of analytic database papers and industrial systems 
still
+use TPCH query performance benchmarks as a baseline. You will inevitably find
+multiple results for  &amp;ldquo;&lt;code&gt;TPCH Performance &amp;amp;lt;your 
favorite database&amp;gt;&lt;/code&gt;&amp;rdquo; in any
+search engine.&lt;/p&gt;
 &lt;p&gt;The benchmark was created at a time when access to high performance 
analytical
-systems was not widespread, so the &lt;a 
href="https://www.tpc.org/"&gt;Transaction Processing Performance
-Council&lt;/a&gt; defined a process of formal result verification.
-More recently, given the broad availability of free and open source database
-systems, it is common for users to run and verify TPCH performance 
themselves.&lt;/p&gt;
+systems was not widespread, so the &lt;a 
href="https://www.tpc.org/"&gt;Transaction Processing Performance 
Council&lt;/a&gt;
+defined a process of formal result verification. More recently, given the broad
+availability of free and open source database systems, it is common for users 
to
+run and verify TPCH performance themselves.&lt;/p&gt;
 &lt;p&gt;TPCH simulates a business environment with eight tables: 
&lt;code&gt;REGION&lt;/code&gt;, &lt;code&gt;NATION&lt;/code&gt;,
 &lt;code&gt;SUPPLIER&lt;/code&gt;, &lt;code&gt;CUSTOMER&lt;/code&gt;, 
&lt;code&gt;PART&lt;/code&gt;, &lt;code&gt;PARTSUPP&lt;/code&gt;, 
&lt;code&gt;ORDERS&lt;/code&gt;, and &lt;code&gt;LINEITEM&lt;/code&gt;. These
 tables are linked by foreign keys in a normalized schema representing a supply
 chain with parts, suppliers, customers and orders. The benchmark itself is 22
 SQL queries containing joins, aggregations, and sorting operations.&lt;/p&gt;
-&lt;p&gt;The queries run against data created with
-&lt;code&gt;&lt;a 
href="https://github.com/electrum/tpch-dbgen"&gt;dbgen&lt;/a&gt;&lt;/code&gt;, 
a program written
-in a pre &lt;a href="https://en.wikipedia.org/wiki/C99"&gt;C-99&lt;/a&gt; 
dialect, which generates data
-in a format called &lt;em&gt;TBL&lt;/em&gt; (example in Figure 2). 
&lt;code&gt;dbgen&lt;/code&gt; creates data for each of
-the 8 tables for a certain &lt;em&gt;Scale Factor&lt;/em&gt;, commonly 
abbreviated as SF. Example
-Scale Factors and corresponding dataset sizes are shown in Table 1. There is no
-theoretical upper bound on the Scale Factor.&lt;/p&gt;
+&lt;p&gt;The queries run against data created with &lt;code&gt;&lt;a 
href="https://github.com/electrum/tpch-dbgen)"&gt;dbgen&lt;/a&gt;&lt;/code&gt;, 
a program
+written in a pre &lt;a 
href="https://en.wikipedia.org/wiki/C99"&gt;C-99&lt;/a&gt; dialect, which 
generates data in a format called &lt;em&gt;TBL&lt;/em&gt;
+(example in Figure 2). &lt;code&gt;dbgen&lt;/code&gt; creates data for each of 
the 8 tables for a
+certain &lt;em&gt;Scale Factor&lt;/em&gt;, commonly abbreviated as SF. Example 
Scale Factors and
+corresponding dataset sizes are shown in Table 1. There is no theoretical upper
+bound on the Scale Factor.&lt;/p&gt;
 &lt;div 
class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;103|2844|845|3|23|40177.32|0.01|0.04|N|O|1996-09-11|1996-09-18|1996-09-26|NONE|FOB|ironic
 accou|
 
229|10540|801|6|29|42065.66|0.04|0.00|R|F|1994-01-14|1994-02-16|1994-01-22|NONE|FOB|uriously
 pending |
 
263|2396|649|1|22|28564.58|0.06|0.08|R|F|1994-08-24|1994-06-20|1994-09-09|NONE|FOB|efully
 express fo|
@@ -178,7 +174,9 @@ theoretical upper bound on the Scale Factor.&lt;/p&gt;
 &lt;/table&gt;
 &lt;p&gt;&lt;strong&gt;Table 1&lt;/strong&gt;: TPCH data set sizes at 
different scale factors for both TBL and &lt;a 
href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt;.&lt;/p&gt;
 &lt;h1&gt;Why do we need a new TPCH Data generator?&lt;/h1&gt;
-&lt;p&gt;Despite the known limitations of the TPCH benchmark, it is so well 
known that it is used frequently in database performance analysis. To run TPCH, 
you must first load the data, using &lt;code&gt;dbgen&lt;/code&gt;, which is 
not ideal for several reasons:&lt;/p&gt;
+&lt;p&gt;Despite the known limitations of the TPCH benchmark, it is so well 
known that it
+is used frequently in database performance analysis. To run TPCH, you must 
first
+load the data, using &lt;code&gt;dbgen&lt;/code&gt;, which is not ideal for 
several reasons:&lt;/p&gt;
 &lt;ol&gt;
 &lt;li&gt;You must find and compile a copy of the 15+ year old C program (for 
example &lt;a 
href="https://github.com/electrum/tpch-dbgen"&gt;electrum/tpch-dbgen&lt;/a&gt;)&lt;/li&gt;
 &lt;li&gt;&lt;code&gt;dbgen&lt;/code&gt; requires substantial time (Figure 3) 
and is not able to use more than one core.&lt;/li&gt;
@@ -191,53 +189,48 @@ shown in blue. &lt;code&gt;tpchgen&lt;/code&gt; 
restricted to a single core is s
 &lt;code&gt;dbgen&lt;/code&gt; is shown in green and 
&lt;code&gt;dbgen&lt;/code&gt; modified to use &lt;code&gt;-O3&lt;/code&gt; 
optimization level
 is shown in yellow.&lt;/p&gt;
 &lt;p&gt;&lt;code&gt;dbgen&lt;/code&gt; is so inconvenient and takes so long 
that vendors often provide
-pre-loaded TPCH data, for example &lt;a 
href="https://docs.snowflake.com/en/user-guide/sample-data-tpch"&gt;Snowflake 
Sample
-Data&lt;/a&gt;, &lt;a 
href="https://docs.databricks.com/aws/en/discover/databricks-datasets"&gt;DataBricks
-Sample
-datasets&lt;/a&gt; and
-&lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets"&gt;DuckDB
 Pre-Generated Data
-Sets&lt;/a&gt;.&lt;/p&gt;
-&lt;p&gt;In addition to pre-generated datasets, DuckDB also provides a
-&lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html"&gt;TPCH&lt;/a&gt; 
extension for
-generating TPCH datasets within DuckDB. This is so much easier to use than the
-current alternatives that it leads many researchers and other thought leaders 
to
-use DuckDB to evaluate new ideas. For example, &lt;a 
href="https://github.com/lmwnshn"&gt;Wan Shen
-Lim&lt;/a&gt; explicitly &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;mentioned the ease 
of creating the
-TPCH dataset&lt;/a&gt; as one reason
-the first student project of &lt;a 
href="https://15799.courses.cs.cmu.edu/spring2025/"&gt;CMU-799 Spring
-2025&lt;/a&gt; used DuckDB.&lt;/p&gt;
-&lt;p&gt;As beneficial as the DuckDB TPCH extension is, it is non ideal for 
several reasons:&lt;/p&gt;
+pre-loaded TPCH data, for example [Snowflake Sample Data], &lt;a 
href="https://docs.databricks.com/aws/en/discover/databricks-datasets"&gt;DataBricks
 Sample
+datasets&lt;/a&gt; and &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets"&gt;DuckDB
 Pre-Generated Data Sets&lt;/a&gt;.&lt;/p&gt;
+&lt;p&gt;In addition to pre-generated datasets, DuckDB also provides a &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html"&gt;TPCH 
extension&lt;/a&gt; 
+for generating TPCH datasets within DuckDB. This is so much easier to use than
+the current alternatives that it leads many researchers and other thought
+leaders to use DuckDB to evaluate new ideas. For example, &lt;a 
href="https://github.com/lmwnshn"&gt;Wan Shen
+Lim&lt;/a&gt;explicitly &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;mentioned the ease 
of creating the TPCH dataset&lt;/a&gt; as one reason
+the first student project of &lt;a 
href="https://15799.courses.cs.cmu.edu/spring2025/"&gt;CMU-799 Spring 
2025&lt;/a&gt; used DuckDB.&lt;/p&gt;
+&lt;p&gt;As beneficial as the DuckDB TPCH extension is, it is non-ideal for 
several reasons:&lt;/p&gt;
 &lt;ol&gt;
 &lt;li&gt;Creates data in a proprietary format, which requires export to use 
in other systems.&lt;/li&gt;
 &lt;li&gt;Requires significant time (e.g. 17 minutes for Scale Factor 
10).&lt;/li&gt;
 &lt;li&gt;Requires unnecessarily large amounts of memory (e.g. 71 GB for Scale 
Factor 10)&lt;/li&gt;
 &lt;/ol&gt;
-&lt;p&gt;The above limitations makes it impractical to generate Scale Factor 
100 and above on laptops or standard workstations, though DuckDB offers&lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets"&gt;
 pre-computed files&lt;/a&gt; for larger factors[^3].&lt;/p&gt;
+&lt;p&gt;The above limitations makes it impractical to generate Scale Factor 
100 and
+above on laptops or standard workstations, though DuckDB offers &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets"&gt;pre-computed
+files&lt;/a&gt; for larger factors[^3].&lt;/p&gt;
 &lt;h1&gt;Why Rust?&lt;/h1&gt;
-&lt;p&gt;Realistically we used Rust because we wanted to integrate the data 
generator into &lt;a href="https://datafusion.apache.org/"&gt;Apache 
DataFusion&lt;/a&gt; and &lt;a 
href="https://glaredb.com/"&gt;GlareDB&lt;/a&gt;. However, we also believe Rust 
is superior to C/C++ due to its comparable performance, but much higher 
programmer productivity (Figure 4). Productivity in this case refers to the 
ease of optimizing and adding multithreading without introducing hard to debug 
memory sa [...]
-&lt;p&gt;While Rust does allow unsafe access to memory (eliding bounds 
checking, for example), when required for performance, our implementation is 
entirely memory safe. The only &lt;a 
href="https://github.com/search?q=repo%3Aclflushopt%2Ftpchgen-rs%20unsafe&amp;amp;type=code"&gt;unsafe&lt;/a&gt;
 code is used to &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/blob/c651da1fc309f9cb3872cbdf71e4796904dc62c6/tpchgen/src/text.rs#L72"&gt;skip&lt;/a&gt;
 UTF8 validation on known ASCII strin [...]
+&lt;p&gt;Realistically we used Rust because we wanted to integrate the data 
generator
+into &lt;a href="https://datafusion.apache.org/"&gt;Apache 
DataFusion&lt;/a&gt; and &lt;a 
href="https://glaredb.com/"&gt;GlareDB&lt;/a&gt;. However, we also believe Rust 
is
+superior to C/C++ due to its comparable performance, but much higher programmer
+productivity (Figure 4). Productivity in this case refers to the ease of
+optimizing and adding multithreading without introducing hard to debug memory
+safety or concurrency issues.&lt;/p&gt;
+&lt;p&gt;While Rust does allow unsafe access to memory (eliding bounds 
checking, for
+example), when required for performance, our implementation is entirely memory
+safe. The only &lt;a 
href="https://github.com/search?q=repo%3Aclflushopt%2Ftpchgen-rs%20unsafe&amp;amp;type=code"&gt;unsafe&lt;/a&gt;
 code is used to &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/blob/c651da1fc309f9cb3872cbdf71e4796904dc62c6/tpchgen/src/text.rs#L72"&gt;skip&lt;/a&gt;
 UTF8 validation on known ASCII
+strings.&lt;/p&gt;
 &lt;p&gt;&lt;img alt="Lamb Theory on Evolution of Systems Languages" 
class="img-responsive" 
src="/blog/images/fastest-tpch-generator/lamb-theory.png" 
width="80%"/&gt;&lt;/p&gt;
 &lt;p&gt;&lt;strong&gt;Figure 4&lt;/strong&gt;: Lamb Theory of System Language 
Evolution from &lt;a 
href="https://midas.bu.edu/assets/slides/andrew_lamb_slides.pdf"&gt;Boston 
University
 MiDAS Fall 2024 (Data Systems Seminar)&lt;/a&gt;, &lt;a 
href="https://www.youtube.com/watch?v=CpnxuBwHbUc"&gt;recording&lt;/a&gt;. 
Special
 thanks to &lt;a 
href="https://x.com/KurtFehlhauer"&gt;@KurtFehlhauer&lt;/a&gt;&lt;/p&gt;
 &lt;h1&gt;How: The Journey&lt;/h1&gt;
 &lt;p&gt;We did it together as a team in the open over the course of a few 
weeks weeks.
-&lt;a href="https://github.com/lmwnshn"&gt;Wan Shen Lim&lt;/a&gt; inspired the 
project by pointing out
-the benefits of &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;easy TPCH dataset
-creation&lt;/a&gt;  and &lt;a 
href="https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600"&gt;suggesting
 we
-check out a Java port on February 11,
-2025&lt;/a&gt;.
-Achraf made &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/53d3402680422a15349ece0a7ea3c3f001018ba0"&gt;first
 commit a few days
-later&lt;/a&gt;
-on February 16, and &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/9bb386a4c55b8cf93ffac1b98f29b5da990ee79e"&gt;Andrew
 and Sean started helping on March 8,
-2025&lt;/a&gt;
-and we &lt;a href="https://crates.io/crates/tpchgen/0.1.0"&gt;released version 
0.1&lt;/a&gt; on March
-30, 2025.&lt;/p&gt;
+&lt;a href="https://github.com/lmwnshn"&gt;Wan Shen Lim&lt;/a&gt; inspired the 
project by pointing out the benefits of &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;easy TPCH
+dataset creation&lt;/a&gt;  and &lt;a 
href="https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600"&gt;suggesting
 we check out a Java port on February 11,
+2025&lt;/a&gt;. Achraf made &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/53d3402680422a15349ece0a7ea3c3f001018ba0"&gt;first
 commit a few days later&lt;/a&gt; on February 16, and &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/9bb386a4c55b8cf93ffac1b98f29b5da990ee79e"&gt;Andrew
+and Sean started helping on March 8, 2025&lt;/a&gt; and we &lt;a 
href="https://crates.io/crates/tpchgen/0.1.0"&gt;released version 0.1&lt;/a&gt; 
on
+March 30, 2025.&lt;/p&gt;
 &lt;h2&gt;Optimizing Single Threaded Performance&lt;/h2&gt;
-&lt;p&gt;Archaf &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/16"&gt;completed the end to 
end conformance
-tests&lt;/a&gt;, to ensure correctness,
-and an initial &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/12"&gt;cli check 
in&lt;/a&gt;
-on March 15, 2025.&lt;/p&gt;
+&lt;p&gt;Archaf &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/16"&gt;completed the end to 
end conformance tests&lt;/a&gt;, to ensure correctness, and
+an initial &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/12"&gt;cli check 
in&lt;/a&gt; on March 15, 2025.&lt;/p&gt;
 &lt;p&gt;On a Macbook Pro M3 (Nov 2023), the initial performance numbers were 
actually
 slower than the original Java implementation which was ported 😭. This 
wasn&amp;rsquo;t
 surprising since the focus of the first version was to get a byte of byte
@@ -269,8 +262,7 @@ approach them.&lt;/p&gt;
    &lt;/td&gt;
 &lt;/tr&gt;
 &lt;/table&gt;
-&lt;p&gt;&lt;strong&gt;Table 2&lt;/strong&gt;: Performance of running &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/12"&gt;the initial
-tpchgen-cli&lt;/a&gt;, measured with
+&lt;p&gt;&lt;strong&gt;Table 2&lt;/strong&gt;: Performance of running &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/12"&gt;the initial 
tpchgen-cli&lt;/a&gt;, measured with
 &lt;code&gt;time target/release/tpchgen-cli -s 
$SCALE_FACTOR&lt;/code&gt;&lt;/p&gt;
 &lt;p&gt;With this strong foundation we began optimizing the code using 
Rust&amp;rsquo;s low level
 memory management to improve performance while retaining memory safely. We 
spent
@@ -319,15 +311,11 @@ list of optimizations:&lt;/p&gt;
    &lt;/td&gt;
 &lt;/tr&gt;
 &lt;/table&gt;
-&lt;p&gt;&lt;strong&gt;Table 3&lt;/strong&gt;: Current single threaded  
tpchgen-cli, measured with &lt;code&gt;time target/release/tpchgen-cli -s 
$SCALE_FACTOR --num-threads=1&lt;/code&gt;&lt;/p&gt;
+&lt;p&gt;&lt;strong&gt;Table 3&lt;/strong&gt;: Single threaded 
&lt;code&gt;tpchgen-cli&lt;/code&gt; performance, measured with 
&lt;code&gt;time target/release/tpchgen-cli -s $SCALE_FACTOR 
--num-threads=1&lt;/code&gt;&lt;/p&gt;
 &lt;h2&gt;Multi-threading&lt;/h2&gt;
-&lt;p&gt;Then we applied &lt;a 
href="https://doc.rust-lang.org/book/ch16-00-concurrency.html"&gt;Rust&amp;rsquo;s
 fearless
-concurrency&lt;/a&gt; &amp;ndash; with a
-single, &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/ab720a70cdc80a711f4a3dda6bac05445106f499"&gt;small
 PR (272 net new lines)
-&lt;/a&gt;we
-updated the same memory safe code to run with multiple threads and consume
-bounded memory using &lt;a 
href="https://thenewstack.io/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/"&gt;tokio
 for the thread
-scheduler&lt;/a&gt;[^4].&lt;/p&gt;
+&lt;p&gt;Then we applied &lt;a 
href="https://doc.rust-lang.org/book/ch16-00-concurrency.html"&gt;Rust&amp;rsquo;s
 fearless concurrency&lt;/a&gt; &amp;ndash; with a single, &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/ab720a70cdc80a711f4a3dda6bac05445106f499"&gt;small
 PR&lt;/a&gt; (272
+net new lines) we updated the same memory safe code to run with multiple 
threads
+and consume bounded memory using &lt;a 
href="https://thenewstack.io/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/"&gt;tokio
 for the thread scheduler&lt;/a&gt;)[^4].&lt;/p&gt;
 &lt;p&gt;As shown in Table 4, with this change, tpchgen-cli generates the full 
SF=100
 dataset in 32 seconds (which is 3.3 GB/sec 🤯). Further investigation reveals
 that at SF=100 our generator is actually IO bound (which is not the case for
@@ -387,18 +375,36 @@ When writing to &lt;code&gt;/dev/null&lt;/code&gt; 
tpchgen  generates the entire
 &lt;/tr&gt;
 &lt;/table&gt;
 &lt;p&gt;&lt;strong&gt;Table 4&lt;/strong&gt;: tpchgen-cli (multithreaded) 
performance measured with &lt;code&gt;time target/release/tpchgen-cli -s 
$SCALE_FACTOR&lt;/code&gt;&lt;/p&gt;
-&lt;p&gt;Using Rust and async streams, the data generator is also fully 
streaming: memory use does not increase with increasing data size / scale 
factors[^5]. The DuckDB generator seems to &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator"&gt;require
 far more memory&lt;/a&gt; than is commonly available on developer laptops and 
memory use increases with scale factor. With 
&lt;code&gt;tpchgen-cli&lt;/code&gt; it is perfectly possible to cr [...]
+&lt;p&gt;Using Rust and async streams, the data generator is also fully 
streaming: memory
+use does not increase with increasing data size / scale factors[^5]. The DuckDB
+generator seems to &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator"&gt;require
 far more memory&lt;/a&gt; than is commonly available on
+developer laptops and memory use increases with scale factor. With 
&lt;code&gt;tpchgen-cli&lt;/code&gt;
+it is perfectly possible to create data for SF=10000 or larger on a machine 
with
+16GB of memory (assuming sufficient storage capacity).&lt;/p&gt;
 &lt;h2&gt;Direct to parquet&lt;/h2&gt;
-&lt;p&gt;At this point, tpchgen-cli could very quickly generate the TBL 
format. However, as described above, the TBL is annoying to work with, 
because&lt;/p&gt;
+&lt;p&gt;At this point, &lt;code&gt;tpchgen-cli&lt;/code&gt; could very 
quickly generate the TBL format.
+However, as described above, the TBL is annoying to work with, 
because&lt;/p&gt;
 &lt;ol&gt;
 &lt;li&gt;It has no header&lt;/li&gt;
 &lt;li&gt;It is like a CSV but the delimiter is &amp;lsquo;|`&lt;/li&gt;
 &lt;li&gt;Each line ends with an extra &lt;code&gt;|&lt;/code&gt; delimiter 
before the newline 🙄&lt;/li&gt;
 &lt;li&gt;No system that we know can read them without additional 
configuration.&lt;/li&gt;
 &lt;/ol&gt;
-&lt;p&gt;We next &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/54"&gt;added support for 
CSV&lt;/a&gt; generation (special thanks &lt;a 
href="https://github.com/niebayes"&gt;niebayes&lt;/a&gt; from Datalayers for 
finding and &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/66"&gt;fixing&lt;/a&gt; 
&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/issues/73"&gt;several&lt;/a&gt; 
&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/issues/65"&gt;bugs&lt;/a&gt;) 
which  [...]
-&lt;p&gt;What would be far more useful is a typed, efficient columnar format 
such as Apache Parquet which is supported by all modern query engines. So we 
&lt;a href="https://github.com/clflushopt/tpchgen-rs/pull/71"&gt;made&lt;/a&gt; 
a &lt;a 
href="https://crates.io/crates/tpchgen-arrow"&gt;tpchgen-arrow&lt;/a&gt; crate 
to create&lt;a href="https://arrow.apache.org/"&gt; Apache Arrow&lt;/a&gt; 
arrays directly and then &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/61"&gt;a smal [...]
-&lt;p&gt;This approach was simple, fast and scalable, as shown in Table 5. 
Even though creating Parquet files is significantly more computationally 
expensive than TBL or CSV, tpchgen-cli creates the full SF=100 parquet format 
dataset in less than 45 seconds.&lt;/p&gt;
+&lt;p&gt;We next &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/54"&gt;added support for 
CSV&lt;/a&gt; generation (special thanks &lt;a 
href="https://github.com/niebayes"&gt;@niebayes&lt;/a&gt; from
+Datalayers for finding and &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/issues/73"&gt;fixing&lt;/a&gt; 
&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/issues/65"&gt;bugs&lt;/a&gt;) 
which performs at the same
+speed as TBL. While CSV files are far more standard than TBL, they must still 
be
+parsed prior to load and automatic type inference may not deduce the types
+needed for the TPCH benchmarks (e.g. floating point vs Decimal).&lt;/p&gt;
+&lt;p&gt;What would be far more useful is a typed, efficient columnar format 
such as
+Apache Parquet which is supported by all modern query engines. So we &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/71"&gt;made&lt;/a&gt; a
+&lt;a href="https://crates.io/crates/tpchgen-arrow"&gt;tpchgen-arrow&lt;/a&gt; 
crate to create &lt;a href="https://arrow.apache.org/"&gt;Apache 
Arrow&lt;/a&gt; arrays directly and then &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/61"&gt;a small
+300 line PR&lt;/a&gt; to feed those arrays to the &lt;a 
href="https://crates.io/crates/parquet"&gt;Rust Parquet writer&lt;/a&gt;, again 
using
+tokio for parallelized but memory bound work.&lt;/p&gt;
+&lt;p&gt;This approach was simple, fast and scalable, as shown in Table 5. 
Even though
+creating Parquet files is significantly more computationally expensive than TBL
+or CSV, tpchgen-cli creates the full SF=100 parquet format dataset in less than
+45 seconds.&lt;/p&gt;
+&lt;p&gt;&lt;a href="https://arrow.apache.org/"&gt;Apache 
Arrow&lt;/a&gt;(https://arrow.apache.org/&lt;/p&gt;
 &lt;table&gt;
 &lt;tr&gt;
 &lt;td&gt;Scale Factor
@@ -441,10 +447,19 @@ When writing to &lt;code&gt;/dev/null&lt;/code&gt; 
tpchgen  generates the entire
    &lt;/td&gt;
 &lt;/tr&gt;
 &lt;/table&gt;
-&lt;p&gt;&lt;strong&gt;Table 5&lt;/strong&gt;: tpchgen-cli parquet generation 
performance measured with  &lt;code&gt;time target/release/tpchgen-cli -s 
$SCALE_FACTOR --format=parquet&lt;/code&gt;&lt;/p&gt;
+&lt;p&gt;&lt;strong&gt;Table 5&lt;/strong&gt;: 
&lt;code&gt;tpchgen-cli&lt;/code&gt; Parquet generation performance measured 
with  &lt;code&gt;time
+target/release/tpchgen-cli -s $SCALE_FACTOR 
--format=parquet&lt;/code&gt;&lt;/p&gt;
 &lt;h1&gt;Conclusion 👊🎤&lt;/h1&gt;
-&lt;p&gt;With a few days, some fellow database nerds, and the power of Rust we 
made something 10x better than currently exists. We hope it inspires more 
research into analytical systems using the TPCH dataset and that people build 
awesome things with it. For example, Sean has already added &lt;a 
href="https://github.com/GlareDB/glaredb/pull/3549"&gt;on-demand generation of 
tables to GlareDB&lt;/a&gt;. Please consider joining us and helping out at 
&lt;a href="https://github.com/clflushopt [...]
-&lt;p&gt;We met while working together on Apache DataFusion in various 
capacities. If you are looking for a community of like minded people hacking on 
databases, we welcome you to &lt;a 
href="https://datafusion.apache.org/contributor-guide/communication.html"&gt;come
 join us&lt;/a&gt;. We are in the process of integrating this into DataFusion 
(see &lt;a 
href="https://github.com/apache/datafusion/issues/14608"&gt;apache/datafusion#14608&lt;/a&gt;)
 if you are interested in helping 🎣&lt;/p&gt;
+&lt;p&gt;With a few days, some fellow database nerds, and the power of Rust we 
made
+something 10x better than currently exists. We hope it inspires more research
+into analytical systems using the TPCH dataset and that people build awesome
+things with it. For example, Sean has already added &lt;a 
href="https://github.com/GlareDB/glaredb/pull/3549"&gt;on-demand generation of
+tables to GlareDB&lt;/a&gt;. Please consider joining us and helping out at
+&lt;a 
href="https://github.com/clflushopt/tpchgen-rs"&gt;https://github.com/clflushopt/tpchgen-rs&lt;/a&gt;.&lt;/p&gt;
+&lt;p&gt;We met while working together on Apache DataFusion in various 
capacities. If you
+are looking for a community of like minded people hacking on databases, we
+welcome you to &lt;a 
href="https://datafusion.apache.org/contributor-guide/communication.html"&gt;come
 join us&lt;/a&gt;. We are in the process of integrating this into
+DataFusion (see &lt;a 
href="https://github.com/apache/datafusion/issues/14608"&gt;apache/datafusion#14608&lt;/a&gt;)
 if you are interested in helping 🎣&lt;/p&gt;
 &lt;!-- Footnotes themselves at the bottom. --&gt;
 &lt;h2&gt;Notes&lt;/h2&gt;
 &lt;p&gt;[^2]: It is possible to embed the dbgen code, which appears to be the 
approach taken by DuckDB. This approach was tried in GlareDB (&lt;a 
href="https://github.com/GlareDB/glaredb/pull/3313"&gt;GlareDB/glaredb#3313&lt;/a&gt;),
 but ultimately shelved given the amount of effort needed to adapt and isolate 
the dbgen code.&lt;/p&gt;
diff --git a/blog/index.html b/blog/index.html
index 7a307d8..1606a7a 100644
--- a/blog/index.html
+++ b/blog/index.html
@@ -50,7 +50,7 @@
             <article class="post">
                 <header>
                     <div class="title">
-                        <h1><a 
href="/blog/2025/04/10/fastest-tpch-generator">tpchgen-rs: World’s fastest open 
source TPCH data generator, written in Rust</a></h1>
+                        <h1><a 
href="/blog/2025/04/10/fastest-tpch-generator">`tpchgen-rs` World’s fastest 
open source TPCH data generator, written in Rust</a></h1>
                         <p>Posted on: Thu 10 April 2025 by Andrew Lamb, Achraf 
B, and Sean Smith</p>
                         <p><!--
 {% comment %}


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@datafusion.apache.org
For additional commands, e-mail: commits-h...@datafusion.apache.org

Reply via email to