This is an automated email from the ASF dual-hosted git repository.
github-bot pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
The following commit(s) were added to refs/heads/asf-site by this push:
new c1083e0 Commit build products
c1083e0 is described below
commit c1083e0d1a0b08c71253a8cfea7839869b73546c
Author: Build Pelican (action) <[email protected]>
AuthorDate: Thu Apr 10 15:41:29 2025 +0000
Commit build products
---
.../2025/04/10/fastest-tpch-generator/index.html | 501 +++++++++++++++++++++
.../andrew-lamb-achraf-b-and-sean-smith.html | 116 +++++
output/category/blog.html | 47 ++
output/feed.xml | 30 +-
output/feeds/all-en.atom.xml | 470 ++++++++++++++++++-
.../andrew-lamb-achraf-b-and-sean-smith.atom.xml | 470 +++++++++++++++++++
.../andrew-lamb-achraf-b-and-sean-smith.rss.xml | 30 ++
output/feeds/blog.atom.xml | 470 ++++++++++++++++++-
.../images/fastest-tpch-generator/lamb-theory.png | Bin 0 -> 300479 bytes
.../fastest-tpch-generator/parquet-performance.png | Bin 0 -> 61946 bytes
.../fastest-tpch-generator/tbl-performance.png | Bin 0 -> 49477 bytes
output/index.html | 47 ++
12 files changed, 2178 insertions(+), 3 deletions(-)
diff --git a/output/2025/04/10/fastest-tpch-generator/index.html
b/output/2025/04/10/fastest-tpch-generator/index.html
new file mode 100644
index 0000000..e03c029
--- /dev/null
+++ b/output/2025/04/10/fastest-tpch-generator/index.html
@@ -0,0 +1,501 @@
+<!doctype html>
+<html class="no-js" lang="en" dir="ltr">
+ <head>
+ <meta charset="utf-8">
+ <meta http-equiv="x-ua-compatible" content="ie=edge">
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
+ <title>tpchgen-rs World’s fastest open source TPC-H data generator,
written in Rust - Apache DataFusion Blog</title>
+<link href="/blog/css/bootstrap.min.css" rel="stylesheet">
+<link href="/blog/css/fontawesome.all.min.css" rel="stylesheet">
+<link href="/blog/css/headerlink.css" rel="stylesheet">
+<link href="/blog/highlight/default.min.css" rel="stylesheet">
+<script src="/blog/highlight/highlight.js"></script>
+<script>hljs.highlightAll();</script> </head>
+ <body class="d-flex flex-column h-100">
+ <main class="flex-shrink-0">
+<!-- nav bar -->
+<nav class="navbar navbar-expand-lg navbar-dark bg-dark" aria-label="Fifth
navbar example">
+ <div class="container-fluid">
+ <a class="navbar-brand" href="/blog"><img
src="/blog/images/logo_original4x.png" style="height: 32px;"/> Apache
DataFusion Blog</a>
+ <button class="navbar-toggler" type="button" data-bs-toggle="collapse"
data-bs-target="#navbarADP" aria-controls="navbarADP" aria-expanded="false"
aria-label="Toggle navigation">
+ <span class="navbar-toggler-icon"></span>
+ </button>
+
+ <div class="collapse navbar-collapse" id="navbarADP">
+ <ul class="navbar-nav me-auto mb-2 mb-lg-0">
+ <li class="nav-item">
+ <a class="nav-link" href="/blog/about.html">About</a>
+ </li>
+ <li class="nav-item">
+ <a class="nav-link" href="/blog/feed.xml">RSS</a>
+ </li>
+ </ul>
+ </div>
+ </div>
+</nav>
+
+
+<!-- page contents -->
+<div id="contents">
+ <div class="bg-white p-5 rounded">
+ <div class="col-sm-8 mx-auto">
+ <h1>
+ tpchgen-rs World’s fastest open source TPC-H data generator,
written in Rust
+ </h1>
+ <p>Posted on: Thu 10 April 2025 by Andrew Lamb, Achraf B, and
Sean Smith</p>
+ <!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+-->
+<style>
+/* Table borders */
+table, th, td {
+ border: 1px solid black;
+ border-collapse: collapse;
+}
+th, td {
+ padding: 3px;
+}
+</style>
+<p><strong>TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+ with
dbgen</strong>.</p>
+<p>3 members of the <a href="https://datafusion.apache.org/">Apache
DataFusion</a> community used Rust and open source
+development to build <a
href="https://github.com/clflushopt/tpchgen-rs">tpchgen-rs</a>, a fully open
TPC-H data generator over 20x
+faster than any other implementation we know of.</p>
+<p>It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4
GB/s
+😎) on a Macbook Air M3 with 16GB of memory, compared to the classic
<code>dbgen</code>
+which takes 30 minutes<sup>1</sup> (0.05GB/sec). On the same machine, it takes
less than
+2 minutes to create all 3.6 GB of SF=100 in <a
href="https://parquet.apache.org/">Apache Parquet</a> format, which takes 44
minutes using <a href="https://duckdb.org">DuckDB</a>.
+It is finally convenient and efficient to run TPC-H queries locally when
testing
+analytical engines such as DataFusion.</p>
+<p><img alt="Time to create TPC-H parquet dataset for Scale Factor 1, 10, 100
and 1000" class="img-responsive"
src="/blog/images/fastest-tpch-generator/parquet-performance.png"
width="80%"/></p>
+<p><strong>Figure 1</strong>: Time to create TPC-H dataset for Scale Factor
(see below) 1, 10,
+100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core
GCP
+VM with 88GB of memory. For Scale Factor(SF) 100 <code>tpchgen</code> takes 1
minute and 14 seconds and
+<a href="https://duckdb.org">DuckDB</a> takes 17 minutes and 48 seconds. For
SF=1000, <code>tpchgen</code> takes 10
+minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure
+DuckDB’s time as it <a
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator">requires
647 GB of RAM</a>, more than the 88 GB that was
+available on our test machine. The testing methodology is in the
+<a
href="https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md">documentation</a>.</p>
+<p>This blog explains what TPC-H is, how we ported the vintage C data
generator to
+Rust (yes, <a
href="https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/">RWIR</a>)
and optimized its performance over the course of a few weeks
+of part-time work. We began this project so we can easily generate TPC-H data
in
+<a href="https://datafusion.apache.org/">Apache DataFusion</a> and <a
href="https://glaredb.com/">GlareDB</a>.</p>
+<h1>Try it for yourself</h1>
+<p>The tool is entirely open source under the <a
href="https://www.apache.org/licenses/LICENSE-2.0">Apache 2.0 license</a>.
Visit the <a href="https://github.com/clflushopt/tpchgen-rs">tpchgen-rs
repository</a> or try it for yourself by run the following commands after <a
href="https://www.rust-lang.org/tools/install">installing Rust</a>:</p>
+<pre><code class="language-shell">$ cargo install tpchgen-cli
+
+# create SF=1 in classic TBL format
+$ tpchgen-cli -s 1
+
+# create SF=10 in Parquet
+$ tpchgen-cli -s 10 --format=parquet
+</code></pre>
+<h1>What is TPC-H / dbgen?</h1>
+<p>The popular <a href="https://www.tpc.org/tpch/">TPC-H</a> benchmark (often
referred to as TPCH) helps evaluate the
+performance of database systems on <a
href="https://en.wikipedia.org/wiki/Online_analytical_processing">OLAP</a>
queries, the kind used to build BI
+dashboards.</p>
+<p>TPC-H has become a de facto standard for analytic systems. While there are
<a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf">well
+known</a> limitations as the data and queries do not well represent many real
world
+use cases, the majority of analytic database papers and industrial systems
still
+use TPC-H query performance benchmarks as a baseline. You will inevitably find
+multiple results for “<code>TPCH Performance <your favorite
database></code>” in any
+search engine.</p>
+<p>The benchmark was created at a time when access to high performance
analytical
+systems was not widespread, so the <a href="https://www.tpc.org/">Transaction
Processing Performance Council</a>
+defined a process of formal result verification. More recently, given the broad
+availability of free and open source database systems, it is common for users
to
+run and verify TPC-H performance themselves.</p>
+<p>TPC-H simulates a business environment with eight tables:
<code>REGION</code>, <code>NATION</code>,
+<code>SUPPLIER</code>, <code>CUSTOMER</code>, <code>PART</code>,
<code>PARTSUPP</code>, <code>ORDERS</code>, and <code>LINEITEM</code>. These
+tables are linked by foreign keys in a normalized schema representing a supply
+chain with parts, suppliers, customers and orders. The benchmark itself is 22
+SQL queries containing joins, aggregations, and sorting operations.</p>
+<p>The queries run against data created with <code><a
href="https://github.com/electrum/tpch-dbgen">dbgen</a></code>, a program
+written in a pre <a href="https://en.wikipedia.org/wiki/C99">C-99</a> dialect,
which generates data in a format called <em>TBL</em>
+(example in Figure 2). <code>dbgen</code> creates data for each of the 8
tables for a
+certain <em>Scale Factor</em>, commonly abbreviated as SF. Example Scale
Factors and
+corresponding dataset sizes are shown in Table 1. There is no theoretical upper
+bound on the Scale Factor.</p>
+<pre><code
class="language-text">103|2844|845|3|23|40177.32|0.01|0.04|N|O|1996-09-11|1996-09-18|1996-09-26|NONE|FOB|ironic
accou|
+229|10540|801|6|29|42065.66|0.04|0.00|R|F|1994-01-14|1994-02-16|1994-01-22|NONE|FOB|uriously
pending |
+263|2396|649|1|22|28564.58|0.06|0.08|R|F|1994-08-24|1994-06-20|1994-09-09|NONE|FOB|efully
express fo|
+327|4172|427|2|9|9685.53|0.09|0.05|A|F|1995-05-24|1995-07-11|1995-06-05|NONE|AIR|
asymptotes are fu|
+450|5627|393|4|40|61304.80|0.05|0.03|R|F|1995-03-20|1995-05-25|1995-04-14|NONE|RAIL|ve.
asymptote|
+</code></pre>
+<p><strong>Figure 2</strong>: Example TBL formatted output of
<code>dbgen</code> for the <code>LINEITEM</code> table</p>
+<table>
+<tr>
+<td><strong>Scale Factor</strong>
+</td>
+<td><strong>Data Size (TBL)</strong>
+</td>
+<td><strong>Data Size (Parquet)</strong>
+</td>
+</tr>
+<tr>
+<td>0.1
+ </td>
+<td>103 Mb
+ </td>
+<td>31 Mb
+ </td>
+</tr>
+<tr>
+<td>1
+ </td>
+<td>1 Gb
+ </td>
+<td>340 Mb
+ </td>
+</tr>
+<tr>
+<td>10
+ </td>
+<td>10 Gb
+ </td>
+<td>3.6 Gb
+ </td>
+</tr>
+<tr>
+<td>100
+ </td>
+<td>107 Gb
+ </td>
+<td>38 Gb
+ </td>
+</tr>
+<tr>
+<td>1000
+ </td>
+<td>1089 Gb
+ </td>
+<td>379 Gb
+ </td>
+</tr>
+</table>
+<p><strong>Table 1</strong>: TPC-H data set sizes at different scale factors
for both TBL and <a href="https://parquet.apache.org/">Apache Parquet</a>.</p>
+<h1>Why do we need a new TPC-H Data generator?</h1>
+<p>Despite the known limitations of the TPC-H benchmark, it is so well known
that it
+is used frequently in database performance analysis. To run TPC-H, you must
first
+load the data, using <code>dbgen</code>, which is not ideal for several
reasons:</p>
+<ol>
+<li>You must find and compile a copy of the 15+ year old C program (for
example <a
href="https://github.com/electrum/tpch-dbgen">electrum/tpch-dbgen</a>)</li>
+<li><code>dbgen</code> requires substantial time (Figure 3) and is not able to
use more than one core.</li>
+<li>It outputs TBL format, which typically requires loading into your database
(for example, <a
href="https://github.com/apache/datafusion/blob/507f6b6773deac69dd9d90dbe60831f5ea5abed1/datafusion/sqllogictest/test_files/tpch/create_tables.slt.part#L24-L124">here
is how to do so</a> in Apache DataFusion) prior to query.</li>
+<li>The implementation makes substantial assumptions about the operating
environment, making it difficult to extend or embed into other
systems.<sup>2</sup></li>
+</ol>
+<p><img alt="Time to generate TPC-H data in TBL format" class="img-responsive"
src="/blog/images/fastest-tpch-generator/tbl-performance.png" width="80%"/></p>
+<p><strong>Figure 3</strong>: Time to generate TPC-H data in TBL format.
<code>tpchgen</code> is
+shown in blue. <code>tpchgen</code> restricted to a single core is shown in
red. Unmodified
+<code>dbgen</code> is shown in green and <code>dbgen</code> modified to use
<code>-O3</code> optimization level
+is shown in yellow.</p>
+<p><code>dbgen</code> is so inconvenient and takes so long that vendors often
provide
+preloaded TPC-H data, for example <a
href="https://docs.snowflake.com/en/user-guide/sample-data-tpch">Snowflake
Sample Data</a>, <a
href="https://docs.databricks.com/aws/en/discover/databricks-datasets">Databricks
Sample
+datasets</a> and <a
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets">DuckDB
Pre-Generated Data Sets</a>.</p>
+<p>In addition to pre-generated datasets, DuckDB also provides a <a
href="https://duckdb.org/docs/stable/extensions/tpch.html">TPC-H extension</a>
+for generating TPC-H datasets within DuckDB. This is so much easier to use than
+the current alternatives that it leads many researchers and other thought
+leaders to use DuckDB to evaluate new ideas. For example, <a
href="https://github.com/lmwnshn">Wan Shen
+Lim</a> explicitly <a
href="https://github.com/apache/datafusion/issues/14373">mentioned the ease of
creating the TPC-H dataset</a> as one reason
+the first student project of <a
href="https://15799.courses.cs.cmu.edu/spring2025/">CMU-799 Spring 2025</a>
used DuckDB.</p>
+<p>As beneficial as the DuckDB TPC-H extension is, it is non-ideal for several
reasons:</p>
+<ol>
+<li>Creates data in a proprietary format, which requires export to use in
other systems.</li>
+<li>Requires significant time (e.g. 17 minutes for Scale Factor 10).</li>
+<li>Requires unnecessarily large amounts of memory (e.g. 71 GB for Scale
Factor 10)</li>
+</ol>
+<p>The above limitations makes it impractical to generate Scale Factor 100 and
+above on laptops or standard workstations, though DuckDB offers <a
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets">pre-computed
+files</a> for larger factors<sup>3</sup>.</p>
+<h1>Why Rust?</h1>
+<p>Realistically we used Rust because we wanted to integrate the data generator
+into <a href="https://datafusion.apache.org/">Apache DataFusion</a> and <a
href="https://glaredb.com/">GlareDB</a>. However, we also believe Rust is
+superior to C/C++ due to its comparable performance, but much higher programmer
+productivity (Figure 4). Productivity in this case refers to the ease of
+optimizing and adding multithreading without introducing hard to debug memory
+safety or concurrency issues.</p>
+<p>While Rust does allow unsafe access to memory (eliding bounds checking, for
+example), when required for performance, our implementation is entirely memory
+safe. The only <a
href="https://github.com/search?q=repo%3Aclflushopt%2Ftpchgen-rs%20unsafe&type=code">unsafe</a>
code is used to <a
href="https://github.com/clflushopt/tpchgen-rs/blob/c651da1fc309f9cb3872cbdf71e4796904dc62c6/tpchgen/src/text.rs#L72">skip</a>
UTF8 validation on known ASCII
+strings.</p>
+<p><img alt="Lamb Theory on Evolution of Systems Languages"
class="img-responsive"
src="/blog/images/fastest-tpch-generator/lamb-theory.png" width="80%"/></p>
+<p><strong>Figure 4</strong>: Lamb Theory of System Language Evolution from <a
href="https://midas.bu.edu/assets/slides/andrew_lamb_slides.pdf">Boston
University
+MiDAS Fall 2024 (Data Systems Seminar)</a>, <a
href="https://www.youtube.com/watch?v=CpnxuBwHbUc">recording</a>. Special
+thanks to <a href="https://x.com/KurtFehlhauer">@KurtFehlhauer</a></p>
+<h1>How: The Journey</h1>
+<p>We did it together as a team in the open over the course of a few weeks.
+<a href="https://github.com/lmwnshn">Wan Shen Lim</a> inspired the project by
pointing out the benefits of <a
href="https://github.com/apache/datafusion/issues/14373">easy TPC-H
+dataset creation</a> and <a
href="https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600">suggesting
we check out a Java port on February 11,
+2025</a>. Achraf made <a
href="https://github.com/clflushopt/tpchgen-rs/commit/53d3402680422a15349ece0a7ea3c3f001018ba0">first
commit a few days later</a> on February 16, and <a
href="https://github.com/clflushopt/tpchgen-rs/commit/9bb386a4c55b8cf93ffac1b98f29b5da990ee79e">Andrew
+and Sean started helping on March 8, 2025</a> and we <a
href="https://crates.io/crates/tpchgen/0.1.0">released version 0.1</a> on
+March 30, 2025.</p>
+<h2>Optimizing Single Threaded Performance</h2>
+<p>Archaf <a href="https://github.com/clflushopt/tpchgen-rs/pull/16">completed
the end to end conformance tests</a>, to ensure correctness, and
+an initial <a href="https://github.com/clflushopt/tpchgen-rs/pull/12">cli
check in</a> on March 15, 2025.</p>
+<p>On a Macbook Pro M3 (Nov 2023), the initial performance numbers were
actually
+slower than the original Java implementation which was ported 😭. This
wasn’t
+surprising since the focus of the first version was to get a byte of byte
+compatible port, and knew about the performance shortcomings and how to
approach
+them.</p>
+<table>
+<tr>
+<td><strong>Scale Factor</strong>
+</td>
+<td><strong>Time</strong>
+</td>
+</tr>
+<tr>
+<td>1
+ </td>
+<td>0m10.307s
+ </td>
+</tr>
+<tr>
+<td>10
+ </td>
+<td>1m26.530s
+ </td>
+</tr>
+<tr>
+<td>100
+ </td>
+<td>14m56.986s
+ </td>
+</tr>
+</table>
+<p><strong>Table 2</strong>: Performance of running <a
href="https://github.com/clflushopt/tpchgen-rs/pull/12">the initial
tpchgen-cli</a>, measured with
+<code>time target/release/tpchgen-cli -s $SCALE_FACTOR</code></p>
+<p>With this strong foundation we began optimizing the code using Rust’s
low level
+memory management to improve performance while retaining memory safely. We
spent
+several days obsessing over low level details and implemented a textbook like
+list of optimizations:</p>
+<ul>
+<li><a href="https://github.com/clflushopt/tpchgen-rs/pull/19">Avoiding
startup overhead</a>,</li>
+<li><a href="https://github.com/clflushopt/tpchgen-rs/pull/26">not</a> <a
href="https://github.com/clflushopt/tpchgen-rs/pull/32">copying</a> strings
(many more PRs as well)</li>
+<li><a href="https://github.com/clflushopt/tpchgen-rs/pull/27">Rust’s
zero overhead abstractions for dates</a></li>
+<li><a href="https://github.com/clflushopt/tpchgen-rs/pull/35">Static
strings</a> (entirely safely with static lifetimes)</li>
+<li><a href="https://github.com/clflushopt/tpchgen-rs/pull/33">Generics to
avoid virtual function call overhead</a></li>
+<li><a href="https://github.com/clflushopt/tpchgen-rs/pull/62">Moving lookups
from runtime</a> to load time</li>
+</ul>
+<p>At the time of writing, single threaded performance is now 2.5x-2.7x faster
than the initial version, as shown in Table 3.</p>
+<table>
+<tr>
+<td><strong>Scale Factor</strong>
+</td>
+<td><strong>Time</strong>
+</td>
+<td><strong>Times faster</strong>
+</td>
+</tr>
+<tr>
+<td>1
+ </td>
+<td>0m4.079s
+ </td>
+<td>2.5x
+ </td>
+</tr>
+<tr>
+<td>10
+ </td>
+<td>0m31.616s
+ </td>
+<td>2.7x
+ </td>
+</tr>
+<tr>
+<td>100
+ </td>
+<td>5m28.083s
+ </td>
+<td>2.7x
+ </td>
+</tr>
+</table>
+<p><strong>Table 3</strong>: Single threaded <code>tpchgen-cli</code>
performance, measured with <code>time target/release/tpchgen-cli -s
$SCALE_FACTOR --num-threads=1</code></p>
+<h2>Multi-threading</h2>
+<p>Then we applied <a
href="https://doc.rust-lang.org/book/ch16-00-concurrency.html">Rust’s
fearless concurrency</a> – with a single, <a
href="https://github.com/clflushopt/tpchgen-rs/commit/ab720a70cdc80a711f4a3dda6bac05445106f499">small
PR</a> (272
+net new lines) we updated the same memory safe code to run with multiple
threads
+and consume bounded memory using <a
href="https://thenewstack.io/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/">tokio
for the thread scheduler</a><sup>4</sup>.</p>
+<p>As shown in Table 4, with this change, tpchgen-cli generates the full SF=100
+dataset in 32 seconds (which is 3.3 GB/sec 🤯). Further investigation reveals
+that at SF=100 our generator is actually IO bound (which is not the case for
+<code>dbgen</code> or <code>duckdb</code>) – it creates data
<strong>faster than can be written to an SSD</strong>.
+When writing to <code>/dev/null</code> tpchgen generates the entire dataset
in 25 seconds
+(4 GB/s).</p>
+<table>
+<tr>
+<td><strong>Scale Factor</strong>
+</td>
+<td><strong>Time</strong>
+</td>
+<td><strong>Times faster than initial implementation</strong>
+</td>
+<td><strong>Times faster than optimized single threaded</strong>
+</td>
+</tr>
+<tr>
+<td>1
+ </td>
+<td>0m1.369s
+ </td>
+<td>7.3x
+ </td>
+<td>3x
+ </td>
+</tr>
+<tr>
+<td>10
+ </td>
+<td>0m3.828s
+ </td>
+<td>22.6x
+ </td>
+<td>8.2x
+ </td>
+</tr>
+<tr>
+<td>100
+ </td>
+<td>0m32.615s
+ </td>
+<td>27.5x
+ </td>
+<td>10x
+ </td>
+</tr>
+<tr>
+<td>100 (to /dev/null)
+ </td>
+<td>0m25.088s
+ </td>
+<td>35.7x
+ </td>
+<td>13.1x
+ </td>
+</tr>
+</table>
+<p><strong>Table 4</strong>: tpchgen-cli (multithreaded) performance measured
with <code>time target/release/tpchgen-cli -s $SCALE_FACTOR</code></p>
+<p>Using Rust and async streams, the data generator is also fully streaming:
memory
+use does not increase with increasing data size / scale factors<sup>5</sup>.
The DuckDB
+generator seems to <a
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator">require
far more memory</a> than is commonly available on
+developer laptops and memory use increases with scale factor. With
<code>tpchgen-cli</code>
+it is perfectly possible to create data for SF=10000 or larger on a machine
with
+16GB of memory (assuming sufficient storage capacity).</p>
+<h2>Direct to parquet</h2>
+<p>At this point, <code>tpchgen-cli</code> could very quickly generate the TBL
format.
+However, as described above, the TBL is annoying to work with, because</p>
+<ol>
+<li>It has no header</li>
+<li>It is like a CSV but the delimiter is <code>|</code></li>
+<li>Each line ends with an extra <code>|</code> delimiter before the newline
🙄</li>
+<li>No system that we know can read them without additional configuration.</li>
+</ol>
+<p>We next <a href="https://github.com/clflushopt/tpchgen-rs/pull/54">added
support for CSV</a> generation (special thanks <a
href="https://github.com/niebayes">@niebayes</a> from
+Datalayers for finding and <a
href="https://github.com/clflushopt/tpchgen-rs/issues/73">fixing</a> <a
href="https://github.com/clflushopt/tpchgen-rs/issues/65">bugs</a>) which
performs at the same
+speed as TBL. While CSV files are far more standard than TBL, they must still
be
+parsed prior to load and automatic type inference may not deduce the types
+needed for the TPC-H benchmarks (e.g. floating point vs Decimal).</p>
+<p>What would be far more useful is a typed, efficient columnar format such as
+Apache Parquet which is supported by all modern query engines. So we <a
href="https://github.com/clflushopt/tpchgen-rs/pull/71">made</a> a
+<a href="https://crates.io/crates/tpchgen-arrow">tpchgen-arrow</a> crate to
create <a href="https://arrow.apache.org/">Apache Arrow</a> arrays directly and
then <a href="https://github.com/clflushopt/tpchgen-rs/pull/61">a small
+300 line PR</a> to feed those arrays to the <a
href="https://crates.io/crates/parquet">Rust Parquet writer</a>, again using
+tokio for parallelized but memory bound work.</p>
+<p>This approach was simple, fast and scalable, as shown in Table 5. Even
though
+creating Parquet files is significantly more computationally expensive than TBL
+or CSV, tpchgen-cli creates the full SF=100 parquet format dataset in less than
+45 seconds.</p>
+<table>
+<tr>
+<td><strong>Scale Factor</strong>
+</td>
+<td><strong>Time to generate Parquet</strong>
+</td>
+<td><strong>Speed compared to tbl generation</strong>
+</td>
+</tr>
+<tr>
+<td>1
+ </td>
+<td>0m1.649s
+ </td>
+<td>0.8x
+ </td>
+</tr>
+<tr>
+<td>10
+ </td>
+<td>0m5.643s
+ </td>
+<td>0.7x
+ </td>
+</tr>
+<tr>
+<td>100
+ </td>
+<td>0m45.243s
+ </td>
+<td>0.7x
+ </td>
+</tr>
+<tr>
+<td>100 (to /dev/null)
+ </td>
+<td>0m45.153s
+ </td>
+<td>0.5x
+ </td>
+</tr>
+</table>
+<p><strong>Table 5</strong>: <code>tpchgen-cli</code> Parquet generation
performance measured with <code>time
+target/release/tpchgen-cli -s $SCALE_FACTOR --format=parquet</code></p>
+<h1>Conclusion 👊🎤</h1>
+<p>In just a few days, with some fellow database nerds and the power of Rust,
we built something 10x better than what currently exists. We hope it inspires
more research
+into analytical systems using the TPC-H dataset and that people build awesome
+things with it. For example, Sean has already added <a
href="https://github.com/GlareDB/glaredb/pull/3549">on-demand generation of
+tables to GlareDB</a>. Please consider joining us and helping out at
+<a
href="https://github.com/clflushopt/tpchgen-rs">https://github.com/clflushopt/tpchgen-rs</a>.</p>
+<p>We met while working together on Apache DataFusion in various capacities.
If you
+are looking for a community of like minded people hacking on databases, we
+welcome you to <a
href="https://datafusion.apache.org/contributor-guide/communication.html">come
join us</a>. We are in the process of integrating this into
+DataFusion (see <a
href="https://github.com/apache/datafusion/issues/14608">apache/datafusion#14608</a>)
if you are interested in helping 🎣</p>
+<h1>About the Authors:</h1>
+<ul>
+<li><a href="https://www.linkedin.com/in/andrewalamb/">Andrew Lamb</a> (<a
href="https://github.com/alamb">@alamb</a>) is a Staff Engineer at <a
href="https://www.influxdata.com/">InfluxData</a> and a PMC member of <a
href="https://datafusion.apache.org/">Apache DataFusion</a> and <a
href="https://arrow.apache.org/">Apache Arrow</a>.</li>
+<li>Achraf B (<a href="https://github.com/clflushopt">@clflushopt</a>) is a
Software Engineer at <a href="https://optable.co/">Optable</a> where he works
on data infrastructure.</li>
+<li><a href="https://www.linkedin.com/in/scsmithr/">Sean Smith</a> (<a
href="https://github.com/scsmithr">@scsmithr</a>) is the founder of <a
href="https://glaredb.com/">GlareDB</a> focused on building a fast analytics
database.</li>
+</ul>
+<!-- Footnotes themselves at the bottom. -->
+<h2>Footnotes</h2>
+<p><em>1</em>: Actual Time: <code>30:35</code></p>
+<p><em>2</em>: It is possible to embed the dbgen code, which appears to be the
approach taken by DuckDB. This approach was tried in GlareDB (<a
href="https://github.com/GlareDB/glaredb/pull/3313">GlareDB/glaredb#3313</a>),
but ultimately shelved given the amount of effort needed to adapt and isolate
the dbgen code.</p>
+<p><em>3</em>: It is pretty amazing to imagine the machine required to
generate SF300 that had 1.8TB (!!) of RAM</p>
+<p><em>4</em>: We tried to <a
href="https://github.com/clflushopt/tpchgen-rs/pull/34">use Rayon (see
discussion here)</a>, but could not easily keep memory bounded.</p>
+<p><em>5</em>: <code>tpchgen-cli</code> memory usage is a function of the
number of threads: each thread needs some buffer space</p>
+ </div>
+ </div>
+ </div>
+ <!-- footer -->
+ <div class="row">
+ <div class="large-12 medium-12 columns">
+ <p style="font-style: italic; font-size: 0.8rem; text-align: center;">
+ Copyright 2025, <a href="https://www.apache.org/">The Apache
Software Foundation</a>, Licensed under the <a
href="https://www.apache.org/licenses/LICENSE-2.0">Apache License, Version
2.0</a>.<br/>
+ Apache® and the Apache feather logo are trademarks of The Apache
Software Foundation.
+ </p>
+ </div>
+ </div>
+ <script src="/blog/js/bootstrap.bundle.min.js"></script> </main>
+ </body>
+</html>
diff --git a/output/author/andrew-lamb-achraf-b-and-sean-smith.html
b/output/author/andrew-lamb-achraf-b-and-sean-smith.html
new file mode 100644
index 0000000..7b48354
--- /dev/null
+++ b/output/author/andrew-lamb-achraf-b-and-sean-smith.html
@@ -0,0 +1,116 @@
+ <!doctype html>
+ <html class="no-js" lang="en" dir="ltr">
+ <head>
+ <meta charset="utf-8">
+ <meta http-equiv="x-ua-compatible" content="ie=edge">
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
+ <title>Apache DataFusion Blog</title>
+<link href="/blog/css/bootstrap.min.css" rel="stylesheet">
+<link href="/blog/css/fontawesome.all.min.css" rel="stylesheet">
+<link href="/blog/css/headerlink.css" rel="stylesheet">
+<link href="/blog/highlight/default.min.css" rel="stylesheet">
+<script src="/blog/highlight/highlight.js"></script>
+<script>hljs.highlightAll();</script> <link
href="/blog/css/blog_index.css" rel="stylesheet">
+ </head>
+ <body class="d-flex flex-column h-100">
+ <main class="flex-shrink-0">
+ <div>
+
+<!-- nav bar -->
+<nav class="navbar navbar-expand-lg navbar-dark bg-dark" aria-label="Fifth
navbar example">
+ <div class="container-fluid">
+ <a class="navbar-brand" href="/blog"><img
src="/blog/images/logo_original4x.png" style="height: 32px;"/> Apache
DataFusion Blog</a>
+ <button class="navbar-toggler" type="button" data-bs-toggle="collapse"
data-bs-target="#navbarADP" aria-controls="navbarADP" aria-expanded="false"
aria-label="Toggle navigation">
+ <span class="navbar-toggler-icon"></span>
+ </button>
+
+ <div class="collapse navbar-collapse" id="navbarADP">
+ <ul class="navbar-nav me-auto mb-2 mb-lg-0">
+ <li class="nav-item">
+ <a class="nav-link" href="/blog/about.html">About</a>
+ </li>
+ <li class="nav-item">
+ <a class="nav-link" href="/blog/feed.xml">RSS</a>
+ </li>
+ </ul>
+ </div>
+ </div>
+</nav>
+ <div id="contents">
+ <div class="bg-white p-5 rounded">
+ <div class="col-sm-8 mx-auto">
+<div id="contents">
+ <div class="bg-white p-5 rounded">
+ <div class="col-sm-8 mx-auto">
+
+ <h3>Welcome to the Apache DataFusion Blog!</h3>
+ <p><i>Here you can find the latest updates from DataFusion and
related projects.</i></p>
+
+
+ <!-- Post -->
+ <div class="row">
+ <div class="callout">
+ <article class="post">
+ <header>
+ <div class="title">
+ <h1><a
href="/blog/2025/04/10/fastest-tpch-generator">tpchgen-rs World’s fastest open
source TPC-H data generator, written in Rust</a></h1>
+ <p>Posted on: Thu 10 April 2025 by Andrew Lamb, Achraf
B, and Sean Smith</p>
+ <p><!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+-->
+<style>
+/* Table borders */
+table, th, td {
+ border: 1px solid black;
+ border-collapse: collapse;
+}
+th, td {
+ padding: 3px;
+}
+</style>
+<p><strong>TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+ with
dbgen</strong>.</p>
+<p>3 members of the <a href="https://datafusion.apache.org/">Apache
DataFusion</a> community used Rust and open source
+development to build <a
href="https://github.com/clflushopt/tpchgen-rs">tpchgen-rs</a>, a fully open
TPC-H data generator over …</p></p>
+ <footer>
+ <ul class="actions">
+ <div style="text-align: right"><a
href="/blog/2025/04/10/fastest-tpch-generator" class="button medium">Continue
Reading</a></div>
+ </ul>
+ <ul class="stats">
+ </ul>
+ </footer>
+ </article>
+ </div>
+ </div>
+
+ </div>
+ </div>
+</div> </div>
+ </div>
+ </div>
+
+ <!-- footer -->
+ <div class="row">
+ <div class="large-12 medium-12 columns">
+ <p style="font-style: italic; font-size: 0.8rem; text-align: center;">
+ Copyright 2025, <a href="https://www.apache.org/">The Apache
Software Foundation</a>, Licensed under the <a
href="https://www.apache.org/licenses/LICENSE-2.0">Apache License, Version
2.0</a>.<br/>
+ Apache® and the Apache feather logo are trademarks of The Apache
Software Foundation.
+ </p>
+ </div>
+ </div>
+ <script src="/blog/js/bootstrap.bundle.min.js"></script> </div>
+ </main>
+ </body>
+ </html>
diff --git a/output/category/blog.html b/output/category/blog.html
index 4385e63..bcf7e8c 100644
--- a/output/category/blog.html
+++ b/output/category/blog.html
@@ -47,6 +47,53 @@
<p><i>Here you can find the latest updates from DataFusion and
related projects.</i></p>
+ <!-- Post -->
+ <div class="row">
+ <div class="callout">
+ <article class="post">
+ <header>
+ <div class="title">
+ <h1><a
href="/blog/2025/04/10/fastest-tpch-generator">tpchgen-rs World’s fastest open
source TPC-H data generator, written in Rust</a></h1>
+ <p>Posted on: Thu 10 April 2025 by Andrew Lamb, Achraf
B, and Sean Smith</p>
+ <p><!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+-->
+<style>
+/* Table borders */
+table, th, td {
+ border: 1px solid black;
+ border-collapse: collapse;
+}
+th, td {
+ padding: 3px;
+}
+</style>
+<p><strong>TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+ with
dbgen</strong>.</p>
+<p>3 members of the <a href="https://datafusion.apache.org/">Apache
DataFusion</a> community used Rust and open source
+development to build <a
href="https://github.com/clflushopt/tpchgen-rs">tpchgen-rs</a>, a fully open
TPC-H data generator over …</p></p>
+ <footer>
+ <ul class="actions">
+ <div style="text-align: right"><a
href="/blog/2025/04/10/fastest-tpch-generator" class="button medium">Continue
Reading</a></div>
+ </ul>
+ <ul class="stats">
+ </ul>
+ </footer>
+ </article>
+ </div>
+ </div>
<!-- Post -->
<div class="row">
<div class="callout">
diff --git a/output/feed.xml b/output/feed.xml
index b09ca71..430f736 100644
--- a/output/feed.xml
+++ b/output/feed.xml
@@ -1,5 +1,33 @@
<?xml version="1.0" encoding="utf-8"?>
-<rss version="2.0"><channel><title>Apache DataFusion
Blog</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Sun,
30 Mar 2025 00:00:00 +0000</lastBuildDate><item><title>Apache DataFusion
Python 46.0.0
Released</title><link>https://datafusion.apache.org/blog/2025/03/30/datafusion-python-46.0.0</link><description><!--
+<rss version="2.0"><channel><title>Apache DataFusion
Blog</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Thu,
10 Apr 2025 00:00:00 +0000</lastBuildDate><item><title>tpchgen-rs World’s
fastest open source TPC-H data generator, written in
Rust</title><link>https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-generator</link><description><!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+-->
+<style>
+/* Table borders */
+table, th, td {
+ border: 1px solid black;
+ border-collapse: collapse;
+}
+th, td {
+ padding: 3px;
+}
+</style>
+<p><strong>TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+
with dbgen</strong>.</p>
+<p>3 members of the <a
href="https://datafusion.apache.org/">Apache DataFusion</a> community
used Rust and open source
+development to build <a
href="https://github.com/clflushopt/tpchgen-rs">tpchgen-rs</a>, a
fully open TPC-H data generator over …</p></description><dc:creator
xmlns:dc="http://purl.org/dc/elements/1.1/">Andrew Lamb, Achraf B, and Sean
Smith</dc:creator><pubDate>Thu, 10 Apr 2025 00:00:00 +0000</pubDate><guid
isPermaLink="false">tag:datafusion.apache.org,2025-04-10:/blog/2025/04/10/fastest-tpch-generator</guid><category>blog</category></item><item><title>Apache
DataFusion [...]
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
diff --git a/output/feeds/all-en.atom.xml b/output/feeds/all-en.atom.xml
index 753370c..bde3316 100644
--- a/output/feeds/all-en.atom.xml
+++ b/output/feeds/all-en.atom.xml
@@ -1,5 +1,473 @@
<?xml version="1.0" encoding="utf-8"?>
-<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion
Blog</title><link href="https://datafusion.apache.org/blog/"
rel="alternate"></link><link
href="https://datafusion.apache.org/blog/feeds/all-en.atom.xml"
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-03-30T00:00:00+00:00</updated><subtitle></subtitle><entry><title>Apache
DataFusion Python 46.0.0 Released</title><link
href="https://datafusion.apache.org/blog/2025/03/30/datafusion-python-46.0.0
[...]
+<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion
Blog</title><link href="https://datafusion.apache.org/blog/"
rel="alternate"></link><link
href="https://datafusion.apache.org/blog/feeds/all-en.atom.xml"
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-04-10T00:00:00+00:00</updated><subtitle></subtitle><entry><title>tpchgen-rs
World’s fastest open source TPC-H data generator, written in Rust</title><link
href="https://datafusion.apache.org/blog [...]
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+-->
+<style>
+/* Table borders */
+table, th, td {
+ border: 1px solid black;
+ border-collapse: collapse;
+}
+th, td {
+ padding: 3px;
+}
+</style>
+<p><strong>TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+
with dbgen</strong>.</p>
+<p>3 members of the <a
href="https://datafusion.apache.org/">Apache DataFusion</a> community
used Rust and open source
+development to build <a
href="https://github.com/clflushopt/tpchgen-rs">tpchgen-rs</a>, a
fully open TPC-H data generator over …</p></summary><content
type="html"><!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+-->
+<style>
+/* Table borders */
+table, th, td {
+ border: 1px solid black;
+ border-collapse: collapse;
+}
+th, td {
+ padding: 3px;
+}
+</style>
+<p><strong>TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+
with dbgen</strong>.</p>
+<p>3 members of the <a
href="https://datafusion.apache.org/">Apache DataFusion</a> community
used Rust and open source
+development to build <a
href="https://github.com/clflushopt/tpchgen-rs">tpchgen-rs</a>, a
fully open TPC-H data generator over 20x
+faster than any other implementation we know of.</p>
+<p>It is now possible to create the TPC-H SF=100 dataset in 72.23
seconds (1.4 GB/s
+😎) on a Macbook Air M3 with 16GB of memory, compared to the classic
<code>dbgen</code>
+which takes 30 minutes<sup>1</sup> (0.05GB/sec). On the same
machine, it takes less than
+2 minutes to create all 3.6 GB of SF=100 in <a
href="https://parquet.apache.org/">Apache Parquet</a> format, which
takes 44 minutes using <a href="https://duckdb.org">DuckDB</a>.
+It is finally convenient and efficient to run TPC-H queries locally when
testing
+analytical engines such as DataFusion.</p>
+<p><img alt="Time to create TPC-H parquet dataset for Scale Factor
1, 10, 100 and 1000" class="img-responsive"
src="/blog/images/fastest-tpch-generator/parquet-performance.png"
width="80%"/></p>
+<p><strong>Figure 1</strong>: Time to create TPC-H dataset
for Scale Factor (see below) 1, 10,
+100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core
GCP
+VM with 88GB of memory. For Scale Factor(SF) 100
<code>tpchgen</code> takes 1 minute and 14 seconds and
+<a href="https://duckdb.org">DuckDB</a> takes 17 minutes and 48
seconds. For SF=1000, <code>tpchgen</code> takes 10
+minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure
+DuckDB&rsquo;s time as it <a
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator">requires
647 GB of RAM</a>, more than the 88 GB that was
+available on our test machine. The testing methodology is in the
+<a
href="https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md">documentation</a>.</p>
+<p>This blog explains what TPC-H is, how we ported the vintage C data
generator to
+Rust (yes, <a
href="https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/">RWIR</a>)
and optimized its performance over the course of a few weeks
+of part-time work. We began this project so we can easily generate TPC-H data
in
+<a href="https://datafusion.apache.org/">Apache DataFusion</a> and
<a href="https://glaredb.com/">GlareDB</a>.</p>
+<h1>Try it for yourself</h1>
+<p>The tool is entirely open source under the <a
href="https://www.apache.org/licenses/LICENSE-2.0">Apache 2.0
license</a>. Visit the <a
href="https://github.com/clflushopt/tpchgen-rs">tpchgen-rs
repository</a> or try it for yourself by run the following commands after
<a href="https://www.rust-lang.org/tools/install">installing
Rust</a>:</p>
+<pre><code class="language-shell">$ cargo install tpchgen-cli
+
+# create SF=1 in classic TBL format
+$ tpchgen-cli -s 1
+
+# create SF=10 in Parquet
+$ tpchgen-cli -s 10 --format=parquet
+</code></pre>
+<h1>What is TPC-H / dbgen?</h1>
+<p>The popular <a href="https://www.tpc.org/tpch/">TPC-H</a>
benchmark (often referred to as TPCH) helps evaluate the
+performance of database systems on <a
href="https://en.wikipedia.org/wiki/Online_analytical_processing">OLAP</a>
queries, the kind used to build BI
+dashboards.</p>
+<p>TPC-H has become a de facto standard for analytic systems. While
there are <a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf">well
+known</a> limitations as the data and queries do not well represent many
real world
+use cases, the majority of analytic database papers and industrial systems
still
+use TPC-H query performance benchmarks as a baseline. You will inevitably find
+multiple results for &ldquo;<code>TPCH Performance &lt;your
favorite database&gt;</code>&rdquo; in any
+search engine.</p>
+<p>The benchmark was created at a time when access to high performance
analytical
+systems was not widespread, so the <a
href="https://www.tpc.org/">Transaction Processing Performance
Council</a>
+defined a process of formal result verification. More recently, given the broad
+availability of free and open source database systems, it is common for users
to
+run and verify TPC-H performance themselves.</p>
+<p>TPC-H simulates a business environment with eight tables:
<code>REGION</code>, <code>NATION</code>,
+<code>SUPPLIER</code>, <code>CUSTOMER</code>,
<code>PART</code>, <code>PARTSUPP</code>,
<code>ORDERS</code>, and <code>LINEITEM</code>. These
+tables are linked by foreign keys in a normalized schema representing a supply
+chain with parts, suppliers, customers and orders. The benchmark itself is 22
+SQL queries containing joins, aggregations, and sorting operations.</p>
+<p>The queries run against data created with <code><a
href="https://github.com/electrum/tpch-dbgen">dbgen</a></code>,
a program
+written in a pre <a
href="https://en.wikipedia.org/wiki/C99">C-99</a> dialect, which
generates data in a format called <em>TBL</em>
+(example in Figure 2). <code>dbgen</code> creates data for each of
the 8 tables for a
+certain <em>Scale Factor</em>, commonly abbreviated as SF. Example
Scale Factors and
+corresponding dataset sizes are shown in Table 1. There is no theoretical upper
+bound on the Scale Factor.</p>
+<pre><code
class="language-text">103|2844|845|3|23|40177.32|0.01|0.04|N|O|1996-09-11|1996-09-18|1996-09-26|NONE|FOB|ironic
accou|
+229|10540|801|6|29|42065.66|0.04|0.00|R|F|1994-01-14|1994-02-16|1994-01-22|NONE|FOB|uriously
pending |
+263|2396|649|1|22|28564.58|0.06|0.08|R|F|1994-08-24|1994-06-20|1994-09-09|NONE|FOB|efully
express fo|
+327|4172|427|2|9|9685.53|0.09|0.05|A|F|1995-05-24|1995-07-11|1995-06-05|NONE|AIR|
asymptotes are fu|
+450|5627|393|4|40|61304.80|0.05|0.03|R|F|1995-03-20|1995-05-25|1995-04-14|NONE|RAIL|ve.
asymptote|
+</code></pre>
+<p><strong>Figure 2</strong>: Example TBL formatted output
of <code>dbgen</code> for the <code>LINEITEM</code>
table</p>
+<table>
+<tr>
+<td><strong>Scale Factor</strong>
+</td>
+<td><strong>Data Size (TBL)</strong>
+</td>
+<td><strong>Data Size (Parquet)</strong>
+</td>
+</tr>
+<tr>
+<td>0.1
+ </td>
+<td>103 Mb
+ </td>
+<td>31 Mb
+ </td>
+</tr>
+<tr>
+<td>1
+ </td>
+<td>1 Gb
+ </td>
+<td>340 Mb
+ </td>
+</tr>
+<tr>
+<td>10
+ </td>
+<td>10 Gb
+ </td>
+<td>3.6 Gb
+ </td>
+</tr>
+<tr>
+<td>100
+ </td>
+<td>107 Gb
+ </td>
+<td>38 Gb
+ </td>
+</tr>
+<tr>
+<td>1000
+ </td>
+<td>1089 Gb
+ </td>
+<td>379 Gb
+ </td>
+</tr>
+</table>
+<p><strong>Table 1</strong>: TPC-H data set sizes at
different scale factors for both TBL and <a
href="https://parquet.apache.org/">Apache Parquet</a>.</p>
+<h1>Why do we need a new TPC-H Data generator?</h1>
+<p>Despite the known limitations of the TPC-H benchmark, it is so well
known that it
+is used frequently in database performance analysis. To run TPC-H, you must
first
+load the data, using <code>dbgen</code>, which is not ideal for
several reasons:</p>
+<ol>
+<li>You must find and compile a copy of the 15+ year old C program (for
example <a
href="https://github.com/electrum/tpch-dbgen">electrum/tpch-dbgen</a>)</li>
+<li><code>dbgen</code> requires substantial time (Figure 3)
and is not able to use more than one core.</li>
+<li>It outputs TBL format, which typically requires loading into your
database (for example, <a
href="https://github.com/apache/datafusion/blob/507f6b6773deac69dd9d90dbe60831f5ea5abed1/datafusion/sqllogictest/test_files/tpch/create_tables.slt.part#L24-L124">here
is how to do so</a> in Apache DataFusion) prior to query.</li>
+<li>The implementation makes substantial assumptions about the operating
environment, making it difficult to extend or embed into other
systems.<sup>2</sup></li>
+</ol>
+<p><img alt="Time to generate TPC-H data in TBL format"
class="img-responsive"
src="/blog/images/fastest-tpch-generator/tbl-performance.png"
width="80%"/></p>
+<p><strong>Figure 3</strong>: Time to generate TPC-H data in
TBL format. <code>tpchgen</code> is
+shown in blue. <code>tpchgen</code> restricted to a single core is
shown in red. Unmodified
+<code>dbgen</code> is shown in green and
<code>dbgen</code> modified to use <code>-O3</code>
optimization level
+is shown in yellow.</p>
+<p><code>dbgen</code> is so inconvenient and takes so long
that vendors often provide
+preloaded TPC-H data, for example <a
href="https://docs.snowflake.com/en/user-guide/sample-data-tpch">Snowflake
Sample Data</a>, <a
href="https://docs.databricks.com/aws/en/discover/databricks-datasets">Databricks
Sample
+datasets</a> and <a
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets">DuckDB
Pre-Generated Data Sets</a>.</p>
+<p>In addition to pre-generated datasets, DuckDB also provides a <a
href="https://duckdb.org/docs/stable/extensions/tpch.html">TPC-H
extension</a>
+for generating TPC-H datasets within DuckDB. This is so much easier to use than
+the current alternatives that it leads many researchers and other thought
+leaders to use DuckDB to evaluate new ideas. For example, <a
href="https://github.com/lmwnshn">Wan Shen
+Lim</a> explicitly <a
href="https://github.com/apache/datafusion/issues/14373">mentioned the ease
of creating the TPC-H dataset</a> as one reason
+the first student project of <a
href="https://15799.courses.cs.cmu.edu/spring2025/">CMU-799 Spring
2025</a> used DuckDB.</p>
+<p>As beneficial as the DuckDB TPC-H extension is, it is non-ideal for
several reasons:</p>
+<ol>
+<li>Creates data in a proprietary format, which requires export to use
in other systems.</li>
+<li>Requires significant time (e.g. 17 minutes for Scale Factor
10).</li>
+<li>Requires unnecessarily large amounts of memory (e.g. 71 GB for Scale
Factor 10)</li>
+</ol>
+<p>The above limitations makes it impractical to generate Scale Factor
100 and
+above on laptops or standard workstations, though DuckDB offers <a
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets">pre-computed
+files</a> for larger factors<sup>3</sup>.</p>
+<h1>Why Rust?</h1>
+<p>Realistically we used Rust because we wanted to integrate the data
generator
+into <a href="https://datafusion.apache.org/">Apache
DataFusion</a> and <a
href="https://glaredb.com/">GlareDB</a>. However, we also believe Rust
is
+superior to C/C++ due to its comparable performance, but much higher programmer
+productivity (Figure 4). Productivity in this case refers to the ease of
+optimizing and adding multithreading without introducing hard to debug memory
+safety or concurrency issues.</p>
+<p>While Rust does allow unsafe access to memory (eliding bounds
checking, for
+example), when required for performance, our implementation is entirely memory
+safe. The only <a
href="https://github.com/search?q=repo%3Aclflushopt%2Ftpchgen-rs%20unsafe&amp;type=code">unsafe</a>
code is used to <a
href="https://github.com/clflushopt/tpchgen-rs/blob/c651da1fc309f9cb3872cbdf71e4796904dc62c6/tpchgen/src/text.rs#L72">skip</a>
UTF8 validation on known ASCII
+strings.</p>
+<p><img alt="Lamb Theory on Evolution of Systems Languages"
class="img-responsive"
src="/blog/images/fastest-tpch-generator/lamb-theory.png"
width="80%"/></p>
+<p><strong>Figure 4</strong>: Lamb Theory of System Language
Evolution from <a
href="https://midas.bu.edu/assets/slides/andrew_lamb_slides.pdf">Boston
University
+MiDAS Fall 2024 (Data Systems Seminar)</a>, <a
href="https://www.youtube.com/watch?v=CpnxuBwHbUc">recording</a>.
Special
+thanks to <a
href="https://x.com/KurtFehlhauer">@KurtFehlhauer</a></p>
+<h1>How: The Journey</h1>
+<p>We did it together as a team in the open over the course of a few
weeks.
+<a href="https://github.com/lmwnshn">Wan Shen Lim</a> inspired the
project by pointing out the benefits of <a
href="https://github.com/apache/datafusion/issues/14373">easy TPC-H
+dataset creation</a> and <a
href="https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600">suggesting
we check out a Java port on February 11,
+2025</a>. Achraf made <a
href="https://github.com/clflushopt/tpchgen-rs/commit/53d3402680422a15349ece0a7ea3c3f001018ba0">first
commit a few days later</a> on February 16, and <a
href="https://github.com/clflushopt/tpchgen-rs/commit/9bb386a4c55b8cf93ffac1b98f29b5da990ee79e">Andrew
+and Sean started helping on March 8, 2025</a> and we <a
href="https://crates.io/crates/tpchgen/0.1.0">released version 0.1</a>
on
+March 30, 2025.</p>
+<h2>Optimizing Single Threaded Performance</h2>
+<p>Archaf <a
href="https://github.com/clflushopt/tpchgen-rs/pull/16">completed the end to
end conformance tests</a>, to ensure correctness, and
+an initial <a
href="https://github.com/clflushopt/tpchgen-rs/pull/12">cli check
in</a> on March 15, 2025.</p>
+<p>On a Macbook Pro M3 (Nov 2023), the initial performance numbers were
actually
+slower than the original Java implementation which was ported 😭. This
wasn&rsquo;t
+surprising since the focus of the first version was to get a byte of byte
+compatible port, and knew about the performance shortcomings and how to
approach
+them.</p>
+<table>
+<tr>
+<td><strong>Scale Factor</strong>
+</td>
+<td><strong>Time</strong>
+</td>
+</tr>
+<tr>
+<td>1
+ </td>
+<td>0m10.307s
+ </td>
+</tr>
+<tr>
+<td>10
+ </td>
+<td>1m26.530s
+ </td>
+</tr>
+<tr>
+<td>100
+ </td>
+<td>14m56.986s
+ </td>
+</tr>
+</table>
+<p><strong>Table 2</strong>: Performance of running <a
href="https://github.com/clflushopt/tpchgen-rs/pull/12">the initial
tpchgen-cli</a>, measured with
+<code>time target/release/tpchgen-cli -s
$SCALE_FACTOR</code></p>
+<p>With this strong foundation we began optimizing the code using
Rust&rsquo;s low level
+memory management to improve performance while retaining memory safely. We
spent
+several days obsessing over low level details and implemented a textbook like
+list of optimizations:</p>
+<ul>
+<li><a
href="https://github.com/clflushopt/tpchgen-rs/pull/19">Avoiding startup
overhead</a>,</li>
+<li><a
href="https://github.com/clflushopt/tpchgen-rs/pull/26">not</a> <a
href="https://github.com/clflushopt/tpchgen-rs/pull/32">copying</a>
strings (many more PRs as well)</li>
+<li><a
href="https://github.com/clflushopt/tpchgen-rs/pull/27">Rust&rsquo;s
zero overhead abstractions for dates</a></li>
+<li><a
href="https://github.com/clflushopt/tpchgen-rs/pull/35">Static
strings</a> (entirely safely with static lifetimes)</li>
+<li><a
href="https://github.com/clflushopt/tpchgen-rs/pull/33">Generics to avoid
virtual function call overhead</a></li>
+<li><a
href="https://github.com/clflushopt/tpchgen-rs/pull/62">Moving lookups from
runtime</a> to load time</li>
+</ul>
+<p>At the time of writing, single threaded performance is now 2.5x-2.7x
faster than the initial version, as shown in Table 3.</p>
+<table>
+<tr>
+<td><strong>Scale Factor</strong>
+</td>
+<td><strong>Time</strong>
+</td>
+<td><strong>Times faster</strong>
+</td>
+</tr>
+<tr>
+<td>1
+ </td>
+<td>0m4.079s
+ </td>
+<td>2.5x
+ </td>
+</tr>
+<tr>
+<td>10
+ </td>
+<td>0m31.616s
+ </td>
+<td>2.7x
+ </td>
+</tr>
+<tr>
+<td>100
+ </td>
+<td>5m28.083s
+ </td>
+<td>2.7x
+ </td>
+</tr>
+</table>
+<p><strong>Table 3</strong>: Single threaded
<code>tpchgen-cli</code> performance, measured with
<code>time target/release/tpchgen-cli -s $SCALE_FACTOR
--num-threads=1</code></p>
+<h2>Multi-threading</h2>
+<p>Then we applied <a
href="https://doc.rust-lang.org/book/ch16-00-concurrency.html">Rust&rsquo;s
fearless concurrency</a> &ndash; with a single, <a
href="https://github.com/clflushopt/tpchgen-rs/commit/ab720a70cdc80a711f4a3dda6bac05445106f499">small
PR</a> (272
+net new lines) we updated the same memory safe code to run with multiple
threads
+and consume bounded memory using <a
href="https://thenewstack.io/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/">tokio
for the thread scheduler</a><sup>4</sup>.</p>
+<p>As shown in Table 4, with this change, tpchgen-cli generates the full
SF=100
+dataset in 32 seconds (which is 3.3 GB/sec 🤯). Further investigation reveals
+that at SF=100 our generator is actually IO bound (which is not the case for
+<code>dbgen</code> or <code>duckdb</code>) &ndash;
it creates data <strong>faster than can be written to an
SSD</strong>.
+When writing to <code>/dev/null</code> tpchgen generates the
entire dataset in 25 seconds
+(4 GB/s).</p>
+<table>
+<tr>
+<td><strong>Scale Factor</strong>
+</td>
+<td><strong>Time</strong>
+</td>
+<td><strong>Times faster than initial implementation</strong>
+</td>
+<td><strong>Times faster than optimized single
threaded</strong>
+</td>
+</tr>
+<tr>
+<td>1
+ </td>
+<td>0m1.369s
+ </td>
+<td>7.3x
+ </td>
+<td>3x
+ </td>
+</tr>
+<tr>
+<td>10
+ </td>
+<td>0m3.828s
+ </td>
+<td>22.6x
+ </td>
+<td>8.2x
+ </td>
+</tr>
+<tr>
+<td>100
+ </td>
+<td>0m32.615s
+ </td>
+<td>27.5x
+ </td>
+<td>10x
+ </td>
+</tr>
+<tr>
+<td>100 (to /dev/null)
+ </td>
+<td>0m25.088s
+ </td>
+<td>35.7x
+ </td>
+<td>13.1x
+ </td>
+</tr>
+</table>
+<p><strong>Table 4</strong>: tpchgen-cli (multithreaded)
performance measured with <code>time target/release/tpchgen-cli -s
$SCALE_FACTOR</code></p>
+<p>Using Rust and async streams, the data generator is also fully
streaming: memory
+use does not increase with increasing data size / scale
factors<sup>5</sup>. The DuckDB
+generator seems to <a
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator">require
far more memory</a> than is commonly available on
+developer laptops and memory use increases with scale factor. With
<code>tpchgen-cli</code>
+it is perfectly possible to create data for SF=10000 or larger on a machine
with
+16GB of memory (assuming sufficient storage capacity).</p>
+<h2>Direct to parquet</h2>
+<p>At this point, <code>tpchgen-cli</code> could very
quickly generate the TBL format.
+However, as described above, the TBL is annoying to work with,
because</p>
+<ol>
+<li>It has no header</li>
+<li>It is like a CSV but the delimiter is
<code>|</code></li>
+<li>Each line ends with an extra <code>|</code> delimiter
before the newline 🙄</li>
+<li>No system that we know can read them without additional
configuration.</li>
+</ol>
+<p>We next <a
href="https://github.com/clflushopt/tpchgen-rs/pull/54">added support for
CSV</a> generation (special thanks <a
href="https://github.com/niebayes">@niebayes</a> from
+Datalayers for finding and <a
href="https://github.com/clflushopt/tpchgen-rs/issues/73">fixing</a>
<a
href="https://github.com/clflushopt/tpchgen-rs/issues/65">bugs</a>)
which performs at the same
+speed as TBL. While CSV files are far more standard than TBL, they must still
be
+parsed prior to load and automatic type inference may not deduce the types
+needed for the TPC-H benchmarks (e.g. floating point vs Decimal).</p>
+<p>What would be far more useful is a typed, efficient columnar format
such as
+Apache Parquet which is supported by all modern query engines. So we <a
href="https://github.com/clflushopt/tpchgen-rs/pull/71">made</a> a
+<a href="https://crates.io/crates/tpchgen-arrow">tpchgen-arrow</a>
crate to create <a href="https://arrow.apache.org/">Apache
Arrow</a> arrays directly and then <a
href="https://github.com/clflushopt/tpchgen-rs/pull/61">a small
+300 line PR</a> to feed those arrays to the <a
href="https://crates.io/crates/parquet">Rust Parquet writer</a>, again
using
+tokio for parallelized but memory bound work.</p>
+<p>This approach was simple, fast and scalable, as shown in Table 5.
Even though
+creating Parquet files is significantly more computationally expensive than TBL
+or CSV, tpchgen-cli creates the full SF=100 parquet format dataset in less than
+45 seconds.</p>
+<table>
+<tr>
+<td><strong>Scale Factor</strong>
+</td>
+<td><strong>Time to generate Parquet</strong>
+</td>
+<td><strong>Speed compared to tbl generation</strong>
+</td>
+</tr>
+<tr>
+<td>1
+ </td>
+<td>0m1.649s
+ </td>
+<td>0.8x
+ </td>
+</tr>
+<tr>
+<td>10
+ </td>
+<td>0m5.643s
+ </td>
+<td>0.7x
+ </td>
+</tr>
+<tr>
+<td>100
+ </td>
+<td>0m45.243s
+ </td>
+<td>0.7x
+ </td>
+</tr>
+<tr>
+<td>100 (to /dev/null)
+ </td>
+<td>0m45.153s
+ </td>
+<td>0.5x
+ </td>
+</tr>
+</table>
+<p><strong>Table 5</strong>:
<code>tpchgen-cli</code> Parquet generation performance measured
with <code>time
+target/release/tpchgen-cli -s $SCALE_FACTOR
--format=parquet</code></p>
+<h1>Conclusion 👊🎤</h1>
+<p>In just a few days, with some fellow database nerds and the power of
Rust, we built something 10x better than what currently exists. We hope it
inspires more research
+into analytical systems using the TPC-H dataset and that people build awesome
+things with it. For example, Sean has already added <a
href="https://github.com/GlareDB/glaredb/pull/3549">on-demand generation of
+tables to GlareDB</a>. Please consider joining us and helping out at
+<a
href="https://github.com/clflushopt/tpchgen-rs">https://github.com/clflushopt/tpchgen-rs</a>.</p>
+<p>We met while working together on Apache DataFusion in various
capacities. If you
+are looking for a community of like minded people hacking on databases, we
+welcome you to <a
href="https://datafusion.apache.org/contributor-guide/communication.html">come
join us</a>. We are in the process of integrating this into
+DataFusion (see <a
href="https://github.com/apache/datafusion/issues/14608">apache/datafusion#14608</a>)
if you are interested in helping 🎣</p>
+<h1>About the Authors:</h1>
+<ul>
+<li><a href="https://www.linkedin.com/in/andrewalamb/">Andrew
Lamb</a> (<a href="https://github.com/alamb">@alamb</a>) is a
Staff Engineer at <a
href="https://www.influxdata.com/">InfluxData</a> and a PMC member of
<a href="https://datafusion.apache.org/">Apache DataFusion</a> and
<a href="https://arrow.apache.org/">Apache Arrow</a>.</li>
+<li>Achraf B (<a
href="https://github.com/clflushopt">@clflushopt</a>) is a Software
Engineer at <a href="https://optable.co/">Optable</a> where he
works on data infrastructure.</li>
+<li><a href="https://www.linkedin.com/in/scsmithr/">Sean
Smith</a> (<a
href="https://github.com/scsmithr">@scsmithr</a>) is the founder of
<a href="https://glaredb.com/">GlareDB</a> focused on building a
fast analytics database.</li>
+</ul>
+<!-- Footnotes themselves at the bottom. -->
+<h2>Footnotes</h2>
+<p><em>1</em>: Actual Time:
<code>30:35</code></p>
+<p><em>2</em>: It is possible to embed the dbgen code, which
appears to be the approach taken by DuckDB. This approach was tried in GlareDB
(<a
href="https://github.com/GlareDB/glaredb/pull/3313">GlareDB/glaredb#3313</a>),
but ultimately shelved given the amount of effort needed to adapt and isolate
the dbgen code.</p>
+<p><em>3</em>: It is pretty amazing to imagine the machine
required to generate SF300 that had 1.8TB (!!) of RAM</p>
+<p><em>4</em>: We tried to <a
href="https://github.com/clflushopt/tpchgen-rs/pull/34">use Rayon (see
discussion here)</a>, but could not easily keep memory bounded.</p>
+<p><em>5</em>: <code>tpchgen-cli</code> memory
usage is a function of the number of threads: each thread needs some buffer
space</p></content><category
term="blog"></category></entry><entry><title>Apache DataFusion Python 46.0.0
Released</title><link
href="https://datafusion.apache.org/blog/2025/03/30/datafusion-python-46.0.0"
rel="alternate"></link><published>2025-03-30T00:00:00+00:00</published><updated>2025-03-30T00:00:00+00:00</updated><author><na
[...]
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
diff --git a/output/feeds/andrew-lamb-achraf-b-and-sean-smith.atom.xml
b/output/feeds/andrew-lamb-achraf-b-and-sean-smith.atom.xml
new file mode 100644
index 0000000..21b8660
--- /dev/null
+++ b/output/feeds/andrew-lamb-achraf-b-and-sean-smith.atom.xml
@@ -0,0 +1,470 @@
+<?xml version="1.0" encoding="utf-8"?>
+<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog -
Andrew Lamb, Achraf B, and Sean Smith</title><link
href="https://datafusion.apache.org/blog/" rel="alternate"></link><link
href="https://datafusion.apache.org/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.atom.xml"
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-04-10T00:00:00+00:00</updated><subtitle></subtitle><entry><title>tpchgen-rs
World’s fastest open source TPC-H data generator, [...]
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+-->
+<style>
+/* Table borders */
+table, th, td {
+ border: 1px solid black;
+ border-collapse: collapse;
+}
+th, td {
+ padding: 3px;
+}
+</style>
+<p><strong>TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+
with dbgen</strong>.</p>
+<p>3 members of the <a
href="https://datafusion.apache.org/">Apache DataFusion</a> community
used Rust and open source
+development to build <a
href="https://github.com/clflushopt/tpchgen-rs">tpchgen-rs</a>, a
fully open TPC-H data generator over …</p></summary><content
type="html"><!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+-->
+<style>
+/* Table borders */
+table, th, td {
+ border: 1px solid black;
+ border-collapse: collapse;
+}
+th, td {
+ padding: 3px;
+}
+</style>
+<p><strong>TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+
with dbgen</strong>.</p>
+<p>3 members of the <a
href="https://datafusion.apache.org/">Apache DataFusion</a> community
used Rust and open source
+development to build <a
href="https://github.com/clflushopt/tpchgen-rs">tpchgen-rs</a>, a
fully open TPC-H data generator over 20x
+faster than any other implementation we know of.</p>
+<p>It is now possible to create the TPC-H SF=100 dataset in 72.23
seconds (1.4 GB/s
+😎) on a Macbook Air M3 with 16GB of memory, compared to the classic
<code>dbgen</code>
+which takes 30 minutes<sup>1</sup> (0.05GB/sec). On the same
machine, it takes less than
+2 minutes to create all 3.6 GB of SF=100 in <a
href="https://parquet.apache.org/">Apache Parquet</a> format, which
takes 44 minutes using <a href="https://duckdb.org">DuckDB</a>.
+It is finally convenient and efficient to run TPC-H queries locally when
testing
+analytical engines such as DataFusion.</p>
+<p><img alt="Time to create TPC-H parquet dataset for Scale Factor
1, 10, 100 and 1000" class="img-responsive"
src="/blog/images/fastest-tpch-generator/parquet-performance.png"
width="80%"/></p>
+<p><strong>Figure 1</strong>: Time to create TPC-H dataset
for Scale Factor (see below) 1, 10,
+100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core
GCP
+VM with 88GB of memory. For Scale Factor(SF) 100
<code>tpchgen</code> takes 1 minute and 14 seconds and
+<a href="https://duckdb.org">DuckDB</a> takes 17 minutes and 48
seconds. For SF=1000, <code>tpchgen</code> takes 10
+minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure
+DuckDB&rsquo;s time as it <a
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator">requires
647 GB of RAM</a>, more than the 88 GB that was
+available on our test machine. The testing methodology is in the
+<a
href="https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md">documentation</a>.</p>
+<p>This blog explains what TPC-H is, how we ported the vintage C data
generator to
+Rust (yes, <a
href="https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/">RWIR</a>)
and optimized its performance over the course of a few weeks
+of part-time work. We began this project so we can easily generate TPC-H data
in
+<a href="https://datafusion.apache.org/">Apache DataFusion</a> and
<a href="https://glaredb.com/">GlareDB</a>.</p>
+<h1>Try it for yourself</h1>
+<p>The tool is entirely open source under the <a
href="https://www.apache.org/licenses/LICENSE-2.0">Apache 2.0
license</a>. Visit the <a
href="https://github.com/clflushopt/tpchgen-rs">tpchgen-rs
repository</a> or try it for yourself by run the following commands after
<a href="https://www.rust-lang.org/tools/install">installing
Rust</a>:</p>
+<pre><code class="language-shell">$ cargo install tpchgen-cli
+
+# create SF=1 in classic TBL format
+$ tpchgen-cli -s 1
+
+# create SF=10 in Parquet
+$ tpchgen-cli -s 10 --format=parquet
+</code></pre>
+<h1>What is TPC-H / dbgen?</h1>
+<p>The popular <a href="https://www.tpc.org/tpch/">TPC-H</a>
benchmark (often referred to as TPCH) helps evaluate the
+performance of database systems on <a
href="https://en.wikipedia.org/wiki/Online_analytical_processing">OLAP</a>
queries, the kind used to build BI
+dashboards.</p>
+<p>TPC-H has become a de facto standard for analytic systems. While
there are <a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf">well
+known</a> limitations as the data and queries do not well represent many
real world
+use cases, the majority of analytic database papers and industrial systems
still
+use TPC-H query performance benchmarks as a baseline. You will inevitably find
+multiple results for &ldquo;<code>TPCH Performance &lt;your
favorite database&gt;</code>&rdquo; in any
+search engine.</p>
+<p>The benchmark was created at a time when access to high performance
analytical
+systems was not widespread, so the <a
href="https://www.tpc.org/">Transaction Processing Performance
Council</a>
+defined a process of formal result verification. More recently, given the broad
+availability of free and open source database systems, it is common for users
to
+run and verify TPC-H performance themselves.</p>
+<p>TPC-H simulates a business environment with eight tables:
<code>REGION</code>, <code>NATION</code>,
+<code>SUPPLIER</code>, <code>CUSTOMER</code>,
<code>PART</code>, <code>PARTSUPP</code>,
<code>ORDERS</code>, and <code>LINEITEM</code>. These
+tables are linked by foreign keys in a normalized schema representing a supply
+chain with parts, suppliers, customers and orders. The benchmark itself is 22
+SQL queries containing joins, aggregations, and sorting operations.</p>
+<p>The queries run against data created with <code><a
href="https://github.com/electrum/tpch-dbgen">dbgen</a></code>,
a program
+written in a pre <a
href="https://en.wikipedia.org/wiki/C99">C-99</a> dialect, which
generates data in a format called <em>TBL</em>
+(example in Figure 2). <code>dbgen</code> creates data for each of
the 8 tables for a
+certain <em>Scale Factor</em>, commonly abbreviated as SF. Example
Scale Factors and
+corresponding dataset sizes are shown in Table 1. There is no theoretical upper
+bound on the Scale Factor.</p>
+<pre><code
class="language-text">103|2844|845|3|23|40177.32|0.01|0.04|N|O|1996-09-11|1996-09-18|1996-09-26|NONE|FOB|ironic
accou|
+229|10540|801|6|29|42065.66|0.04|0.00|R|F|1994-01-14|1994-02-16|1994-01-22|NONE|FOB|uriously
pending |
+263|2396|649|1|22|28564.58|0.06|0.08|R|F|1994-08-24|1994-06-20|1994-09-09|NONE|FOB|efully
express fo|
+327|4172|427|2|9|9685.53|0.09|0.05|A|F|1995-05-24|1995-07-11|1995-06-05|NONE|AIR|
asymptotes are fu|
+450|5627|393|4|40|61304.80|0.05|0.03|R|F|1995-03-20|1995-05-25|1995-04-14|NONE|RAIL|ve.
asymptote|
+</code></pre>
+<p><strong>Figure 2</strong>: Example TBL formatted output
of <code>dbgen</code> for the <code>LINEITEM</code>
table</p>
+<table>
+<tr>
+<td><strong>Scale Factor</strong>
+</td>
+<td><strong>Data Size (TBL)</strong>
+</td>
+<td><strong>Data Size (Parquet)</strong>
+</td>
+</tr>
+<tr>
+<td>0.1
+ </td>
+<td>103 Mb
+ </td>
+<td>31 Mb
+ </td>
+</tr>
+<tr>
+<td>1
+ </td>
+<td>1 Gb
+ </td>
+<td>340 Mb
+ </td>
+</tr>
+<tr>
+<td>10
+ </td>
+<td>10 Gb
+ </td>
+<td>3.6 Gb
+ </td>
+</tr>
+<tr>
+<td>100
+ </td>
+<td>107 Gb
+ </td>
+<td>38 Gb
+ </td>
+</tr>
+<tr>
+<td>1000
+ </td>
+<td>1089 Gb
+ </td>
+<td>379 Gb
+ </td>
+</tr>
+</table>
+<p><strong>Table 1</strong>: TPC-H data set sizes at
different scale factors for both TBL and <a
href="https://parquet.apache.org/">Apache Parquet</a>.</p>
+<h1>Why do we need a new TPC-H Data generator?</h1>
+<p>Despite the known limitations of the TPC-H benchmark, it is so well
known that it
+is used frequently in database performance analysis. To run TPC-H, you must
first
+load the data, using <code>dbgen</code>, which is not ideal for
several reasons:</p>
+<ol>
+<li>You must find and compile a copy of the 15+ year old C program (for
example <a
href="https://github.com/electrum/tpch-dbgen">electrum/tpch-dbgen</a>)</li>
+<li><code>dbgen</code> requires substantial time (Figure 3)
and is not able to use more than one core.</li>
+<li>It outputs TBL format, which typically requires loading into your
database (for example, <a
href="https://github.com/apache/datafusion/blob/507f6b6773deac69dd9d90dbe60831f5ea5abed1/datafusion/sqllogictest/test_files/tpch/create_tables.slt.part#L24-L124">here
is how to do so</a> in Apache DataFusion) prior to query.</li>
+<li>The implementation makes substantial assumptions about the operating
environment, making it difficult to extend or embed into other
systems.<sup>2</sup></li>
+</ol>
+<p><img alt="Time to generate TPC-H data in TBL format"
class="img-responsive"
src="/blog/images/fastest-tpch-generator/tbl-performance.png"
width="80%"/></p>
+<p><strong>Figure 3</strong>: Time to generate TPC-H data in
TBL format. <code>tpchgen</code> is
+shown in blue. <code>tpchgen</code> restricted to a single core is
shown in red. Unmodified
+<code>dbgen</code> is shown in green and
<code>dbgen</code> modified to use <code>-O3</code>
optimization level
+is shown in yellow.</p>
+<p><code>dbgen</code> is so inconvenient and takes so long
that vendors often provide
+preloaded TPC-H data, for example <a
href="https://docs.snowflake.com/en/user-guide/sample-data-tpch">Snowflake
Sample Data</a>, <a
href="https://docs.databricks.com/aws/en/discover/databricks-datasets">Databricks
Sample
+datasets</a> and <a
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets">DuckDB
Pre-Generated Data Sets</a>.</p>
+<p>In addition to pre-generated datasets, DuckDB also provides a <a
href="https://duckdb.org/docs/stable/extensions/tpch.html">TPC-H
extension</a>
+for generating TPC-H datasets within DuckDB. This is so much easier to use than
+the current alternatives that it leads many researchers and other thought
+leaders to use DuckDB to evaluate new ideas. For example, <a
href="https://github.com/lmwnshn">Wan Shen
+Lim</a> explicitly <a
href="https://github.com/apache/datafusion/issues/14373">mentioned the ease
of creating the TPC-H dataset</a> as one reason
+the first student project of <a
href="https://15799.courses.cs.cmu.edu/spring2025/">CMU-799 Spring
2025</a> used DuckDB.</p>
+<p>As beneficial as the DuckDB TPC-H extension is, it is non-ideal for
several reasons:</p>
+<ol>
+<li>Creates data in a proprietary format, which requires export to use
in other systems.</li>
+<li>Requires significant time (e.g. 17 minutes for Scale Factor
10).</li>
+<li>Requires unnecessarily large amounts of memory (e.g. 71 GB for Scale
Factor 10)</li>
+</ol>
+<p>The above limitations makes it impractical to generate Scale Factor
100 and
+above on laptops or standard workstations, though DuckDB offers <a
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets">pre-computed
+files</a> for larger factors<sup>3</sup>.</p>
+<h1>Why Rust?</h1>
+<p>Realistically we used Rust because we wanted to integrate the data
generator
+into <a href="https://datafusion.apache.org/">Apache
DataFusion</a> and <a
href="https://glaredb.com/">GlareDB</a>. However, we also believe Rust
is
+superior to C/C++ due to its comparable performance, but much higher programmer
+productivity (Figure 4). Productivity in this case refers to the ease of
+optimizing and adding multithreading without introducing hard to debug memory
+safety or concurrency issues.</p>
+<p>While Rust does allow unsafe access to memory (eliding bounds
checking, for
+example), when required for performance, our implementation is entirely memory
+safe. The only <a
href="https://github.com/search?q=repo%3Aclflushopt%2Ftpchgen-rs%20unsafe&amp;type=code">unsafe</a>
code is used to <a
href="https://github.com/clflushopt/tpchgen-rs/blob/c651da1fc309f9cb3872cbdf71e4796904dc62c6/tpchgen/src/text.rs#L72">skip</a>
UTF8 validation on known ASCII
+strings.</p>
+<p><img alt="Lamb Theory on Evolution of Systems Languages"
class="img-responsive"
src="/blog/images/fastest-tpch-generator/lamb-theory.png"
width="80%"/></p>
+<p><strong>Figure 4</strong>: Lamb Theory of System Language
Evolution from <a
href="https://midas.bu.edu/assets/slides/andrew_lamb_slides.pdf">Boston
University
+MiDAS Fall 2024 (Data Systems Seminar)</a>, <a
href="https://www.youtube.com/watch?v=CpnxuBwHbUc">recording</a>.
Special
+thanks to <a
href="https://x.com/KurtFehlhauer">@KurtFehlhauer</a></p>
+<h1>How: The Journey</h1>
+<p>We did it together as a team in the open over the course of a few
weeks.
+<a href="https://github.com/lmwnshn">Wan Shen Lim</a> inspired the
project by pointing out the benefits of <a
href="https://github.com/apache/datafusion/issues/14373">easy TPC-H
+dataset creation</a> and <a
href="https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600">suggesting
we check out a Java port on February 11,
+2025</a>. Achraf made <a
href="https://github.com/clflushopt/tpchgen-rs/commit/53d3402680422a15349ece0a7ea3c3f001018ba0">first
commit a few days later</a> on February 16, and <a
href="https://github.com/clflushopt/tpchgen-rs/commit/9bb386a4c55b8cf93ffac1b98f29b5da990ee79e">Andrew
+and Sean started helping on March 8, 2025</a> and we <a
href="https://crates.io/crates/tpchgen/0.1.0">released version 0.1</a>
on
+March 30, 2025.</p>
+<h2>Optimizing Single Threaded Performance</h2>
+<p>Archaf <a
href="https://github.com/clflushopt/tpchgen-rs/pull/16">completed the end to
end conformance tests</a>, to ensure correctness, and
+an initial <a
href="https://github.com/clflushopt/tpchgen-rs/pull/12">cli check
in</a> on March 15, 2025.</p>
+<p>On a Macbook Pro M3 (Nov 2023), the initial performance numbers were
actually
+slower than the original Java implementation which was ported 😭. This
wasn&rsquo;t
+surprising since the focus of the first version was to get a byte of byte
+compatible port, and knew about the performance shortcomings and how to
approach
+them.</p>
+<table>
+<tr>
+<td><strong>Scale Factor</strong>
+</td>
+<td><strong>Time</strong>
+</td>
+</tr>
+<tr>
+<td>1
+ </td>
+<td>0m10.307s
+ </td>
+</tr>
+<tr>
+<td>10
+ </td>
+<td>1m26.530s
+ </td>
+</tr>
+<tr>
+<td>100
+ </td>
+<td>14m56.986s
+ </td>
+</tr>
+</table>
+<p><strong>Table 2</strong>: Performance of running <a
href="https://github.com/clflushopt/tpchgen-rs/pull/12">the initial
tpchgen-cli</a>, measured with
+<code>time target/release/tpchgen-cli -s
$SCALE_FACTOR</code></p>
+<p>With this strong foundation we began optimizing the code using
Rust&rsquo;s low level
+memory management to improve performance while retaining memory safely. We
spent
+several days obsessing over low level details and implemented a textbook like
+list of optimizations:</p>
+<ul>
+<li><a
href="https://github.com/clflushopt/tpchgen-rs/pull/19">Avoiding startup
overhead</a>,</li>
+<li><a
href="https://github.com/clflushopt/tpchgen-rs/pull/26">not</a> <a
href="https://github.com/clflushopt/tpchgen-rs/pull/32">copying</a>
strings (many more PRs as well)</li>
+<li><a
href="https://github.com/clflushopt/tpchgen-rs/pull/27">Rust&rsquo;s
zero overhead abstractions for dates</a></li>
+<li><a
href="https://github.com/clflushopt/tpchgen-rs/pull/35">Static
strings</a> (entirely safely with static lifetimes)</li>
+<li><a
href="https://github.com/clflushopt/tpchgen-rs/pull/33">Generics to avoid
virtual function call overhead</a></li>
+<li><a
href="https://github.com/clflushopt/tpchgen-rs/pull/62">Moving lookups from
runtime</a> to load time</li>
+</ul>
+<p>At the time of writing, single threaded performance is now 2.5x-2.7x
faster than the initial version, as shown in Table 3.</p>
+<table>
+<tr>
+<td><strong>Scale Factor</strong>
+</td>
+<td><strong>Time</strong>
+</td>
+<td><strong>Times faster</strong>
+</td>
+</tr>
+<tr>
+<td>1
+ </td>
+<td>0m4.079s
+ </td>
+<td>2.5x
+ </td>
+</tr>
+<tr>
+<td>10
+ </td>
+<td>0m31.616s
+ </td>
+<td>2.7x
+ </td>
+</tr>
+<tr>
+<td>100
+ </td>
+<td>5m28.083s
+ </td>
+<td>2.7x
+ </td>
+</tr>
+</table>
+<p><strong>Table 3</strong>: Single threaded
<code>tpchgen-cli</code> performance, measured with
<code>time target/release/tpchgen-cli -s $SCALE_FACTOR
--num-threads=1</code></p>
+<h2>Multi-threading</h2>
+<p>Then we applied <a
href="https://doc.rust-lang.org/book/ch16-00-concurrency.html">Rust&rsquo;s
fearless concurrency</a> &ndash; with a single, <a
href="https://github.com/clflushopt/tpchgen-rs/commit/ab720a70cdc80a711f4a3dda6bac05445106f499">small
PR</a> (272
+net new lines) we updated the same memory safe code to run with multiple
threads
+and consume bounded memory using <a
href="https://thenewstack.io/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/">tokio
for the thread scheduler</a><sup>4</sup>.</p>
+<p>As shown in Table 4, with this change, tpchgen-cli generates the full
SF=100
+dataset in 32 seconds (which is 3.3 GB/sec 🤯). Further investigation reveals
+that at SF=100 our generator is actually IO bound (which is not the case for
+<code>dbgen</code> or <code>duckdb</code>) &ndash;
it creates data <strong>faster than can be written to an
SSD</strong>.
+When writing to <code>/dev/null</code> tpchgen generates the
entire dataset in 25 seconds
+(4 GB/s).</p>
+<table>
+<tr>
+<td><strong>Scale Factor</strong>
+</td>
+<td><strong>Time</strong>
+</td>
+<td><strong>Times faster than initial implementation</strong>
+</td>
+<td><strong>Times faster than optimized single
threaded</strong>
+</td>
+</tr>
+<tr>
+<td>1
+ </td>
+<td>0m1.369s
+ </td>
+<td>7.3x
+ </td>
+<td>3x
+ </td>
+</tr>
+<tr>
+<td>10
+ </td>
+<td>0m3.828s
+ </td>
+<td>22.6x
+ </td>
+<td>8.2x
+ </td>
+</tr>
+<tr>
+<td>100
+ </td>
+<td>0m32.615s
+ </td>
+<td>27.5x
+ </td>
+<td>10x
+ </td>
+</tr>
+<tr>
+<td>100 (to /dev/null)
+ </td>
+<td>0m25.088s
+ </td>
+<td>35.7x
+ </td>
+<td>13.1x
+ </td>
+</tr>
+</table>
+<p><strong>Table 4</strong>: tpchgen-cli (multithreaded)
performance measured with <code>time target/release/tpchgen-cli -s
$SCALE_FACTOR</code></p>
+<p>Using Rust and async streams, the data generator is also fully
streaming: memory
+use does not increase with increasing data size / scale
factors<sup>5</sup>. The DuckDB
+generator seems to <a
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator">require
far more memory</a> than is commonly available on
+developer laptops and memory use increases with scale factor. With
<code>tpchgen-cli</code>
+it is perfectly possible to create data for SF=10000 or larger on a machine
with
+16GB of memory (assuming sufficient storage capacity).</p>
+<h2>Direct to parquet</h2>
+<p>At this point, <code>tpchgen-cli</code> could very
quickly generate the TBL format.
+However, as described above, the TBL is annoying to work with,
because</p>
+<ol>
+<li>It has no header</li>
+<li>It is like a CSV but the delimiter is
<code>|</code></li>
+<li>Each line ends with an extra <code>|</code> delimiter
before the newline 🙄</li>
+<li>No system that we know can read them without additional
configuration.</li>
+</ol>
+<p>We next <a
href="https://github.com/clflushopt/tpchgen-rs/pull/54">added support for
CSV</a> generation (special thanks <a
href="https://github.com/niebayes">@niebayes</a> from
+Datalayers for finding and <a
href="https://github.com/clflushopt/tpchgen-rs/issues/73">fixing</a>
<a
href="https://github.com/clflushopt/tpchgen-rs/issues/65">bugs</a>)
which performs at the same
+speed as TBL. While CSV files are far more standard than TBL, they must still
be
+parsed prior to load and automatic type inference may not deduce the types
+needed for the TPC-H benchmarks (e.g. floating point vs Decimal).</p>
+<p>What would be far more useful is a typed, efficient columnar format
such as
+Apache Parquet which is supported by all modern query engines. So we <a
href="https://github.com/clflushopt/tpchgen-rs/pull/71">made</a> a
+<a href="https://crates.io/crates/tpchgen-arrow">tpchgen-arrow</a>
crate to create <a href="https://arrow.apache.org/">Apache
Arrow</a> arrays directly and then <a
href="https://github.com/clflushopt/tpchgen-rs/pull/61">a small
+300 line PR</a> to feed those arrays to the <a
href="https://crates.io/crates/parquet">Rust Parquet writer</a>, again
using
+tokio for parallelized but memory bound work.</p>
+<p>This approach was simple, fast and scalable, as shown in Table 5.
Even though
+creating Parquet files is significantly more computationally expensive than TBL
+or CSV, tpchgen-cli creates the full SF=100 parquet format dataset in less than
+45 seconds.</p>
+<table>
+<tr>
+<td><strong>Scale Factor</strong>
+</td>
+<td><strong>Time to generate Parquet</strong>
+</td>
+<td><strong>Speed compared to tbl generation</strong>
+</td>
+</tr>
+<tr>
+<td>1
+ </td>
+<td>0m1.649s
+ </td>
+<td>0.8x
+ </td>
+</tr>
+<tr>
+<td>10
+ </td>
+<td>0m5.643s
+ </td>
+<td>0.7x
+ </td>
+</tr>
+<tr>
+<td>100
+ </td>
+<td>0m45.243s
+ </td>
+<td>0.7x
+ </td>
+</tr>
+<tr>
+<td>100 (to /dev/null)
+ </td>
+<td>0m45.153s
+ </td>
+<td>0.5x
+ </td>
+</tr>
+</table>
+<p><strong>Table 5</strong>:
<code>tpchgen-cli</code> Parquet generation performance measured
with <code>time
+target/release/tpchgen-cli -s $SCALE_FACTOR
--format=parquet</code></p>
+<h1>Conclusion 👊🎤</h1>
+<p>In just a few days, with some fellow database nerds and the power of
Rust, we built something 10x better than what currently exists. We hope it
inspires more research
+into analytical systems using the TPC-H dataset and that people build awesome
+things with it. For example, Sean has already added <a
href="https://github.com/GlareDB/glaredb/pull/3549">on-demand generation of
+tables to GlareDB</a>. Please consider joining us and helping out at
+<a
href="https://github.com/clflushopt/tpchgen-rs">https://github.com/clflushopt/tpchgen-rs</a>.</p>
+<p>We met while working together on Apache DataFusion in various
capacities. If you
+are looking for a community of like minded people hacking on databases, we
+welcome you to <a
href="https://datafusion.apache.org/contributor-guide/communication.html">come
join us</a>. We are in the process of integrating this into
+DataFusion (see <a
href="https://github.com/apache/datafusion/issues/14608">apache/datafusion#14608</a>)
if you are interested in helping 🎣</p>
+<h1>About the Authors:</h1>
+<ul>
+<li><a href="https://www.linkedin.com/in/andrewalamb/">Andrew
Lamb</a> (<a href="https://github.com/alamb">@alamb</a>) is a
Staff Engineer at <a
href="https://www.influxdata.com/">InfluxData</a> and a PMC member of
<a href="https://datafusion.apache.org/">Apache DataFusion</a> and
<a href="https://arrow.apache.org/">Apache Arrow</a>.</li>
+<li>Achraf B (<a
href="https://github.com/clflushopt">@clflushopt</a>) is a Software
Engineer at <a href="https://optable.co/">Optable</a> where he
works on data infrastructure.</li>
+<li><a href="https://www.linkedin.com/in/scsmithr/">Sean
Smith</a> (<a
href="https://github.com/scsmithr">@scsmithr</a>) is the founder of
<a href="https://glaredb.com/">GlareDB</a> focused on building a
fast analytics database.</li>
+</ul>
+<!-- Footnotes themselves at the bottom. -->
+<h2>Footnotes</h2>
+<p><em>1</em>: Actual Time:
<code>30:35</code></p>
+<p><em>2</em>: It is possible to embed the dbgen code, which
appears to be the approach taken by DuckDB. This approach was tried in GlareDB
(<a
href="https://github.com/GlareDB/glaredb/pull/3313">GlareDB/glaredb#3313</a>),
but ultimately shelved given the amount of effort needed to adapt and isolate
the dbgen code.</p>
+<p><em>3</em>: It is pretty amazing to imagine the machine
required to generate SF300 that had 1.8TB (!!) of RAM</p>
+<p><em>4</em>: We tried to <a
href="https://github.com/clflushopt/tpchgen-rs/pull/34">use Rayon (see
discussion here)</a>, but could not easily keep memory bounded.</p>
+<p><em>5</em>: <code>tpchgen-cli</code> memory
usage is a function of the number of threads: each thread needs some buffer
space</p></content><category term="blog"></category></entry></feed>
\ No newline at end of file
diff --git a/output/feeds/andrew-lamb-achraf-b-and-sean-smith.rss.xml
b/output/feeds/andrew-lamb-achraf-b-and-sean-smith.rss.xml
new file mode 100644
index 0000000..616048a
--- /dev/null
+++ b/output/feeds/andrew-lamb-achraf-b-and-sean-smith.rss.xml
@@ -0,0 +1,30 @@
+<?xml version="1.0" encoding="utf-8"?>
+<rss version="2.0"><channel><title>Apache DataFusion Blog - Andrew Lamb,
Achraf B, and Sean
Smith</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Thu,
10 Apr 2025 00:00:00 +0000</lastBuildDate><item><title>tpchgen-rs World’s
fastest open source TPC-H data generator, written in
Rust</title><link>https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-generator</link><description><!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+-->
+<style>
+/* Table borders */
+table, th, td {
+ border: 1px solid black;
+ border-collapse: collapse;
+}
+th, td {
+ padding: 3px;
+}
+</style>
+<p><strong>TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+
with dbgen</strong>.</p>
+<p>3 members of the <a
href="https://datafusion.apache.org/">Apache DataFusion</a> community
used Rust and open source
+development to build <a
href="https://github.com/clflushopt/tpchgen-rs">tpchgen-rs</a>, a
fully open TPC-H data generator over …</p></description><dc:creator
xmlns:dc="http://purl.org/dc/elements/1.1/">Andrew Lamb, Achraf B, and Sean
Smith</dc:creator><pubDate>Thu, 10 Apr 2025 00:00:00 +0000</pubDate><guid
isPermaLink="false">tag:datafusion.apache.org,2025-04-10:/blog/2025/04/10/fastest-tpch-generator</guid><category>blog</category></item></channel></rss>
\ No newline at end of file
diff --git a/output/feeds/blog.atom.xml b/output/feeds/blog.atom.xml
index dc80333..c3fdf9f 100644
--- a/output/feeds/blog.atom.xml
+++ b/output/feeds/blog.atom.xml
@@ -1,5 +1,473 @@
<?xml version="1.0" encoding="utf-8"?>
-<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog -
blog</title><link href="https://datafusion.apache.org/blog/"
rel="alternate"></link><link
href="https://datafusion.apache.org/blog/feeds/blog.atom.xml"
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-03-30T00:00:00+00:00</updated><subtitle></subtitle><entry><title>Apache
DataFusion Python 46.0.0 Released</title><link
href="https://datafusion.apache.org/blog/2025/03/30/datafusion-python-4 [...]
+<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog -
blog</title><link href="https://datafusion.apache.org/blog/"
rel="alternate"></link><link
href="https://datafusion.apache.org/blog/feeds/blog.atom.xml"
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-04-10T00:00:00+00:00</updated><subtitle></subtitle><entry><title>tpchgen-rs
World’s fastest open source TPC-H data generator, written in Rust</title><link
href="https://datafusion.apache.org [...]
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+-->
+<style>
+/* Table borders */
+table, th, td {
+ border: 1px solid black;
+ border-collapse: collapse;
+}
+th, td {
+ padding: 3px;
+}
+</style>
+<p><strong>TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+
with dbgen</strong>.</p>
+<p>3 members of the <a
href="https://datafusion.apache.org/">Apache DataFusion</a> community
used Rust and open source
+development to build <a
href="https://github.com/clflushopt/tpchgen-rs">tpchgen-rs</a>, a
fully open TPC-H data generator over …</p></summary><content
type="html"><!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+-->
+<style>
+/* Table borders */
+table, th, td {
+ border: 1px solid black;
+ border-collapse: collapse;
+}
+th, td {
+ padding: 3px;
+}
+</style>
+<p><strong>TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+
with dbgen</strong>.</p>
+<p>3 members of the <a
href="https://datafusion.apache.org/">Apache DataFusion</a> community
used Rust and open source
+development to build <a
href="https://github.com/clflushopt/tpchgen-rs">tpchgen-rs</a>, a
fully open TPC-H data generator over 20x
+faster than any other implementation we know of.</p>
+<p>It is now possible to create the TPC-H SF=100 dataset in 72.23
seconds (1.4 GB/s
+😎) on a Macbook Air M3 with 16GB of memory, compared to the classic
<code>dbgen</code>
+which takes 30 minutes<sup>1</sup> (0.05GB/sec). On the same
machine, it takes less than
+2 minutes to create all 3.6 GB of SF=100 in <a
href="https://parquet.apache.org/">Apache Parquet</a> format, which
takes 44 minutes using <a href="https://duckdb.org">DuckDB</a>.
+It is finally convenient and efficient to run TPC-H queries locally when
testing
+analytical engines such as DataFusion.</p>
+<p><img alt="Time to create TPC-H parquet dataset for Scale Factor
1, 10, 100 and 1000" class="img-responsive"
src="/blog/images/fastest-tpch-generator/parquet-performance.png"
width="80%"/></p>
+<p><strong>Figure 1</strong>: Time to create TPC-H dataset
for Scale Factor (see below) 1, 10,
+100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core
GCP
+VM with 88GB of memory. For Scale Factor(SF) 100
<code>tpchgen</code> takes 1 minute and 14 seconds and
+<a href="https://duckdb.org">DuckDB</a> takes 17 minutes and 48
seconds. For SF=1000, <code>tpchgen</code> takes 10
+minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure
+DuckDB&rsquo;s time as it <a
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator">requires
647 GB of RAM</a>, more than the 88 GB that was
+available on our test machine. The testing methodology is in the
+<a
href="https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md">documentation</a>.</p>
+<p>This blog explains what TPC-H is, how we ported the vintage C data
generator to
+Rust (yes, <a
href="https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/">RWIR</a>)
and optimized its performance over the course of a few weeks
+of part-time work. We began this project so we can easily generate TPC-H data
in
+<a href="https://datafusion.apache.org/">Apache DataFusion</a> and
<a href="https://glaredb.com/">GlareDB</a>.</p>
+<h1>Try it for yourself</h1>
+<p>The tool is entirely open source under the <a
href="https://www.apache.org/licenses/LICENSE-2.0">Apache 2.0
license</a>. Visit the <a
href="https://github.com/clflushopt/tpchgen-rs">tpchgen-rs
repository</a> or try it for yourself by run the following commands after
<a href="https://www.rust-lang.org/tools/install">installing
Rust</a>:</p>
+<pre><code class="language-shell">$ cargo install tpchgen-cli
+
+# create SF=1 in classic TBL format
+$ tpchgen-cli -s 1
+
+# create SF=10 in Parquet
+$ tpchgen-cli -s 10 --format=parquet
+</code></pre>
+<h1>What is TPC-H / dbgen?</h1>
+<p>The popular <a href="https://www.tpc.org/tpch/">TPC-H</a>
benchmark (often referred to as TPCH) helps evaluate the
+performance of database systems on <a
href="https://en.wikipedia.org/wiki/Online_analytical_processing">OLAP</a>
queries, the kind used to build BI
+dashboards.</p>
+<p>TPC-H has become a de facto standard for analytic systems. While
there are <a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf">well
+known</a> limitations as the data and queries do not well represent many
real world
+use cases, the majority of analytic database papers and industrial systems
still
+use TPC-H query performance benchmarks as a baseline. You will inevitably find
+multiple results for &ldquo;<code>TPCH Performance &lt;your
favorite database&gt;</code>&rdquo; in any
+search engine.</p>
+<p>The benchmark was created at a time when access to high performance
analytical
+systems was not widespread, so the <a
href="https://www.tpc.org/">Transaction Processing Performance
Council</a>
+defined a process of formal result verification. More recently, given the broad
+availability of free and open source database systems, it is common for users
to
+run and verify TPC-H performance themselves.</p>
+<p>TPC-H simulates a business environment with eight tables:
<code>REGION</code>, <code>NATION</code>,
+<code>SUPPLIER</code>, <code>CUSTOMER</code>,
<code>PART</code>, <code>PARTSUPP</code>,
<code>ORDERS</code>, and <code>LINEITEM</code>. These
+tables are linked by foreign keys in a normalized schema representing a supply
+chain with parts, suppliers, customers and orders. The benchmark itself is 22
+SQL queries containing joins, aggregations, and sorting operations.</p>
+<p>The queries run against data created with <code><a
href="https://github.com/electrum/tpch-dbgen">dbgen</a></code>,
a program
+written in a pre <a
href="https://en.wikipedia.org/wiki/C99">C-99</a> dialect, which
generates data in a format called <em>TBL</em>
+(example in Figure 2). <code>dbgen</code> creates data for each of
the 8 tables for a
+certain <em>Scale Factor</em>, commonly abbreviated as SF. Example
Scale Factors and
+corresponding dataset sizes are shown in Table 1. There is no theoretical upper
+bound on the Scale Factor.</p>
+<pre><code
class="language-text">103|2844|845|3|23|40177.32|0.01|0.04|N|O|1996-09-11|1996-09-18|1996-09-26|NONE|FOB|ironic
accou|
+229|10540|801|6|29|42065.66|0.04|0.00|R|F|1994-01-14|1994-02-16|1994-01-22|NONE|FOB|uriously
pending |
+263|2396|649|1|22|28564.58|0.06|0.08|R|F|1994-08-24|1994-06-20|1994-09-09|NONE|FOB|efully
express fo|
+327|4172|427|2|9|9685.53|0.09|0.05|A|F|1995-05-24|1995-07-11|1995-06-05|NONE|AIR|
asymptotes are fu|
+450|5627|393|4|40|61304.80|0.05|0.03|R|F|1995-03-20|1995-05-25|1995-04-14|NONE|RAIL|ve.
asymptote|
+</code></pre>
+<p><strong>Figure 2</strong>: Example TBL formatted output
of <code>dbgen</code> for the <code>LINEITEM</code>
table</p>
+<table>
+<tr>
+<td><strong>Scale Factor</strong>
+</td>
+<td><strong>Data Size (TBL)</strong>
+</td>
+<td><strong>Data Size (Parquet)</strong>
+</td>
+</tr>
+<tr>
+<td>0.1
+ </td>
+<td>103 Mb
+ </td>
+<td>31 Mb
+ </td>
+</tr>
+<tr>
+<td>1
+ </td>
+<td>1 Gb
+ </td>
+<td>340 Mb
+ </td>
+</tr>
+<tr>
+<td>10
+ </td>
+<td>10 Gb
+ </td>
+<td>3.6 Gb
+ </td>
+</tr>
+<tr>
+<td>100
+ </td>
+<td>107 Gb
+ </td>
+<td>38 Gb
+ </td>
+</tr>
+<tr>
+<td>1000
+ </td>
+<td>1089 Gb
+ </td>
+<td>379 Gb
+ </td>
+</tr>
+</table>
+<p><strong>Table 1</strong>: TPC-H data set sizes at
different scale factors for both TBL and <a
href="https://parquet.apache.org/">Apache Parquet</a>.</p>
+<h1>Why do we need a new TPC-H Data generator?</h1>
+<p>Despite the known limitations of the TPC-H benchmark, it is so well
known that it
+is used frequently in database performance analysis. To run TPC-H, you must
first
+load the data, using <code>dbgen</code>, which is not ideal for
several reasons:</p>
+<ol>
+<li>You must find and compile a copy of the 15+ year old C program (for
example <a
href="https://github.com/electrum/tpch-dbgen">electrum/tpch-dbgen</a>)</li>
+<li><code>dbgen</code> requires substantial time (Figure 3)
and is not able to use more than one core.</li>
+<li>It outputs TBL format, which typically requires loading into your
database (for example, <a
href="https://github.com/apache/datafusion/blob/507f6b6773deac69dd9d90dbe60831f5ea5abed1/datafusion/sqllogictest/test_files/tpch/create_tables.slt.part#L24-L124">here
is how to do so</a> in Apache DataFusion) prior to query.</li>
+<li>The implementation makes substantial assumptions about the operating
environment, making it difficult to extend or embed into other
systems.<sup>2</sup></li>
+</ol>
+<p><img alt="Time to generate TPC-H data in TBL format"
class="img-responsive"
src="/blog/images/fastest-tpch-generator/tbl-performance.png"
width="80%"/></p>
+<p><strong>Figure 3</strong>: Time to generate TPC-H data in
TBL format. <code>tpchgen</code> is
+shown in blue. <code>tpchgen</code> restricted to a single core is
shown in red. Unmodified
+<code>dbgen</code> is shown in green and
<code>dbgen</code> modified to use <code>-O3</code>
optimization level
+is shown in yellow.</p>
+<p><code>dbgen</code> is so inconvenient and takes so long
that vendors often provide
+preloaded TPC-H data, for example <a
href="https://docs.snowflake.com/en/user-guide/sample-data-tpch">Snowflake
Sample Data</a>, <a
href="https://docs.databricks.com/aws/en/discover/databricks-datasets">Databricks
Sample
+datasets</a> and <a
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets">DuckDB
Pre-Generated Data Sets</a>.</p>
+<p>In addition to pre-generated datasets, DuckDB also provides a <a
href="https://duckdb.org/docs/stable/extensions/tpch.html">TPC-H
extension</a>
+for generating TPC-H datasets within DuckDB. This is so much easier to use than
+the current alternatives that it leads many researchers and other thought
+leaders to use DuckDB to evaluate new ideas. For example, <a
href="https://github.com/lmwnshn">Wan Shen
+Lim</a> explicitly <a
href="https://github.com/apache/datafusion/issues/14373">mentioned the ease
of creating the TPC-H dataset</a> as one reason
+the first student project of <a
href="https://15799.courses.cs.cmu.edu/spring2025/">CMU-799 Spring
2025</a> used DuckDB.</p>
+<p>As beneficial as the DuckDB TPC-H extension is, it is non-ideal for
several reasons:</p>
+<ol>
+<li>Creates data in a proprietary format, which requires export to use
in other systems.</li>
+<li>Requires significant time (e.g. 17 minutes for Scale Factor
10).</li>
+<li>Requires unnecessarily large amounts of memory (e.g. 71 GB for Scale
Factor 10)</li>
+</ol>
+<p>The above limitations makes it impractical to generate Scale Factor
100 and
+above on laptops or standard workstations, though DuckDB offers <a
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets">pre-computed
+files</a> for larger factors<sup>3</sup>.</p>
+<h1>Why Rust?</h1>
+<p>Realistically we used Rust because we wanted to integrate the data
generator
+into <a href="https://datafusion.apache.org/">Apache
DataFusion</a> and <a
href="https://glaredb.com/">GlareDB</a>. However, we also believe Rust
is
+superior to C/C++ due to its comparable performance, but much higher programmer
+productivity (Figure 4). Productivity in this case refers to the ease of
+optimizing and adding multithreading without introducing hard to debug memory
+safety or concurrency issues.</p>
+<p>While Rust does allow unsafe access to memory (eliding bounds
checking, for
+example), when required for performance, our implementation is entirely memory
+safe. The only <a
href="https://github.com/search?q=repo%3Aclflushopt%2Ftpchgen-rs%20unsafe&amp;type=code">unsafe</a>
code is used to <a
href="https://github.com/clflushopt/tpchgen-rs/blob/c651da1fc309f9cb3872cbdf71e4796904dc62c6/tpchgen/src/text.rs#L72">skip</a>
UTF8 validation on known ASCII
+strings.</p>
+<p><img alt="Lamb Theory on Evolution of Systems Languages"
class="img-responsive"
src="/blog/images/fastest-tpch-generator/lamb-theory.png"
width="80%"/></p>
+<p><strong>Figure 4</strong>: Lamb Theory of System Language
Evolution from <a
href="https://midas.bu.edu/assets/slides/andrew_lamb_slides.pdf">Boston
University
+MiDAS Fall 2024 (Data Systems Seminar)</a>, <a
href="https://www.youtube.com/watch?v=CpnxuBwHbUc">recording</a>.
Special
+thanks to <a
href="https://x.com/KurtFehlhauer">@KurtFehlhauer</a></p>
+<h1>How: The Journey</h1>
+<p>We did it together as a team in the open over the course of a few
weeks.
+<a href="https://github.com/lmwnshn">Wan Shen Lim</a> inspired the
project by pointing out the benefits of <a
href="https://github.com/apache/datafusion/issues/14373">easy TPC-H
+dataset creation</a> and <a
href="https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600">suggesting
we check out a Java port on February 11,
+2025</a>. Achraf made <a
href="https://github.com/clflushopt/tpchgen-rs/commit/53d3402680422a15349ece0a7ea3c3f001018ba0">first
commit a few days later</a> on February 16, and <a
href="https://github.com/clflushopt/tpchgen-rs/commit/9bb386a4c55b8cf93ffac1b98f29b5da990ee79e">Andrew
+and Sean started helping on March 8, 2025</a> and we <a
href="https://crates.io/crates/tpchgen/0.1.0">released version 0.1</a>
on
+March 30, 2025.</p>
+<h2>Optimizing Single Threaded Performance</h2>
+<p>Archaf <a
href="https://github.com/clflushopt/tpchgen-rs/pull/16">completed the end to
end conformance tests</a>, to ensure correctness, and
+an initial <a
href="https://github.com/clflushopt/tpchgen-rs/pull/12">cli check
in</a> on March 15, 2025.</p>
+<p>On a Macbook Pro M3 (Nov 2023), the initial performance numbers were
actually
+slower than the original Java implementation which was ported 😭. This
wasn&rsquo;t
+surprising since the focus of the first version was to get a byte of byte
+compatible port, and knew about the performance shortcomings and how to
approach
+them.</p>
+<table>
+<tr>
+<td><strong>Scale Factor</strong>
+</td>
+<td><strong>Time</strong>
+</td>
+</tr>
+<tr>
+<td>1
+ </td>
+<td>0m10.307s
+ </td>
+</tr>
+<tr>
+<td>10
+ </td>
+<td>1m26.530s
+ </td>
+</tr>
+<tr>
+<td>100
+ </td>
+<td>14m56.986s
+ </td>
+</tr>
+</table>
+<p><strong>Table 2</strong>: Performance of running <a
href="https://github.com/clflushopt/tpchgen-rs/pull/12">the initial
tpchgen-cli</a>, measured with
+<code>time target/release/tpchgen-cli -s
$SCALE_FACTOR</code></p>
+<p>With this strong foundation we began optimizing the code using
Rust&rsquo;s low level
+memory management to improve performance while retaining memory safely. We
spent
+several days obsessing over low level details and implemented a textbook like
+list of optimizations:</p>
+<ul>
+<li><a
href="https://github.com/clflushopt/tpchgen-rs/pull/19">Avoiding startup
overhead</a>,</li>
+<li><a
href="https://github.com/clflushopt/tpchgen-rs/pull/26">not</a> <a
href="https://github.com/clflushopt/tpchgen-rs/pull/32">copying</a>
strings (many more PRs as well)</li>
+<li><a
href="https://github.com/clflushopt/tpchgen-rs/pull/27">Rust&rsquo;s
zero overhead abstractions for dates</a></li>
+<li><a
href="https://github.com/clflushopt/tpchgen-rs/pull/35">Static
strings</a> (entirely safely with static lifetimes)</li>
+<li><a
href="https://github.com/clflushopt/tpchgen-rs/pull/33">Generics to avoid
virtual function call overhead</a></li>
+<li><a
href="https://github.com/clflushopt/tpchgen-rs/pull/62">Moving lookups from
runtime</a> to load time</li>
+</ul>
+<p>At the time of writing, single threaded performance is now 2.5x-2.7x
faster than the initial version, as shown in Table 3.</p>
+<table>
+<tr>
+<td><strong>Scale Factor</strong>
+</td>
+<td><strong>Time</strong>
+</td>
+<td><strong>Times faster</strong>
+</td>
+</tr>
+<tr>
+<td>1
+ </td>
+<td>0m4.079s
+ </td>
+<td>2.5x
+ </td>
+</tr>
+<tr>
+<td>10
+ </td>
+<td>0m31.616s
+ </td>
+<td>2.7x
+ </td>
+</tr>
+<tr>
+<td>100
+ </td>
+<td>5m28.083s
+ </td>
+<td>2.7x
+ </td>
+</tr>
+</table>
+<p><strong>Table 3</strong>: Single threaded
<code>tpchgen-cli</code> performance, measured with
<code>time target/release/tpchgen-cli -s $SCALE_FACTOR
--num-threads=1</code></p>
+<h2>Multi-threading</h2>
+<p>Then we applied <a
href="https://doc.rust-lang.org/book/ch16-00-concurrency.html">Rust&rsquo;s
fearless concurrency</a> &ndash; with a single, <a
href="https://github.com/clflushopt/tpchgen-rs/commit/ab720a70cdc80a711f4a3dda6bac05445106f499">small
PR</a> (272
+net new lines) we updated the same memory safe code to run with multiple
threads
+and consume bounded memory using <a
href="https://thenewstack.io/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/">tokio
for the thread scheduler</a><sup>4</sup>.</p>
+<p>As shown in Table 4, with this change, tpchgen-cli generates the full
SF=100
+dataset in 32 seconds (which is 3.3 GB/sec 🤯). Further investigation reveals
+that at SF=100 our generator is actually IO bound (which is not the case for
+<code>dbgen</code> or <code>duckdb</code>) &ndash;
it creates data <strong>faster than can be written to an
SSD</strong>.
+When writing to <code>/dev/null</code> tpchgen generates the
entire dataset in 25 seconds
+(4 GB/s).</p>
+<table>
+<tr>
+<td><strong>Scale Factor</strong>
+</td>
+<td><strong>Time</strong>
+</td>
+<td><strong>Times faster than initial implementation</strong>
+</td>
+<td><strong>Times faster than optimized single
threaded</strong>
+</td>
+</tr>
+<tr>
+<td>1
+ </td>
+<td>0m1.369s
+ </td>
+<td>7.3x
+ </td>
+<td>3x
+ </td>
+</tr>
+<tr>
+<td>10
+ </td>
+<td>0m3.828s
+ </td>
+<td>22.6x
+ </td>
+<td>8.2x
+ </td>
+</tr>
+<tr>
+<td>100
+ </td>
+<td>0m32.615s
+ </td>
+<td>27.5x
+ </td>
+<td>10x
+ </td>
+</tr>
+<tr>
+<td>100 (to /dev/null)
+ </td>
+<td>0m25.088s
+ </td>
+<td>35.7x
+ </td>
+<td>13.1x
+ </td>
+</tr>
+</table>
+<p><strong>Table 4</strong>: tpchgen-cli (multithreaded)
performance measured with <code>time target/release/tpchgen-cli -s
$SCALE_FACTOR</code></p>
+<p>Using Rust and async streams, the data generator is also fully
streaming: memory
+use does not increase with increasing data size / scale
factors<sup>5</sup>. The DuckDB
+generator seems to <a
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator">require
far more memory</a> than is commonly available on
+developer laptops and memory use increases with scale factor. With
<code>tpchgen-cli</code>
+it is perfectly possible to create data for SF=10000 or larger on a machine
with
+16GB of memory (assuming sufficient storage capacity).</p>
+<h2>Direct to parquet</h2>
+<p>At this point, <code>tpchgen-cli</code> could very
quickly generate the TBL format.
+However, as described above, the TBL is annoying to work with,
because</p>
+<ol>
+<li>It has no header</li>
+<li>It is like a CSV but the delimiter is
<code>|</code></li>
+<li>Each line ends with an extra <code>|</code> delimiter
before the newline 🙄</li>
+<li>No system that we know can read them without additional
configuration.</li>
+</ol>
+<p>We next <a
href="https://github.com/clflushopt/tpchgen-rs/pull/54">added support for
CSV</a> generation (special thanks <a
href="https://github.com/niebayes">@niebayes</a> from
+Datalayers for finding and <a
href="https://github.com/clflushopt/tpchgen-rs/issues/73">fixing</a>
<a
href="https://github.com/clflushopt/tpchgen-rs/issues/65">bugs</a>)
which performs at the same
+speed as TBL. While CSV files are far more standard than TBL, they must still
be
+parsed prior to load and automatic type inference may not deduce the types
+needed for the TPC-H benchmarks (e.g. floating point vs Decimal).</p>
+<p>What would be far more useful is a typed, efficient columnar format
such as
+Apache Parquet which is supported by all modern query engines. So we <a
href="https://github.com/clflushopt/tpchgen-rs/pull/71">made</a> a
+<a href="https://crates.io/crates/tpchgen-arrow">tpchgen-arrow</a>
crate to create <a href="https://arrow.apache.org/">Apache
Arrow</a> arrays directly and then <a
href="https://github.com/clflushopt/tpchgen-rs/pull/61">a small
+300 line PR</a> to feed those arrays to the <a
href="https://crates.io/crates/parquet">Rust Parquet writer</a>, again
using
+tokio for parallelized but memory bound work.</p>
+<p>This approach was simple, fast and scalable, as shown in Table 5.
Even though
+creating Parquet files is significantly more computationally expensive than TBL
+or CSV, tpchgen-cli creates the full SF=100 parquet format dataset in less than
+45 seconds.</p>
+<table>
+<tr>
+<td><strong>Scale Factor</strong>
+</td>
+<td><strong>Time to generate Parquet</strong>
+</td>
+<td><strong>Speed compared to tbl generation</strong>
+</td>
+</tr>
+<tr>
+<td>1
+ </td>
+<td>0m1.649s
+ </td>
+<td>0.8x
+ </td>
+</tr>
+<tr>
+<td>10
+ </td>
+<td>0m5.643s
+ </td>
+<td>0.7x
+ </td>
+</tr>
+<tr>
+<td>100
+ </td>
+<td>0m45.243s
+ </td>
+<td>0.7x
+ </td>
+</tr>
+<tr>
+<td>100 (to /dev/null)
+ </td>
+<td>0m45.153s
+ </td>
+<td>0.5x
+ </td>
+</tr>
+</table>
+<p><strong>Table 5</strong>:
<code>tpchgen-cli</code> Parquet generation performance measured
with <code>time
+target/release/tpchgen-cli -s $SCALE_FACTOR
--format=parquet</code></p>
+<h1>Conclusion 👊🎤</h1>
+<p>In just a few days, with some fellow database nerds and the power of
Rust, we built something 10x better than what currently exists. We hope it
inspires more research
+into analytical systems using the TPC-H dataset and that people build awesome
+things with it. For example, Sean has already added <a
href="https://github.com/GlareDB/glaredb/pull/3549">on-demand generation of
+tables to GlareDB</a>. Please consider joining us and helping out at
+<a
href="https://github.com/clflushopt/tpchgen-rs">https://github.com/clflushopt/tpchgen-rs</a>.</p>
+<p>We met while working together on Apache DataFusion in various
capacities. If you
+are looking for a community of like minded people hacking on databases, we
+welcome you to <a
href="https://datafusion.apache.org/contributor-guide/communication.html">come
join us</a>. We are in the process of integrating this into
+DataFusion (see <a
href="https://github.com/apache/datafusion/issues/14608">apache/datafusion#14608</a>)
if you are interested in helping 🎣</p>
+<h1>About the Authors:</h1>
+<ul>
+<li><a href="https://www.linkedin.com/in/andrewalamb/">Andrew
Lamb</a> (<a href="https://github.com/alamb">@alamb</a>) is a
Staff Engineer at <a
href="https://www.influxdata.com/">InfluxData</a> and a PMC member of
<a href="https://datafusion.apache.org/">Apache DataFusion</a> and
<a href="https://arrow.apache.org/">Apache Arrow</a>.</li>
+<li>Achraf B (<a
href="https://github.com/clflushopt">@clflushopt</a>) is a Software
Engineer at <a href="https://optable.co/">Optable</a> where he
works on data infrastructure.</li>
+<li><a href="https://www.linkedin.com/in/scsmithr/">Sean
Smith</a> (<a
href="https://github.com/scsmithr">@scsmithr</a>) is the founder of
<a href="https://glaredb.com/">GlareDB</a> focused on building a
fast analytics database.</li>
+</ul>
+<!-- Footnotes themselves at the bottom. -->
+<h2>Footnotes</h2>
+<p><em>1</em>: Actual Time:
<code>30:35</code></p>
+<p><em>2</em>: It is possible to embed the dbgen code, which
appears to be the approach taken by DuckDB. This approach was tried in GlareDB
(<a
href="https://github.com/GlareDB/glaredb/pull/3313">GlareDB/glaredb#3313</a>),
but ultimately shelved given the amount of effort needed to adapt and isolate
the dbgen code.</p>
+<p><em>3</em>: It is pretty amazing to imagine the machine
required to generate SF300 that had 1.8TB (!!) of RAM</p>
+<p><em>4</em>: We tried to <a
href="https://github.com/clflushopt/tpchgen-rs/pull/34">use Rayon (see
discussion here)</a>, but could not easily keep memory bounded.</p>
+<p><em>5</em>: <code>tpchgen-cli</code> memory
usage is a function of the number of threads: each thread needs some buffer
space</p></content><category
term="blog"></category></entry><entry><title>Apache DataFusion Python 46.0.0
Released</title><link
href="https://datafusion.apache.org/blog/2025/03/30/datafusion-python-46.0.0"
rel="alternate"></link><published>2025-03-30T00:00:00+00:00</published><updated>2025-03-30T00:00:00+00:00</updated><author><na
[...]
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
diff --git a/output/images/fastest-tpch-generator/lamb-theory.png
b/output/images/fastest-tpch-generator/lamb-theory.png
new file mode 100644
index 0000000..2551ffa
Binary files /dev/null and
b/output/images/fastest-tpch-generator/lamb-theory.png differ
diff --git a/output/images/fastest-tpch-generator/parquet-performance.png
b/output/images/fastest-tpch-generator/parquet-performance.png
new file mode 100644
index 0000000..462c995
Binary files /dev/null and
b/output/images/fastest-tpch-generator/parquet-performance.png differ
diff --git a/output/images/fastest-tpch-generator/tbl-performance.png
b/output/images/fastest-tpch-generator/tbl-performance.png
new file mode 100644
index 0000000..2e64f11
Binary files /dev/null and
b/output/images/fastest-tpch-generator/tbl-performance.png differ
diff --git a/output/index.html b/output/index.html
index c10e06d..6d3d880 100644
--- a/output/index.html
+++ b/output/index.html
@@ -44,6 +44,53 @@
<p><i>Here you can find the latest updates from DataFusion and
related projects.</i></p>
+ <!-- Post -->
+ <div class="row">
+ <div class="callout">
+ <article class="post">
+ <header>
+ <div class="title">
+ <h1><a
href="/blog/2025/04/10/fastest-tpch-generator">tpchgen-rs World’s fastest open
source TPC-H data generator, written in Rust</a></h1>
+ <p>Posted on: Thu 10 April 2025 by Andrew Lamb, Achraf
B, and Sean Smith</p>
+ <p><!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+-->
+<style>
+/* Table borders */
+table, th, td {
+ border: 1px solid black;
+ border-collapse: collapse;
+}
+th, td {
+ padding: 3px;
+}
+</style>
+<p><strong>TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+ with
dbgen</strong>.</p>
+<p>3 members of the <a href="https://datafusion.apache.org/">Apache
DataFusion</a> community used Rust and open source
+development to build <a
href="https://github.com/clflushopt/tpchgen-rs">tpchgen-rs</a>, a fully open
TPC-H data generator over …</p></p>
+ <footer>
+ <ul class="actions">
+ <div style="text-align: right"><a
href="/blog/2025/04/10/fastest-tpch-generator" class="button medium">Continue
Reading</a></div>
+ </ul>
+ <ul class="stats">
+ </ul>
+ </footer>
+ </article>
+ </div>
+ </div>
<!-- Post -->
<div class="row">
<div class="callout">
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]