This is an automated email from the ASF dual-hosted git repository. alamb pushed a commit to branch site/tpch_data_generator in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
commit 503401b3fb4a7371e1575b122ef09f1cad36a9e0 Author: Andrew Lamb <[email protected]> AuthorDate: Fri Apr 4 15:53:37 2025 -0400 Complete port --- content/blog/2025-04-10-fastest-tpch-generator.md | 246 +++++++++++++++------- 1 file changed, 167 insertions(+), 79 deletions(-) diff --git a/content/blog/2025-04-10-fastest-tpch-generator.md b/content/blog/2025-04-10-fastest-tpch-generator.md index 84a83c8..ddb3f22 100644 --- a/content/blog/2025-04-10-fastest-tpch-generator.md +++ b/content/blog/2025-04-10-fastest-tpch-generator.md @@ -23,37 +23,60 @@ limitations under the License. {% endcomment %} --> +<style> +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} +</style> + We used Rust and open source development to build [tpchgen-rs](https://github.com/alamb/tpchgen-rs), a fully open TPCH data generator over 10x faster than any other such generator we know of. Authors: * [Andrew Lamb](https://www.linkedin.com/in/andrewalamb/) ([@alamb](https://github.com/alamb)) is a Staff Engineer at [InfluxData](https://www.influxdata.com/) and an [Apache DataFusion](https://datafusion.apache.org/) and Apache Arrow PMC member. * Achraf B ([@clflushopt](https://github.com/clflushopt)) is a Software Engineer at [Optable](https://optable.co/) where he works on data infrastructure. -* [Sean Smith](https://www.linkedin.com/in/scsmithr/) ([@scsmithr](https://github.com/scsmithr)) is the founder of [GlareDB](https://glaredb.com/) where focused on building a fast analytics database. - -It is now possible to create the TPCH SF=100 dataset in 72.23 seconds (1.4 GB/s 😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen` which takes 30 minutes[^1] (0.05GB/sec). On the same machine, it takes less than 2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet](https://parquet.apache.org/) format. - -Finally, it is convenient and efficient to run TPCH queries locally when testing analytical engines. 
* [Sean Smith](https://www.linkedin.com/in/scsmithr/) ([@scsmithr](https://github.com/scsmithr)) is the founder of [GlareDB](https://glaredb.com/) focused on building a fast analytics database.

It is now possible to create the TPCH SF=100 dataset in 72.23 seconds (1.4 GB/s 😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen` which takes 30 minutes[^1] (0.05 GB/sec). On the same machine, it takes less than 2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet](https://parquet.apache.org/) format.

Finally, it is convenient and efficient to run TPCH queries locally when testing analytical engines such as DataFusion.
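As a quick sanity check, the throughput quoted above is just dataset size divided by wall-clock time. A small Rust snippet makes the arithmetic explicit (the dataset size here is *implied* by the quoted rate and time, not independently measured):

```rust
fn main() {
    // Figures quoted in the text: SF=100 generated in 72.23 s at ~1.4 GB/s.
    let seconds = 72.23_f64;
    let gb_per_sec = 1.4_f64;

    // Implied raw dataset size: rate x time, roughly 101 GB of output.
    let implied_gb = seconds * gb_per_sec;
    assert!((implied_gb - 101.1).abs() < 0.5);

    // dbgen at ~0.05 GB/sec over the same data lands in the ~30 minute range.
    let dbgen_minutes = implied_gb / 0.05 / 60.0;
    println!("implied size: {implied_gb:.0} GB, dbgen: ~{dbgen_minutes:.0} min");
}
```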
<img src="/blog/images/fastest-tpch-generator/parquet-performance.png" alt="Time to create TPCH parquet dataset for Scale Factor 1, 10, 100 and 1000" width="80%" class="img-responsive">

**Figure 1**: Time to create TPCH dataset for Scale Factor (see below) 1, 10, 100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP VM. For Scale Factor (SF) 100 `tpchgen` takes 1 minute and 14 seconds and [DuckDB](https://duckdb.org/) takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10 minutes and 26 seconds and uses about 5 GB of RAM at peak; we could not measure DuckDB's time as it [requires 647 GB of RAM](https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator), more than the 88 GB that was available on our test machine. The testing methodology is in the [documentation](https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md).

This blog explains what TPCH is, how we ported the vintage C data generator to Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks of part-time work. We began this project so we can easily generate TPCH data in [Apache DataFusion] and [GlareDB].

[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/
[Apache DataFusion]: https://datafusion.apache.org/
[GlareDB]: https://glaredb.com/

# Try it for yourself

The tool is entirely open source under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by running the following commands after [installing Rust](https://www.rust-lang.org/tools/install):

```shell
$ cargo install tpchgen-cli

# create SF=1 in classic TBL format
$ tpchgen-cli -s 1

# create SF=10 in Parquet format
$ tpchgen-cli -s 10 --format=parquet
```

# What is TPCH / dbgen?
The popular [TPC-H](https://www.tpc.org/tpch/) benchmark (commonly referred to as TPCH) helps evaluate the performance of database systems on [OLAP](https://en.wikipedia.org/wiki/Online_analytical_processing) queries, the kind used to build BI dashboards.

TPCH has become a de facto standard for analytic systems. While there are [well known](https://www.vldb.org/pvldb/vol9/p204-leis.pdf) limitations, as the data and queries do not represent many real world use cases well, the majority of analytic database papers and industrial systems still use TPCH query performance benchmarks as a baseline. You will inevitably find multiple results for “`TPCH Performance <your favorite database>`” in any search engine.

The benchmark was created at a time when access to high performance analytical systems was not widespread, so the [Transaction Processing Performance Council](https://www.tpc.org/) defined a process of formal result verification. More recently, given the broad availability of free and open source database systems, it is common for users to run and verify TPCH performance themselves.

TPCH simulates a business environment with eight tables: `REGION`, `NATION`, `SUPPLIER`, `CUSTOMER`, `PART`, `PARTSUPP`, `ORDERS`, and `LINEITEM`. These tables are linked by foreign keys in a normalized schema representing a supply chain with parts, suppliers, customers and orders. The benchmark itself is 22 SQL queries containing joins, aggregations, and sorting operations.

The queries run against data created with <code>[dbgen](https://github.com/electrum/tpch-dbgen)</code>, a program written in a pre-[C99](https://en.wikipedia.org/wiki/C99) dialect, which generates data in a format called *TBL* (example in Figure 2). `dbgen` creates data for each of the 8 tables for a certain *Scale Factor*, commonly abbreviated as SF. Example Scale Factors and corresponding dataset sizes are shown in Table 1.
There is no theoretical upper bound on the Scale Factor.

```text
103|2844|845|3|23|40177.32|0.01|0.04|N|O|1996-09-11|1996-09-18|1996-09-26|NONE|FOB|ironic accou|
229|10540|801|6|29|42065.66|0.04|0.00|R|F|1994-01-14|1994-02-16|1994-01-22|NONE|FOB|uriously pending |
263|2396|649|1|22|28564.58|0.06|0.08|R|F|1994-08-24|1994-06-20|1994-09-09|NONE|FOB|efully express fo|
```

@@ -151,28 +194,43 @@ Despite the known limitations of the TPCH benchmark, it is so well known that it

1. You must find and compile a copy of the 15+ year old C program (for example [electrum/tpch-dbgen])
2. `dbgen` requires substantial time (Figure 3) and is not able to use more than one core.
3. It outputs TBL format, which typically requires loading into your database (for example, [here is how to do so] in Apache DataFusion) prior to query.
4. The implementation makes substantial assumptions about the operating environment, making it difficult to extend or embed into other systems.[^2]
[electrum/tpch-dbgen]: https://github.com/electrum/tpch-dbgen
[here is how to do so]: https://github.com/apache/datafusion/blob/507f6b6773deac69dd9d90dbe60831f5ea5abed1/datafusion/sqllogictest/test_files/tpch/create_tables.slt.part#L24-L124

<img src="/blog/images/fastest-tpch-generator/tbl-performance.png" alt="Time to generate TPCH data in TBL format" width="80%" class="img-responsive">

**Figure 3**: Time to generate TPCH data in TBL format. The default `tpchgen` is shown in blue. `tpchgen` restricted to a single core is shown in red. Unmodified `dbgen` is shown in green and `dbgen` modified to use `-O3` optimization level is shown in yellow.
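If you do want to consume the raw output directly, TBL rows (Figure 2) are simply pipe-delimited with a trailing `|`, so they are easy to read without a database loader. Here is a minimal, illustrative Rust sketch (the `parse_lineitem` helper and the choice of extracted fields are our own; the 16-column layout follows the TPC-H `LINEITEM` schema):

```rust
/// Parse a few fields out of one LINEITEM row in TBL format.
/// TBL rows are pipe-delimited and end with a trailing '|'.
fn parse_lineitem(line: &str) -> Option<(u64, f64, f64)> {
    let fields: Vec<&str> = line.trim_end_matches('|').split('|').collect();
    // LINEITEM has 16 columns in the TPC-H specification.
    if fields.len() != 16 {
        return None;
    }
    let orderkey: u64 = fields[0].parse().ok()?; // L_ORDERKEY
    let quantity: f64 = fields[4].parse().ok()?; // L_QUANTITY
    let extendedprice: f64 = fields[5].parse().ok()?; // L_EXTENDEDPRICE
    Some((orderkey, quantity, extendedprice))
}

fn main() {
    // First row from the Figure 2 example.
    let row = "103|2844|845|3|23|40177.32|0.01|0.04|N|O|1996-09-11|1996-09-18|1996-09-26|NONE|FOB|ironic accou|";
    assert_eq!(parse_lineitem(row), Some((103, 23.0, 40177.32)));
}
```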
`dbgen` is so inconvenient and takes so long that vendors often provide pre-loaded TPCH data, for example [Snowflake Sample Data](https://docs.snowflake.com/en/user-guide/sample-data-tpch), [DataBricks Sample datasets](https://docs.databricks.com/aws/en/discover/databricks-datasets) and [DuckDB Pre-Generated Data Sets](https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets).

In addition to pre-generated datasets, DuckDB also provides a [TPCH](https://duckdb.org/docs/stable/extensions/tpch.html) extension for generating TPCH datasets within DuckDB. This is so much easier to use than the current alternatives that it leads many researchers and other thought leaders to use DuckDB to evaluate new ideas. For example, [Wan Shen Lim](https://github.com/lmwnshn) explicitly [mentioned the ease of creating the TPCH dataset](https://github.com/apache/datafusion/issues/14373) as one reason the first student project of [CMU-799 Spring 2025](https://15799.courses.cs.cmu.edu/spring2025/) used DuckDB.

As beneficial as the DuckDB TPCH extension is, it is not ideal for several reasons:

1. Creates data in a proprietary format, which requires export to use in other systems.
2. Requires significant time (e.g. 17 minutes for Scale Factor 10).
3. Requires unnecessarily large amounts of memory (e.g. 71 GB for Scale Factor 10).

@@ -188,31 +246,52 @@ While Rust does allow unsafe access to memory (eliding bounds checking, for exam
<img src="/blog/images/fastest-tpch-generator/lamb-theory.png" alt="Lamb Theory on Evolution of Systems Languages" width="80%" class="img-responsive">

**Figure 4**: Lamb Theory of System Language Evolution from [Boston University MiDAS Fall 2024 (Data Systems Seminar)] [slides(pdf)], [recording]. Special thanks to [@KurtFehlhauer].

[Boston University MiDAS Fall 2024 (Data Systems Seminar)]: https://midas.bu.edu/seminar.html
[slides(pdf)]: https://midas.bu.edu/assets/slides/andrew_lamb_slides.pdf
[recording]: https://www.youtube.com/watch?v=CpnxuBwHbUc
[@KurtFehlhauer]: https://x.com/KurtFehlhauer

# How: The Journey

We did it together as a team in the open over the course of a few weeks. [Wan Shen Lim](https://github.com/lmwnshn) inspired the project by pointing out the benefits of [easy TPCH dataset creation](https://github.com/apache/datafusion/issues/14373) and [suggesting we check out a Java port on February 11, 2025](https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600).
Achraf made the [first commit a few days later](https://github.com/clflushopt/tpchgen-rs/commit/53d3402680422a15349ece0a7ea3c3f001018ba0) on February 16, [Andrew and Sean started helping on March 8, 2025](https://github.com/clflushopt/tpchgen-rs/commit/9bb386a4c55b8cf93ffac1b98f29b5da990ee79e), and we [released version 0.1](https://crates.io/crates/tpchgen/0.1.0) on March 30, 2025.

## Optimizing Single Threaded Performance

Achraf [completed the end to end conformance tests](https://github.com/clflushopt/tpchgen-rs/pull/16) to ensure correctness, along with an initial [cli check in](https://github.com/clflushopt/tpchgen-rs/pull/12), on March 15, 2025.

On a Macbook Pro M3 (Nov 2023), the initial performance numbers were actually slower than the Java implementation we ported from 😭. This wasn't surprising, since the focus of the first version was to get a byte-for-byte compatible port; we were aware of the performance shortcomings and how to approach them.
<table>
<tr>
<td><strong>Scale Factor</strong></td>
<td><strong>Time</strong></td>
</tr>
<tr>
@@ -236,10 +315,14 @@ On a Macbook Pro M3 (Nov 2023), the initial performance numbers were actually sl
</table>

**Table 2**: Performance of running [the initial tpchgen-cli](https://github.com/clflushopt/tpchgen-rs/pull/12), measured with `time target/release/tpchgen-cli -s $SCALE_FACTOR`

With this strong foundation we began optimizing the code using Rust's low level memory management to improve performance while retaining memory safety. We spent several days obsessing over low level details and implemented a textbook-like list of optimizations:

* [Avoiding startup overhead](https://github.com/clflushopt/tpchgen-rs/pull/19),

@@ -293,20 +376,31 @@ At the time of writing, single threaded performance is now 2.5x-2.7x faster than

## Multi-threading
Then we applied [Rust's fearless concurrency](https://doc.rust-lang.org/book/ch16-00-concurrency.html) – with a single [small PR (272 net new lines)](https://github.com/clflushopt/tpchgen-rs/commit/ab720a70cdc80a711f4a3dda6bac05445106f499) we updated the same memory safe code to run with multiple threads and consume bounded memory using [tokio for the thread scheduler](https://thenewstack.io/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/)[^4].

As shown in Table 4, with this change, `tpchgen-cli` generates the full SF=100 dataset in 32 seconds (which is 3.3 GB/sec 🤯). Further investigation reveals that at SF=100 our generator is actually IO bound (which is not the case for `dbgen` or `duckdb`) – it creates data **faster than can be written to an SSD**. When writing to `/dev/null` tpchgen generates the entire dataset in 25 seconds (4 GB/s).

<table>
<tr>
<td><strong>Scale Factor</strong></td>
<td><strong>Time</strong></td>
<td><strong>Times faster than initial implementation</strong></td>
<td><strong>Times faster than optimized single threaded</strong></td>
</tr>
<tr>
@@ -432,18 +526,12 @@ We met while working together on Apache DataFusion in various capacities. If you

<!-- Footnotes themselves at the bottom. -->

## Notes

[^1]: `30:35`
[^2]: It is possible to embed the dbgen code, which appears to be the approach taken by DuckDB. This approach was tried in GlareDB ([GlareDB/glaredb#3313](https://github.com/GlareDB/glaredb/pull/3313)), but ultimately shelved given the amount of effort needed to adapt and isolate the dbgen code.

[^3]: It is pretty amazing to imagine the machine required to generate SF300 that had 1.8 TB (!!) of RAM.

[^4]: We tried to [use Rayon (see discussion here)](https://github.com/clflushopt/tpchgen-rs/pull/34), but could not easily keep memory bounded.

[^5]: `tpchgen-cli` memory usage is a function of the number of threads: each thread needs some buffer space.

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
