This is an automated email from the ASF dual-hosted git repository. alamb pushed a commit to branch site/tpch_data_generator in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
commit 503401b3fb4a7371e1575b122ef09f1cad36a9e0 Author: Andrew Lamb <[email protected]> AuthorDate: Fri Apr 4 15:53:37 2025 -0400 Complete port --- content/blog/2025-04-10-fastest-tpch-generator.md | 246 +++++++++++++++------- 1 file changed, 167 insertions(+), 79 deletions(-) diff --git a/content/blog/2025-04-10-fastest-tpch-generator.md b/content/blog/2025-04-10-fastest-tpch-generator.md index 84a83c8..ddb3f22 100644 --- a/content/blog/2025-04-10-fastest-tpch-generator.md +++ b/content/blog/2025-04-10-fastest-tpch-generator.md @@ -23,37 +23,60 @@ limitations under the License. {% endcomment %} --> +<style> +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} +</style> + We used Rust and open source development to build [tpchgen-rs](https://github.com/alamb/tpchgen-rs), a fully open TPCH data generator over 10x faster than any other such generator we know of. Authors: * [Andrew Lamb](https://www.linkedin.com/in/andrewalamb/) ([@alamb](https://github.com/alamb)) is a Staff Engineer at [InfluxData](https://www.influxdata.com/) and an [Apache DataFusion](https://datafusion.apache.org/) and Apache Arrow PMC member. * Achraf B ([@clflushopt](https://github.com/clflushopt)) is a Software Engineer at [Optable](https://optable.co/) where he works on data infrastructure. -* [Sean Smith](https://www.linkedin.com/in/scsmithr/) ([@scsmithr](https://github.com/scsmithr)) is the founder of [GlareDB](https://glaredb.com/) where focused on building a fast analytics database. - -It is now possible to create the TPCH SF=100 dataset in 72.23 seconds (1.4 GB/s 😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen` which takes 30 minutes[^1] (0.05GB/sec). On the same machine, it takes less than 2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet](https://parquet.apache.org/) format. - -Finally, it is convenient and efficient to run TPCH queries locally when testing analytical engines. 
* [Sean Smith](https://www.linkedin.com/in/scsmithr/) ([@scsmithr](https://github.com/scsmithr)) is the founder of [GlareDB](https://glaredb.com/) focused on building a fast analytics database.

It is now possible to create the TPCH SF=100 dataset in 72.23 seconds (1.4 GB/s 😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen` which takes 30 minutes[^1] (0.05 GB/sec). On the same machine, it takes less than 2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet](https://parquet.apache.org/) format.

Finally, it is convenient and efficient to run TPCH queries locally when testing analytical engines such as DataFusion.
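As a quick sanity check, the throughput quoted above is just dataset size divided by wall-clock time. A small Rust snippet makes the arithmetic explicit (the dataset size here is *implied* by the quoted rate and time, not independently measured):

```rust
fn main() {
    // Figures quoted in the text: SF=100 generated in 72.23 s at ~1.4 GB/s.
    let seconds = 72.23_f64;
    let gb_per_sec = 1.4_f64;

    // Implied raw dataset size: rate x time, roughly 101 GB of output.
    let implied_gb = seconds * gb_per_sec;
    assert!((implied_gb - 101.1).abs() < 0.5);

    // dbgen at ~0.05 GB/sec over the same data lands in the ~30 minute range.
    let dbgen_minutes = implied_gb / 0.05 / 60.0;
    println!("implied size: {implied_gb:.0} GB, dbgen: ~{dbgen_minutes:.0} min");
}
```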
<img src="/blog/images/fastest-tpch-generator/parquet-performance.png" alt="Time to create TPCH parquet dataset for Scale Factor 1, 10, 100 and 1000" width="80%" class="img-responsive">

**Figure 1**: Time to create TPCH dataset for Scale Factor (see below) 1, 10, 100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP VM. For Scale Factor (SF) 100 `tpchgen` takes 1 minute and 14 seconds and [DuckDB](https://duckdb.org/) takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10 minutes and 26 seconds and uses about 5 GB of RAM at peak; we could not measure DuckDB's time as it [requires 647 GB of RAM](https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator), more than the 88 GB that was available on our test machine. The testing methodology is in the [documentation](https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md).

This blog explains what TPCH is, how we ported the vintage C data generator to Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks of part-time work. We began this project so we can easily generate TPCH data in [Apache DataFusion] and [GlareDB].

[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/
[Apache DataFusion]: https://datafusion.apache.org/
[GlareDB]: https://glaredb.com/

# Try it for yourself

The tool is entirely open source under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by running the following commands after [installing Rust](https://www.rust-lang.org/tools/install):

```shell
$ cargo install tpchgen-cli

# create SF=1 in classic TBL format
$ tpchgen-cli -s 1

# create SF=10 in Parquet format
$ tpchgen-cli -s 10 --format=parquet
```

# What is TPCH / dbgen?
The popular [TPC-H](https://www.tpc.org/tpch/) benchmark (commonly referred to as TPCH) helps evaluate the performance of database systems on [OLAP](https://en.wikipedia.org/wiki/Online_analytical_processing) queries, the kind used to build BI dashboards.

TPCH has become a de facto standard for analytic systems. While there are [well known](https://www.vldb.org/pvldb/vol9/p204-leis.pdf) limitations, as the data and queries do not represent many real world use cases well, the majority of analytic database papers and industrial systems still use TPCH query performance benchmarks as a baseline. You will inevitably find multiple results for “`TPCH Performance <your favorite database>`” in any search engine.

The benchmark was created at a time when access to high performance analytical systems was not widespread, so the [Transaction Processing Performance Council](https://www.tpc.org/) defined a process of formal result verification. More recently, given the broad availability of free and open source database systems, it is common for users to run and verify TPCH performance themselves.

TPCH simulates a business environment with eight tables: `REGION`, `NATION`, `SUPPLIER`, `CUSTOMER`, `PART`, `PARTSUPP`, `ORDERS`, and `LINEITEM`. These tables are linked by foreign keys in a normalized schema representing a supply chain with parts, suppliers, customers and orders. The benchmark itself is 22 SQL queries containing joins, aggregations, and sorting operations.

The queries run against data created with <code>[dbgen](https://github.com/electrum/tpch-dbgen)</code>, a program written in a pre-[C99](https://en.wikipedia.org/wiki/C99) dialect, which generates data in a format called *TBL* (example in Figure 2). `dbgen` creates data for each of the 8 tables for a certain *Scale Factor*, commonly abbreviated as SF. Example Scale Factors and corresponding dataset sizes are shown in Table 1.
There is no theoretical upper bound on the Scale Factor.

```text
103|2844|845|3|23|40177.32|0.01|0.04|N|O|1996-09-11|1996-09-18|1996-09-26|NONE|FOB|ironic accou|
229|10540|801|6|29|42065.66|0.04|0.00|R|F|1994-01-14|1994-02-16|1994-01-22|NONE|FOB|uriously pending |
263|2396|649|1|22|28564.58|0.06|0.08|R|F|1994-08-24|1994-06-20|1994-09-09|NONE|FOB|efully express fo|
```

@@ -151,28 +194,43 @@ Despite the known limitations of the TPCH benchmark, it is so well known that it

1. You must find and compile a copy of the 15+ year old C program (for example [electrum/tpch-dbgen])
2. `dbgen` requires substantial time (Figure 3) and is not able to use more than one core.
3. It outputs TBL format, which typically requires loading into your database (for example, [here is how to do so] in Apache DataFusion) prior to query.
4. The implementation makes substantial assumptions about the operating environment, making it difficult to extend or embed into other systems.[^2]
[electrum/tpch-dbgen]: https://github.com/electrum/tpch-dbgen
[here is how to do so]: https://github.com/apache/datafusion/blob/507f6b6773deac69dd9d90dbe60831f5ea5abed1/datafusion/sqllogictest/test_files/tpch/create_tables.slt.part#L24-L124

<img src="/blog/images/fastest-tpch-generator/tbl-performance.png" alt="Time to generate TPCH data in TBL format" width="80%" class="img-responsive">

**Figure 3**: Time to generate TPCH data in TBL format. The default `tpchgen` is shown in blue. `tpchgen` restricted to a single core is shown in red. Unmodified `dbgen` is shown in green and `dbgen` modified to use `-O3` optimization level is shown in yellow.
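If you do want to consume the raw output directly, TBL rows (Figure 2) are simply pipe-delimited with a trailing `|`, so they are easy to read without a database loader. Here is a minimal, illustrative Rust sketch (the `parse_lineitem` helper and the choice of extracted fields are our own; the 16-column layout follows the TPC-H `LINEITEM` schema):

```rust
/// Parse a few fields out of one LINEITEM row in TBL format.
/// TBL rows are pipe-delimited and end with a trailing '|'.
fn parse_lineitem(line: &str) -> Option<(u64, f64, f64)> {
    let fields: Vec<&str> = line.trim_end_matches('|').split('|').collect();
    // LINEITEM has 16 columns in the TPC-H specification.
    if fields.len() != 16 {
        return None;
    }
    let orderkey: u64 = fields[0].parse().ok()?; // L_ORDERKEY
    let quantity: f64 = fields[4].parse().ok()?; // L_QUANTITY
    let extendedprice: f64 = fields[5].parse().ok()?; // L_EXTENDEDPRICE
    Some((orderkey, quantity, extendedprice))
}

fn main() {
    // First row from the Figure 2 example.
    let row = "103|2844|845|3|23|40177.32|0.01|0.04|N|O|1996-09-11|1996-09-18|1996-09-26|NONE|FOB|ironic accou|";
    assert_eq!(parse_lineitem(row), Some((103, 23.0, 40177.32)));
}
```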
`dbgen` is so inconvenient and takes so long that vendors often provide pre-loaded TPCH data, for example [Snowflake Sample Data](https://docs.snowflake.com/en/user-guide/sample-data-tpch), [DataBricks Sample datasets](https://docs.databricks.com/aws/en/discover/databricks-datasets) and [DuckDB Pre-Generated Data Sets](https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets).

In addition to pre-generated datasets, DuckDB also provides a [TPCH](https://duckdb.org/docs/stable/extensions/tpch.html) extension for generating TPCH datasets within DuckDB. This is so much easier to use than the current alternatives that it leads many researchers and other thought leaders to use DuckDB to evaluate new ideas. For example, [Wan Shen Lim](https://github.com/lmwnshn) explicitly [mentioned the ease of creating the TPCH dataset](https://github.com/apache/datafusion/issues/14373) as one reason the first student project of [CMU-799 Spring 2025](https://15799.courses.cs.cmu.edu/spring2025/) used DuckDB.

As beneficial as the DuckDB TPCH extension is, it is not ideal for several reasons:

1. Creates data in a proprietary format, which requires export to use in other systems.
2. Requires significant time (e.g. 17 minutes for Scale Factor 10).
3. Requires unnecessarily large amounts of memory (e.g. 71 GB for Scale Factor 10).

@@ -188,31 +246,52 @@ While Rust does allow unsafe access to memory (eliding bounds checking, for exam
<img src="/blog/images/fastest-tpch-generator/lamb-theory.png" alt="Lamb Theory on Evolution of Systems Languages" width="80%" class="img-responsive">

**Figure 4**: Lamb Theory of System Language Evolution from [Boston University MiDAS Fall 2024 (Data Systems Seminar)] [slides(pdf)], [recording]. Special thanks to [@KurtFehlhauer].

[Boston University MiDAS Fall 2024 (Data Systems Seminar)]: https://midas.bu.edu/seminar.html
[slides(pdf)]: https://midas.bu.edu/assets/slides/andrew_lamb_slides.pdf
[recording]: https://www.youtube.com/watch?v=CpnxuBwHbUc
[@KurtFehlhauer]: https://x.com/KurtFehlhauer

# How: The Journey

We did it together as a team in the open over the course of a few weeks. [Wan Shen Lim](https://github.com/lmwnshn) inspired the project by pointing out the benefits of [easy TPCH dataset creation](https://github.com/apache/datafusion/issues/14373) and [suggesting we check out a Java port on February 11, 2025](https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600).
Achraf made the [first commit a few days later](https://github.com/clflushopt/tpchgen-rs/commit/53d3402680422a15349ece0a7ea3c3f001018ba0) on February 16, [Andrew and Sean started helping on March 8, 2025](https://github.com/clflushopt/tpchgen-rs/commit/9bb386a4c55b8cf93ffac1b98f29b5da990ee79e), and we [released version 0.1](https://crates.io/crates/tpchgen/0.1.0) on March 30, 2025.

## Optimizing Single Threaded Performance

Achraf [completed the end to end conformance tests](https://github.com/clflushopt/tpchgen-rs/pull/16) to ensure correctness, along with an initial [cli check in](https://github.com/clflushopt/tpchgen-rs/pull/12), on March 15, 2025.

On a Macbook Pro M3 (Nov 2023), the initial performance numbers were actually slower than the Java implementation we ported from 😭. This wasn't surprising, since the focus of the first version was to get a byte-for-byte compatible port; we were aware of the performance shortcomings and how to approach them.
<table>
<tr>
<td><strong>Scale Factor</strong></td>
<td><strong>Time</strong></td>
</tr>
<tr>
@@ -236,10 +315,14 @@ On a Macbook Pro M3 (Nov 2023), the initial performance numbers were actually sl
</table>

**Table 2**: Performance of running [the initial tpchgen-cli](https://github.com/clflushopt/tpchgen-rs/pull/12), measured with `time target/release/tpchgen-cli -s $SCALE_FACTOR`

With this strong foundation we began optimizing the code using Rust's low level memory management to improve performance while retaining memory safety. We spent several days obsessing over low level details and implemented a textbook-like list of optimizations:

* [Avoiding startup overhead](https://github.com/clflushopt/tpchgen-rs/pull/19),

@@ -293,20 +376,31 @@ At the time of writing, single threaded performance is now 2.5x-2.7x faster than

## Multi-threading
Then we applied [Rust's fearless concurrency](https://doc.rust-lang.org/book/ch16-00-concurrency.html) – with a single [small PR (272 net new lines)](https://github.com/clflushopt/tpchgen-rs/commit/ab720a70cdc80a711f4a3dda6bac05445106f499) we updated the same memory safe code to run with multiple threads and consume bounded memory using [tokio for the thread scheduler](https://thenewstack.io/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/)[^4].

As shown in Table 4, with this change, `tpchgen-cli` generates the full SF=100 dataset in 32 seconds (which is 3.3 GB/sec 🤯). Further investigation reveals that at SF=100 our generator is actually IO bound (which is not the case for `dbgen` or `duckdb`) – it creates data **faster than can be written to an SSD**. When writing to `/dev/null` tpchgen generates the entire dataset in 25 seconds (4 GB/s).

<table>
<tr>
<td><strong>Scale Factor</strong></td>
<td><strong>Time</strong></td>
<td><strong>Times faster than initial implementation</strong></td>
<td><strong>Times faster than optimized single threaded</strong></td>
</tr>
<tr>
@@ -432,18 +526,12 @@ We met while working together on Apache DataFusion in various capacities. If you

<!-- Footnotes themselves at the bottom. -->

## Notes

[^1]: `30:35`
[^2]: It is possible to embed the dbgen code, which appears to be the approach taken by DuckDB. This approach was tried in GlareDB ([GlareDB/glaredb#3313](https://github.com/GlareDB/glaredb/pull/3313)), but ultimately shelved given the amount of effort needed to adapt and isolate the dbgen code.

[^3]: It is pretty amazing to imagine the machine required to generate SF300 that had 1.8 TB (!!) of RAM.

[^4]: We tried to [use Rayon (see discussion here)](https://github.com/clflushopt/tpchgen-rs/pull/34), but could not easily keep memory bounded.

[^5]: `tpchgen-cli` memory usage is a function of the number of threads: each thread needs some buffer space.

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
