[
https://issues.apache.org/jira/browse/ARROW-16100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Weston Pace updated ARROW-16100:
--------------------------------
Description:
When checking these against the known-good answers for scale factor 1, all of
the queries are off (two are close enough that they get past arrowbench's
pretty loose validation).
Looking at the TPC-H tools, the validation for dbgen is:
bq. b. Base Data Validation
bq. The base data set is produced using cmd_base_sf<n> where <n> is the scale
bq. factor to be generated. The resulting files will be produced in the current
bq. working directory. The generated files will be of the form <name>.tbl.<nnn>,
bq. where <name> is the name of one of the tables in the TPCH schema, and
bq. <nnn> identifies a particular data generation step.
bq.
bq. The file set produced by genbaserefdata.sh should match the <name>.tbl.<nnn>
bq. files found in the reference data set for the same scale factor.
And the data that this generator is producing does not conform to that. Even if
I sort the data by the columns that the dbgen (or DuckDB) produced data appears
to be sorted by, the data we get from our TPC-H generator still does not match.
We might want a mode where we produce random TPC-H-like data. But for
benchmarking we need a way to produce actual TPC-H-compliant data out of the
box (we can deal with rows in a shuffled order if we need to, but the content
of the data must be the same).
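Since only the content needs to match (not the row order), a multiset comparison of the delimited lines is enough to validate generated output against the reference files. A minimal sketch, not part of the generator; the file paths are hypothetical:

```python
from collections import Counter

def same_content(path_a: str, path_b: str) -> bool:
    """Compare two TPC-H .tbl files for identical rows, ignoring row order."""
    # Counter treats the rows as a multiset, so duplicate rows are handled
    # correctly (a plain set() comparison would silently collapse them).
    with open(path_a) as a, open(path_b) as b:
        return Counter(a.read().splitlines()) == Counter(b.read().splitlines())

# e.g. same_content("lineitem.tbl", "lineitem.tbl.1")
```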
Taking a look at [DuckDB's implementation of the random
seeds|https://github.com/duckdb/duckdb/blob/master/extension/tpch/dbgen/dbgen.cpp#L25-L74]
might help with finding a way to accomplish this.
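The determinism in that linked code comes from initializing each column's RNG stream with a fixed, hard-coded seed, so every run regenerates identical data. A toy analogue of the idea (the seed values and this RNG are made up for illustration; dbgen's actual generator and seeds differ):

```python
import random

# Hypothetical fixed per-column seeds; dbgen hard-codes one seed per
# column stream so regeneration is byte-identical across runs.
COLUMN_SEEDS = {"l_quantity": 8675309, "l_discount": 424242}

def generate_column(column: str, n: int) -> list[int]:
    # A fresh stream seeded the same way yields the same sequence every time.
    rng = random.Random(COLUMN_SEEDS[column])
    return [rng.randint(1, 50) for _ in range(n)]

# Regenerating produces the identical sequence:
assert generate_column("l_quantity", 10) == generate_column("l_quantity", 10)
```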
Here's an example of generating data with DuckDB (the first ten lines of
lineitem; it's the same each time I generate it):
{code}
> print(out, width = 500)
   l_orderkey l_partkey l_suppkey l_linenumber l_quantity l_extendedprice l_discount l_tax l_returnflag l_linestatus l_shipdate l_commitdate l_receiptdate l_shipinstruct l_shipmode l_comment
1  1 155190 7706 1 17 21168.23 0.04 0.02 N O 1996-03-13 1996-02-12 1996-03-22 DELIVER IN PERSON TRUCK egular courts above the
2  1 67310 7311 2 36 45983.16 0.09 0.06 N O 1996-04-12 1996-02-28 1996-04-20 TAKE BACK RETURN MAIL ly final dependencies: slyly bold
3  1 63700 3701 3 8 13309.60 0.10 0.02 N O 1996-01-29 1996-03-05 1996-01-31 TAKE BACK RETURN REG AIR riously. regular, express dep
4  1 2132 4633 4 28 28955.64 0.09 0.06 N O 1996-04-21 1996-03-30 1996-05-16 NONE AIR lites. fluffily even de
5  1 24027 1534 5 24 22824.48 0.10 0.04 N O 1996-03-30 1996-03-14 1996-04-01 NONE FOB pending foxes. slyly re
6  1 15635 638 6 32 49620.16 0.07 0.02 N O 1996-01-30 1996-02-07 1996-02-03 DELIVER IN PERSON MAIL arefully slyly ex
7  2 106170 1191 1 38 44694.46 0.00 0.05 N O 1997-01-28 1997-01-14 1997-02-02 TAKE BACK RETURN RAIL ven requests. deposits breach a
8  3 4297 1798 1 45 54058.05 0.06 0.00 R F 1994-02-02 1994-01-04 1994-02-23 NONE AIR ongside of the furiously brave acco
9  3 19036 6540 2 49 46796.47 0.10 0.00 R F 1993-11-09 1993-12-20 1993-11-24 TAKE BACK RETURN RAIL unusual accounts. eve
10 3 128449 3474 3 27 39890.88 0.06 0.07 A F 1994-01-16 1993-11-22 1994-01-23 DELIVER IN PERSON SHIP nal foxes wake.
{code}
And the first ten lines of lineitem I generated with the official dbgen:
{code}
head lineitem.tbl
1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER IN PERSON|TRUCK|egular courts above the|
1|67310|7311|2|36|45983.16|0.09|0.06|N|O|1996-04-12|1996-02-28|1996-04-20|TAKE BACK RETURN|MAIL|ly final dependencies: slyly bold |
1|63700|3701|3|8|13309.60|0.10|0.02|N|O|1996-01-29|1996-03-05|1996-01-31|TAKE BACK RETURN|REG AIR|riously. regular, express dep|
1|2132|4633|4|28|28955.64|0.09|0.06|N|O|1996-04-21|1996-03-30|1996-05-16|NONE|AIR|lites. fluffily even de|
1|24027|1534|5|24|22824.48|0.10|0.04|N|O|1996-03-30|1996-03-14|1996-04-01|NONE|FOB| pending foxes. slyly re|
1|15635|638|6|32|49620.16|0.07|0.02|N|O|1996-01-30|1996-02-07|1996-02-03|DELIVER IN PERSON|MAIL|arefully slyly ex|
2|106170|1191|1|38|44694.46|0.00|0.05|N|O|1997-01-28|1997-01-14|1997-02-02|TAKE BACK RETURN|RAIL|ven requests. deposits breach a|
3|4297|1798|1|45|54058.05|0.06|0.00|R|F|1994-02-02|1994-01-04|1994-02-23|NONE|AIR|ongside of the furiously brave acco|
3|19036|6540|2|49|46796.47|0.10|0.00|R|F|1993-11-09|1993-12-20|1993-11-24|TAKE BACK RETURN|RAIL| unusual accounts. eve|
3|128449|3474|3|27|39890.88|0.06|0.07|A|F|1994-01-16|1993-11-22|1994-01-23|DELIVER IN PERSON|SHIP|nal foxes wake. |
{code}
And the first ten lines of the validation file from the TPC-H tools:
{code}
head lineitem.tbl.1
1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER IN PERSON|TRUCK|egular courts above the|
1|67310|7311|2|36|45983.16|0.09|0.06|N|O|1996-04-12|1996-02-28|1996-04-20|TAKE BACK RETURN|MAIL|ly final dependencies: slyly bold |
1|63700|3701|3|8|13309.60|0.10|0.02|N|O|1996-01-29|1996-03-05|1996-01-31|TAKE BACK RETURN|REG AIR|riously. regular, express dep|
1|2132|4633|4|28|28955.64|0.09|0.06|N|O|1996-04-21|1996-03-30|1996-05-16|NONE|AIR|lites. fluffily even de|
1|24027|1534|5|24|22824.48|0.10|0.04|N|O|1996-03-30|1996-03-14|1996-04-01|NONE|FOB| pending foxes. slyly re|
1|15635|638|6|32|49620.16|0.07|0.02|N|O|1996-01-30|1996-02-07|1996-02-03|DELIVER IN PERSON|MAIL|arefully slyly ex|
2|106170|1191|1|38|44694.46|0.00|0.05|N|O|1997-01-28|1997-01-14|1997-02-02|TAKE BACK RETURN|RAIL|ven requests. deposits breach a|
3|4297|1798|1|45|54058.05|0.06|0.00|R|F|1994-02-02|1994-01-04|1994-02-23|NONE|AIR|ongside of the furiously brave acco|
3|19036|6540|2|49|46796.47|0.10|0.00|R|F|1993-11-09|1993-12-20|1993-11-24|TAKE BACK RETURN|RAIL| unusual accounts. eve|
3|128449|3474|3|27|39890.88|0.06|0.07|A|F|1994-01-16|1993-11-22|1994-01-23|DELIVER IN PERSON|SHIP|nal foxes wake. |
{code}
Note: you can generate this TPC-H data from
https://github.com/apache/arrow/pull/12769 with the following. The file
shuffling is needed because we can only write datasets from exec nodes (without
materializing entirely into memory), so we move the files around as if they
were single-file writes:
{code}
path <- tpch_dbgen_write(1, "some/path")

from_dataset_to_parquet <- function(path, scale_factor) {
  ds_files <- list.files(path, recursive = TRUE, full.names = TRUE)

  # we can only deal with a single parquet file in each partition this way
  if (!all(grepl("data-0.parquet$", ds_files))) {
    stop("At least one partition has more than one file")
  }

  ds_files_to <- gsub(
    "/data-0.parquet",
    paste0("_", format(scale_factor, scientific = FALSE), ".parquet"),
    ds_files
  )
  file.rename(ds_files, ds_files_to)

  # clean up empty folders; this might be a bit aggressive
  folders_to_remove <- gsub("/data-0.parquet", "", ds_files)
  unlink(folders_to_remove, recursive = TRUE)
}

from_dataset_to_parquet(path, 1)
{code}
was:
An umbrella issue for a number of issues I've run into with our TPC-H generator.
h2. We emit fixed_size_binary fields with nuls padding the strings.
Ideally we would either emit these as utf8 strings like the others, or we would
have a toggle to emit them as such (though see below about needing to strip the
nuls).
When I try to run these, I get a number of segfaults or hangs when running a
number of the TPC-H queries.
Additionally, even after converting these to utf8/string types, I also need to
strip out the nuls in order to actually query against them:
{code}
library(arrow, warn.conflicts = FALSE)
#> See arrow_info() for available features
library(dplyr, warn.conflicts = FALSE)
options(arrow.skip_nul = TRUE)
tab <- read_parquet("data_arrow_raw/nation_1.parquet", as_data_frame = FALSE)
tab
#> Table
#> 25 rows x 4 columns
#> $N_NATIONKEY <int32>
#> $N_NAME <fixed_size_binary[25]>
#> $N_REGIONKEY <int32>
#> $N_COMMENT <string>
# This will not work (Though is how the TPC-H queries are structured)
tab %>% filter(N_NAME == "JAPAN") %>% collect()
#> # A tibble: 0 × 4
#> # … with 4 variables: N_NATIONKEY <int>, N_NAME <fixed_size_binary<25>>,
#> # N_REGIONKEY <int>, N_COMMENT <chr>
# Instead, we need to create the nul padded string to do the comparison
japan_raw <- as.raw(
  c(0x4a, 0x41, 0x50, 0x41, 0x4e, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
)
# confirming this is the same thing as in the data
japan_raw == as.vector(tab$N_NAME)[[13]]
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
tab %>% filter(N_NAME == Scalar$create(japan_raw, type = fixed_size_binary(25))) %>% collect()
#> # A tibble: 1 × 4
#> N_NATIONKEY
#> <int>
#> 1 12
#> # … with 3 more variables: N_NAME <fixed_size_binary<25>>, N_REGIONKEY <int>,
#> # N_COMMENT <chr>
{code}
Here is the code I've been using to cast + strip these out after the fact:
{code}
library(arrow, warn.conflicts = FALSE)
options(arrow.skip_nul = TRUE)
options(arrow.use_altrep = FALSE)

tables <- arrowbench:::tpch_tables

for (table_name in tables) {
  message("Working on ", table_name)
  tab <- read_parquet(
    glue::glue("./data_arrow_raw/{table_name}_1.parquet"),
    as_data_frame = FALSE
  )

  for (col in tab$schema$fields) {
    if (inherits(col$type, "FixedSizeBinary")) {
      message("Rewriting ", col$name)
      tab[[col$name]] <- Array$create(as.vector(tab[[col$name]]$cast(string())))
    }
  }

  write_parquet(tab, glue::glue("./data/{table_name}_1.parquet"))
}
{code}
h2. The data does not produce correct answers
When checking these against the known-good answers for scale factor 1, all of
the queries are off (two are close enough that they get past arrowbench's
pretty loose validation).
Looking at the TPC-H tools, the validation for dbgen is:
bq. b. Base Data Validation
bq. The base data set is produced using cmd_base_sf<n> where <n> is the scale
bq. factor to be generated. The resulting files will be produced in the current
bq. working directory. The generated files will be of the form <name>.tbl.<nnn>,
bq. where <name> is the name of one of the tables in the TPCH schema, and
bq. <nnn> identifies a particular data generation step.
bq.
bq. The file set produced by genbaserefdata.sh should match the <name>.tbl.<nnn>
bq. files found in the reference data set for the same scale factor.
And the data that this generator is producing does not conform to that. Even if
I sort the data by the columns that the dbgen (or DuckDB) produced data appears
to be sorted by, the data we get from our TPC-H generator still does not match.
We might want a mode where we produce random TPC-H-like data. But for
benchmarking we need a way to produce actual TPC-H-compliant data out of the
box (we can deal with rows in a shuffled order if we need to, but the content
of the data must be the same).
Taking a look at [DuckDB's implementation of the random
seeds|https://github.com/duckdb/duckdb/blob/master/extension/tpch/dbgen/dbgen.cpp#L25-L74]
might help with finding a way to accomplish this.
Here's an example of generating data with DuckDB (the first ten lines of
lineitem; it's the same each time I generate it):
{code}
> print(out, width = 500)
   l_orderkey l_partkey l_suppkey l_linenumber l_quantity l_extendedprice l_discount l_tax l_returnflag l_linestatus l_shipdate l_commitdate l_receiptdate l_shipinstruct l_shipmode l_comment
1  1 155190 7706 1 17 21168.23 0.04 0.02 N O 1996-03-13 1996-02-12 1996-03-22 DELIVER IN PERSON TRUCK egular courts above the
2  1 67310 7311 2 36 45983.16 0.09 0.06 N O 1996-04-12 1996-02-28 1996-04-20 TAKE BACK RETURN MAIL ly final dependencies: slyly bold
3  1 63700 3701 3 8 13309.60 0.10 0.02 N O 1996-01-29 1996-03-05 1996-01-31 TAKE BACK RETURN REG AIR riously. regular, express dep
4  1 2132 4633 4 28 28955.64 0.09 0.06 N O 1996-04-21 1996-03-30 1996-05-16 NONE AIR lites. fluffily even de
5  1 24027 1534 5 24 22824.48 0.10 0.04 N O 1996-03-30 1996-03-14 1996-04-01 NONE FOB pending foxes. slyly re
6  1 15635 638 6 32 49620.16 0.07 0.02 N O 1996-01-30 1996-02-07 1996-02-03 DELIVER IN PERSON MAIL arefully slyly ex
7  2 106170 1191 1 38 44694.46 0.00 0.05 N O 1997-01-28 1997-01-14 1997-02-02 TAKE BACK RETURN RAIL ven requests. deposits breach a
8  3 4297 1798 1 45 54058.05 0.06 0.00 R F 1994-02-02 1994-01-04 1994-02-23 NONE AIR ongside of the furiously brave acco
9  3 19036 6540 2 49 46796.47 0.10 0.00 R F 1993-11-09 1993-12-20 1993-11-24 TAKE BACK RETURN RAIL unusual accounts. eve
10 3 128449 3474 3 27 39890.88 0.06 0.07 A F 1994-01-16 1993-11-22 1994-01-23 DELIVER IN PERSON SHIP nal foxes wake.
{code}
And the first ten lines of lineitem I generated with the official dbgen:
{code}
head lineitem.tbl
1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER IN PERSON|TRUCK|egular courts above the|
1|67310|7311|2|36|45983.16|0.09|0.06|N|O|1996-04-12|1996-02-28|1996-04-20|TAKE BACK RETURN|MAIL|ly final dependencies: slyly bold |
1|63700|3701|3|8|13309.60|0.10|0.02|N|O|1996-01-29|1996-03-05|1996-01-31|TAKE BACK RETURN|REG AIR|riously. regular, express dep|
1|2132|4633|4|28|28955.64|0.09|0.06|N|O|1996-04-21|1996-03-30|1996-05-16|NONE|AIR|lites. fluffily even de|
1|24027|1534|5|24|22824.48|0.10|0.04|N|O|1996-03-30|1996-03-14|1996-04-01|NONE|FOB| pending foxes. slyly re|
1|15635|638|6|32|49620.16|0.07|0.02|N|O|1996-01-30|1996-02-07|1996-02-03|DELIVER IN PERSON|MAIL|arefully slyly ex|
2|106170|1191|1|38|44694.46|0.00|0.05|N|O|1997-01-28|1997-01-14|1997-02-02|TAKE BACK RETURN|RAIL|ven requests. deposits breach a|
3|4297|1798|1|45|54058.05|0.06|0.00|R|F|1994-02-02|1994-01-04|1994-02-23|NONE|AIR|ongside of the furiously brave acco|
3|19036|6540|2|49|46796.47|0.10|0.00|R|F|1993-11-09|1993-12-20|1993-11-24|TAKE BACK RETURN|RAIL| unusual accounts. eve|
3|128449|3474|3|27|39890.88|0.06|0.07|A|F|1994-01-16|1993-11-22|1994-01-23|DELIVER IN PERSON|SHIP|nal foxes wake. |
{code}
And the first ten lines of the validation file from the TPC-H tools:
{code}
head lineitem.tbl.1
1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER IN PERSON|TRUCK|egular courts above the|
1|67310|7311|2|36|45983.16|0.09|0.06|N|O|1996-04-12|1996-02-28|1996-04-20|TAKE BACK RETURN|MAIL|ly final dependencies: slyly bold |
1|63700|3701|3|8|13309.60|0.10|0.02|N|O|1996-01-29|1996-03-05|1996-01-31|TAKE BACK RETURN|REG AIR|riously. regular, express dep|
1|2132|4633|4|28|28955.64|0.09|0.06|N|O|1996-04-21|1996-03-30|1996-05-16|NONE|AIR|lites. fluffily even de|
1|24027|1534|5|24|22824.48|0.10|0.04|N|O|1996-03-30|1996-03-14|1996-04-01|NONE|FOB| pending foxes. slyly re|
1|15635|638|6|32|49620.16|0.07|0.02|N|O|1996-01-30|1996-02-07|1996-02-03|DELIVER IN PERSON|MAIL|arefully slyly ex|
2|106170|1191|1|38|44694.46|0.00|0.05|N|O|1997-01-28|1997-01-14|1997-02-02|TAKE BACK RETURN|RAIL|ven requests. deposits breach a|
3|4297|1798|1|45|54058.05|0.06|0.00|R|F|1994-02-02|1994-01-04|1994-02-23|NONE|AIR|ongside of the furiously brave acco|
3|19036|6540|2|49|46796.47|0.10|0.00|R|F|1993-11-09|1993-12-20|1993-11-24|TAKE BACK RETURN|RAIL| unusual accounts. eve|
3|128449|3474|3|27|39890.88|0.06|0.07|A|F|1994-01-16|1993-11-22|1994-01-23|DELIVER IN PERSON|SHIP|nal foxes wake. |
{code}
Note: you can generate this TPC-H data from
https://github.com/apache/arrow/pull/12769 with the following. The file
shuffling is needed because we can only write datasets from exec nodes (without
materializing entirely into memory), so we move the files around as if they
were single-file writes:
{code}
path <- tpch_dbgen_write(1, "some/path")

from_dataset_to_parquet <- function(path, scale_factor) {
  ds_files <- list.files(path, recursive = TRUE, full.names = TRUE)

  # we can only deal with a single parquet file in each partition this way
  if (!all(grepl("data-0.parquet$", ds_files))) {
    stop("At least one partition has more than one file")
  }

  ds_files_to <- gsub(
    "/data-0.parquet",
    paste0("_", format(scale_factor, scientific = FALSE), ".parquet"),
    ds_files
  )
  file.rename(ds_files, ds_files_to)

  # clean up empty folders; this might be a bit aggressive
  folders_to_remove <- gsub("/data-0.parquet", "", ds_files)
  unlink(folders_to_remove, recursive = TRUE)
}

from_dataset_to_parquet(path, 1)
{code}
> [C++] Add ability to generate TPC-H data that matches the reference data
> exactly
> --------------------------------------------------------------------------------
>
> Key: ARROW-16100
> URL: https://issues.apache.org/jira/browse/ARROW-16100
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Jonathan Keane
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)