[
https://issues.apache.org/jira/browse/ARROW-16100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519229#comment-17519229
]
Jonathan Keane commented on ARROW-16100:
----------------------------------------
Agreed. Either of those approaches sound like they might be fruitful! It's a
bit hidden above, but
https://github.com/duckdb/duckdb/blob/master/extension/tpch/dbgen/dbgen.cpp#L25-L74
might also be helpful and
https://github.com/duckdb/duckdb/blob/31db2bada7548a700d26f61665828dd127abd2a8/extension/tpch/dbgen/dbgen.cpp#L533-L540
is where it looks like it's set|unset.
Also note that the answers for queries that include anything besides
aggregations of numerics will have randomness with random data (so we can't
quite count on the answers being close enough in the limit with enough samples)
{code}
> library(arrowbench)
> library(dplyr, warn.conflicts=FALSE)
> options(arrow.skip_nul = TRUE)
>
> query_id <- 2
>
> input_funcs <- get_input_func(
+ engine = "arrow",
+ scale_factor = 1,
+ query_id = query_id,
+ format = "parquet"
+ )
>
> result <- get_query_func(query_id, "arrow")(input_funcs)
>
> ans <- tpch_answer(1, query_id = query_id)
>
> waldo::compare(result, ans)
old vs new
s_acctbal s_name n_name p_partkey
p_mfgr s_address s_phone
s_comment
- old[1, ] 9973.06 Supplie#r000009288 UNITED KINGDOM 79287
Manufacturer#4 RiRNXmGRQdw1lbm5hcjV94 33-645-115-8290 tions
snooze permanently final sentiments --daring, close pinto beans should have to
eat
+ new[1, ] 9938.53 Supplier#000005359 UNITED KINGDOM 185358
Manufacturer#4 bgxj2K0w1kJvxYl5mhCfou,W 33-429-790-6131
uriously regular requests hag
- old[2, ] 9955.58 Supplie#r000008234 GERMANY 88233
Manufacturer#5 yg83WC5I YIhQPeTMQmXa 17-560-538-8734 s
promise; silent pinto beans may use? furious theodolites wake ruthlessly duri
+ new[2, ] 9937.84 Supplier#000005969 ROMANIA 108438
Manufacturer#1 rdnmd9c8EG1EIAYY3LPVa4yUNx6OwyVaQ 29-520-692-3537 efully
express instructions. regular requests against the slyly fin
- old[3, ] 9946.47 Supplie#r000001975 RUSSIA 161974
Manufacturer#1 wdsf5tQIrJKBNmu8I eY pmAv akzRDSydc 32-937-928-5861 the
sometimes permanent frets must haggle som
+ new[3, ] 9936.22 Supplier#000005250 UNITED KINGDOM 249
Manufacturer#4 qX AB0vP8mJEWeBuY9jri 33-320-228-2957 etect
about the furiously final accounts. slyly ironic pinto beans sleep inside the
furiously
- old[4, ] 9938.94 Supplie#r000006635 FRANCE 41626
Manufacturer#5 kIxC63UH H0et,bSwBA6O81PJ 16-750-637-3467 e
through the idly daring warhorses --permanent patterns serve; escapades over th
+ new[4, ] 9923.77 Supplier#000002324 GERMANY 29821
Manufacturer#4 uXcnR7tv87dG 17-779-299-1839 ackages
boost blithely. blithely regular deposits c
- old[5, ] 9916.93 Supplie#r000004383 ROMANIA 84382
Manufacturer#5 tkEkQY F4J4JDu2WRYDk,owFCP7Qh1eOgHWSk9 29-283-925-2487 telets
could poach busy waters! bl
+ new[5, ] 9871.22 Supplier#000006373 GERMANY 43868
Manufacturer#5 iSLO35z7Ae 17-813-485-8637 etect
blithely bold asymptotes. fluffily ironic platelets wake furiously; blit
- old[6, ] 9884.91 Supplie#r000009799 RUSSIA 24794
Manufacturer#5 YvXpPGXYAg2TJLZJDW3LtRqbv97RzF6As 32-406-413-9944 the
always furious pains
+ new[6, ] 9870.78 Supplier#000001286 GERMANY 81285
Manufacturer#2 3gq0mZLHI5OTM6 tBYmLTHZaulCYnlECzQ7nj 17-516-924-4574
regular accounts. furiously unusual courts above the fi
- old[7, ] 9884.91 Supplie#r000009799 RUSSIA 34792
Manufacturer#2 YvXpPGXYAg2TJLZJDW3LtRqbv97RzF6As 32-406-413-9944 the
always furious pains
+ new[7, ] 9870.78 Supplier#000001286 GERMANY 181285
Manufacturer#4 3gq0mZLHI5OTM6 tBYmLTHZaulCYnlECzQ7nj 17-516-924-4574
regular accounts. furiously unusual courts above the fi
- old[8, ] 9836.42 Supplie#r000001571 UNITED KINGDOM 39067
Manufacturer#3 8lAIPuLFKWVfDQO LXavjzxLkVNF0sw8a6 33-788-833-6417 dogged
orbits beyond the escapades shall
+ new[8, ] 9852.52 Supplier#000008973 RUSSIA 18972
Manufacturer#2 zVfUT3Np22kUC05tYWHBotaR 32-188-594-7038 rns
wake final foxes. carefully unusual depende
- old[9, ] 9836.42 Supplie#r000001571 UNITED KINGDOM 134031
Manufacturer#2 8lAIPuLFKWVfDQO LXavjzxLkVNF0sw8a6 33-788-833-6417 dogged
orbits beyond the escapades shall
+ new[9, ] 9847.83 Supplier#000008097 RUSSIA 130557
Manufacturer#2 veMRTQBmUResNvfD3 32-375-640-3593 the
special excuses. silent sentiments serve carefully final ac
- old[10, ] 9781.88 Supplie#r000005361 ROMANIA 2860
Manufacturer#4 lZYHncM3vjsXc3PPqaFlTO 29-345-620-9471 players
--daring, careful forges past the enticing, thin sentiments try to kindle
stealthily fr
+ new[10, ] 9847.57 Supplier#000006345 FRANCE 86344
Manufacturer#1 68yX tGXAkVRSxUGNSjJdptw 8O878xaFnaoQK 16-886-766-7945 ges.
slyly regular requests are. ruthless, express excuses cajole blithely across
the unu
and 90 more ...
old$s_acctbal | new$s_acctbal
[1] 9973.1 - 9938.5 [1]
[2] 9955.6 - 9937.8 [2]
[3] 9946.5 - 9936.2 [3]
[4] 9938.9 - 9923.8 [4]
[5] 9916.9 - 9871.2 [5]
[6] 9884.9 - 9870.8 [6]
[7] 9884.9 - 9870.8 [7]
[8] 9836.4 - 9852.5 [8]
[9] 9836.4 - 9847.8 [9]
[10] 9781.9 - 9847.6 [10]
... ... ... and 90 more ...
old$s_name | new$s_name
[1] "Supplie#r000009288" - "Supplier#000005359" [1]
[2] "Supplie#r000008234" - "Supplier#000005969" [2]
[3] "Supplie#r000001975" - "Supplier#000005250" [3]
[4] "Supplie#r000006635" - "Supplier#000002324" [4]
[5] "Supplie#r000004383" - "Supplier#000006373" [5]
[6] "Supplie#r000009799" - "Supplier#000001286" [6]
[7] "Supplie#r000009799" - "Supplier#000001286" [7]
[8] "Supplie#r000001571" - "Supplier#000008973" [8]
[9] "Supplie#r000001571" - "Supplier#000008097" [9]
[10] "Supplie#r000005361" - "Supplier#000006345" [10]
... ... ... and 90 more ...
old$n_name | new$n_name
[1] "UNITED KINGDOM" | "UNITED KINGDOM" [1]
[2] "GERMANY" - "ROMANIA" [2]
[3] "RUSSIA" - "UNITED KINGDOM" [3]
[4] "FRANCE" - "GERMANY" [4]
[5] "ROMANIA" - "GERMANY" [5]
[6] "RUSSIA" - "GERMANY" [6]
[7] "RUSSIA" - "GERMANY" [7]
[8] "UNITED KINGDOM" - "RUSSIA" [8]
[9] "UNITED KINGDOM" - "RUSSIA" [9]
[10] "ROMANIA" - "FRANCE" [10]
... ... ... and 90 more ...
old$p_partkey | new$p_partkey
[1] 79287 - 185358 [1]
[2] 88233 - 108438 [2]
[3] 161974 - 249 [3]
[4] 41626 - 29821 [4]
[5] 84382 - 43868 [5]
[6] 24794 - 81285 [6]
[7] 34792 - 181285 [7]
[8] 39067 - 18972 [8]
[9] 134031 - 130557 [9]
[10] 2860 - 86344 [10]
... ... ... and 90 more ...
old$p_mfgr | new$p_mfgr
[1] "Manufacturer#4" | "Manufacturer#4" [1]
[2] "Manufacturer#5" - "Manufacturer#1" [2]
[3] "Manufacturer#1" - "Manufacturer#4" [3]
[4] "Manufacturer#5" - "Manufacturer#4" [4]
[5] "Manufacturer#5" | "Manufacturer#5" [5]
[6] "Manufacturer#5" - "Manufacturer#2" [6]
[7] "Manufacturer#2" - "Manufacturer#4" [7]
[8] "Manufacturer#3" - "Manufacturer#2" [8]
[9] "Manufacturer#2" | "Manufacturer#2" [9]
[10] "Manufacturer#4" - "Manufacturer#1" [10]
... ... ... and 90 more ...
old$s_address vs new$s_address
- "RiRNXmGRQdw1lbm5hcjV94"
+ "bgxj2K0w1kJvxYl5mhCfou,W"
- "yg83WC5I YIhQPeTMQmXa"
+ "rdnmd9c8EG1EIAYY3LPVa4yUNx6OwyVaQ"
- "wdsf5tQIrJKBNmu8I eY pmAv akzRDSydc"
+ "qX AB0vP8mJEWeBuY9jri"
- "kIxC63UH H0et,bSwBA6O81PJ"
+ "uXcnR7tv87dG"
- "tkEkQY F4J4JDu2WRYDk,owFCP7Qh1eOgHWSk9"
+ "iSLO35z7Ae"
- "YvXpPGXYAg2TJLZJDW3LtRqbv97RzF6As"
+ "3gq0mZLHI5OTM6 tBYmLTHZaulCYnlECzQ7nj"
- "YvXpPGXYAg2TJLZJDW3LtRqbv97RzF6As"
+ "3gq0mZLHI5OTM6 tBYmLTHZaulCYnlECzQ7nj"
- "8lAIPuLFKWVfDQO LXavjzxLkVNF0sw8a6 "
+ "zVfUT3Np22kUC05tYWHBotaR"
- "8lAIPuLFKWVfDQO LXavjzxLkVNF0sw8a6 "
+ "veMRTQBmUResNvfD3"
- "lZYHncM3vjsXc3PPqaFlTO"
+ "68yX tGXAkVRSxUGNSjJdptw 8O878xaFnaoQK"
and 90 more ...
old$s_phone | new$s_phone
[1] "33-645-115-8290" - "33-429-790-6131" [1]
[2] "17-560-538-8734" - "29-520-692-3537" [2]
[3] "32-937-928-5861" - "33-320-228-2957" [3]
[4] "16-750-637-3467" - "17-779-299-1839" [4]
[5] "29-283-925-2487" - "17-813-485-8637" [5]
[6] "32-406-413-9944" - "17-516-924-4574" [6]
[7] "32-406-413-9944" - "17-516-924-4574" [7]
[8] "33-788-833-6417" - "32-188-594-7038" [8]
[9] "33-788-833-6417" - "32-375-640-3593" [9]
[10] "29-345-620-9471" - "16-886-766-7945" [10]
... ... ... and 90 more ...
old$s_comment vs new$s_comment
- "tions snooze permanently final sentiments --daring, close pinto beans should
have to eat"
+ "uriously regular requests hag"
- "s promise; silent pinto beans may use? furious theodolites wake ruthlessly
duri"
+ "efully express instructions. regular requests against the slyly fin"
- "the sometimes permanent frets must haggle som"
+ "etect about the furiously final accounts. slyly ironic pinto beans sleep
inside the furiously"
- "e through the idly daring warhorses --permanent patterns serve; escapades
over th"
+ "ackages boost blithely. blithely regular deposits c"
- "telets could poach busy waters! bl"
+ "etect blithely bold asymptotes. fluffily ironic platelets wake furiously;
blit"
- " the always furious pains "
+ " regular accounts. furiously unusual courts above the fi"
- " the always furious pains "
+ " regular accounts. furiously unusual courts above the fi"
- " dogged orbits beyond the escapades shall "
+ "rns wake final foxes. carefully unusual depende"
- " dogged orbits beyond the escapades shall "
+ " the special excuses. silent sentiments serve carefully final ac"
- "players --daring, careful forges past the enticing, thin sentiments try to
kindle stealthily fr"
+ "ges. slyly regular requests are. ruthless, express excuses cajole blithely
across the unu"
and 90 more ...
{code}
> [C++] Add ability to generate TPC-H data that matches the reference data
> exactly
> --------------------------------------------------------------------------------
>
> Key: ARROW-16100
> URL: https://issues.apache.org/jira/browse/ARROW-16100
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Jonathan Keane
> Priority: Major
>
> When checking these against the known-good answers for scale factor 1, all of
> the queries are off (two are close enough that they get past arrowbench's
> pretty loose validation).
> Looking at the TPC-H tools, the validation for dbgen is:
> bq. b. Base Data Validation
> bq. The base data set is produced using cmd_base_sf<n> where <n> is the
> scale
> bq. factor to be generated. The resulting files will be produced in the
> current
> bq. working directory. The generated files will be of the form
> <name>.tbl.<nnn>,
> bq. where <name> will is the name of one of the tables in the TPCH schema,
> and
> bq. <nnn> identifies a particular data generation step.
> bq.
> bq. The file set produced by genbaserefdata.sh should match the
> <name>.tbl.<nnn>
> bq. files found in the reference data set for the same scale factor.
> And the data that this generator is producing does not conform to that. Even
> if I sort the data by columns that they appear to be in the dbgen (or duckdb)
> produced data, the data we get from our TPC-H generator does not match.
> We might want a mode where we produce random TPC-H like data. But for
> benchmarking we need a way to produce actual TPC-H compliant data out of the
> box (we can deal with rows in a shuffled order if we need to, but the content
> of the data must be the same.
> Maybe taking a look at [DuckDB's implementation of the random
> seeds|https://github.com/duckdb/duckdb/blob/master/extension/tpch/dbgen/dbgen.cpp#L25-L74]
> might help with a way to accomplish this?
> Here's an example of generating data with duckdb (first ten lines of
> lineitems — and it's the same each time I generate it):
> {code}
> > print(out, width = 500)
> l_orderkey l_partkey l_suppkey l_linenumber l_quantity l_extendedprice
> l_discount l_tax l_returnflag l_linestatus l_shipdate l_commitdate
> l_receiptdate l_shipinstruct l_shipmode l_comment
> 1 1 155190 7706 1 17 21168.23
> 0.04 0.02 N O 1996-03-13 1996-02-12 1996-03-22
> DELIVER IN PERSON TRUCK egular courts above the
> 2 1 67310 7311 2 36 45983.16
> 0.09 0.06 N O 1996-04-12 1996-02-28 1996-04-20
> TAKE BACK RETURN MAIL ly final dependencies: slyly bold
> 3 1 63700 3701 3 8 13309.60
> 0.10 0.02 N O 1996-01-29 1996-03-05 1996-01-31
> TAKE BACK RETURN REG AIR riously. regular, express dep
> 4 1 2132 4633 4 28 28955.64
> 0.09 0.06 N O 1996-04-21 1996-03-30 1996-05-16
> NONE AIR lites. fluffily even de
> 5 1 24027 1534 5 24 22824.48
> 0.10 0.04 N O 1996-03-30 1996-03-14 1996-04-01
> NONE FOB pending foxes. slyly re
> 6 1 15635 638 6 32 49620.16
> 0.07 0.02 N O 1996-01-30 1996-02-07 1996-02-03
> DELIVER IN PERSON MAIL arefully slyly ex
> 7 2 106170 1191 1 38 44694.46
> 0.00 0.05 N O 1997-01-28 1997-01-14 1997-02-02
> TAKE BACK RETURN RAIL ven requests. deposits breach a
> 8 3 4297 1798 1 45 54058.05
> 0.06 0.00 R F 1994-02-02 1994-01-04 1994-02-23
> NONE AIR ongside of the furiously brave acco
> 9 3 19036 6540 2 49 46796.47
> 0.10 0.00 R F 1993-11-09 1993-12-20 1993-11-24
> TAKE BACK RETURN RAIL unusual accounts. eve
> 10 3 128449 3474 3 27 39890.88
> 0.06 0.07 A F 1994-01-16 1993-11-22 1994-01-23
> DELIVER IN PERSON SHIP nal foxes wake.
> {code}
> And the first ten lines of lineitems I generated with the official dbgen:
> {code}
> head lineitem.tbl
> 1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER
> IN PERSON|TRUCK|egular courts above the|
> 1|67310|7311|2|36|45983.16|0.09|0.06|N|O|1996-04-12|1996-02-28|1996-04-20|TAKE
> BACK RETURN|MAIL|ly final dependencies: slyly bold |
> 1|63700|3701|3|8|13309.60|0.10|0.02|N|O|1996-01-29|1996-03-05|1996-01-31|TAKE
> BACK RETURN|REG AIR|riously. regular, express dep|
> 1|2132|4633|4|28|28955.64|0.09|0.06|N|O|1996-04-21|1996-03-30|1996-05-16|NONE|AIR|lites.
> fluffily even de|
> 1|24027|1534|5|24|22824.48|0.10|0.04|N|O|1996-03-30|1996-03-14|1996-04-01|NONE|FOB|
> pending foxes. slyly re|
> 1|15635|638|6|32|49620.16|0.07|0.02|N|O|1996-01-30|1996-02-07|1996-02-03|DELIVER
> IN PERSON|MAIL|arefully slyly ex|
> 2|106170|1191|1|38|44694.46|0.00|0.05|N|O|1997-01-28|1997-01-14|1997-02-02|TAKE
> BACK RETURN|RAIL|ven requests. deposits breach a|
> 3|4297|1798|1|45|54058.05|0.06|0.00|R|F|1994-02-02|1994-01-04|1994-02-23|NONE|AIR|ongside
> of the furiously brave acco|
> 3|19036|6540|2|49|46796.47|0.10|0.00|R|F|1993-11-09|1993-12-20|1993-11-24|TAKE
> BACK RETURN|RAIL| unusual accounts. eve|
> 3|128449|3474|3|27|39890.88|0.06|0.07|A|F|1994-01-16|1993-11-22|1994-01-23|DELIVER
> IN PERSON|SHIP|nal foxes wake. |
> {code}
> And the first ten lines of the validation file form the TPC-H tools:
> {code}
> head lineitem.tbl.1
> 1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER
> IN PERSON|TRUCK|egular courts above the|
> 1|67310|7311|2|36|45983.16|0.09|0.06|N|O|1996-04-12|1996-02-28|1996-04-20|TAKE
> BACK RETURN|MAIL|ly final dependencies: slyly bold |
> 1|63700|3701|3|8|13309.60|0.10|0.02|N|O|1996-01-29|1996-03-05|1996-01-31|TAKE
> BACK RETURN|REG AIR|riously. regular, express dep|
> 1|2132|4633|4|28|28955.64|0.09|0.06|N|O|1996-04-21|1996-03-30|1996-05-16|NONE|AIR|lites.
> fluffily even de|
> 1|24027|1534|5|24|22824.48|0.10|0.04|N|O|1996-03-30|1996-03-14|1996-04-01|NONE|FOB|
> pending foxes. slyly re|
> 1|15635|638|6|32|49620.16|0.07|0.02|N|O|1996-01-30|1996-02-07|1996-02-03|DELIVER
> IN PERSON|MAIL|arefully slyly ex|
> 2|106170|1191|1|38|44694.46|0.00|0.05|N|O|1997-01-28|1997-01-14|1997-02-02|TAKE
> BACK RETURN|RAIL|ven requests. deposits breach a|
> 3|4297|1798|1|45|54058.05|0.06|0.00|R|F|1994-02-02|1994-01-04|1994-02-23|NONE|AIR|ongside
> of the furiously brave acco|
> 3|19036|6540|2|49|46796.47|0.10|0.00|R|F|1993-11-09|1993-12-20|1993-11-24|TAKE
> BACK RETURN|RAIL| unusual accounts. eve|
> 3|128449|3474|3|27|39890.88|0.06|0.07|A|F|1994-01-16|1993-11-22|1994-01-23|DELIVER
> IN PERSON|SHIP|nal foxes wake. |
> {code}
> Note, you can generate this tpc-h data from
> https://github.com/apache/arrow/pull/12769 with the following. This shuffling
> is needed because we can only write datasets from execnodes (without
> materializing into memory entirely), so we move the files around as if they
> were single file writes:
> {code}
> path <- tpch_dbgen_write(1, "some/path")
> from_dataset_to_parquet <- function(path, scale_factor) {
> ds_files <- list.files(path, recursive = TRUE, full.names = TRUE)
> # we can only deal with single parquet files in each partition this way
> if (!all(grepl("data-0.parquet$", ds_files))) {
> stop("At least one partition has more than one file")
> }
>
> ds_files_to <- gsub(
> "/data-0.parquet",
> paste0("_", format(scale_factor, scientific = FALSE), ".parquet"),
> ds_files
> )
> file.rename(ds_files, ds_files_to)
>
> # cleanup empty folders, this might be a bit aggressive
> folders_to_remove <- gsub("/data-0.parquet", "", ds_files)
> unlink(folders_to_remove, recursive = TRUE)
> }
> from_dataset_to_parquet(path, 1)
> {code}
--
This message was sent by Atlassian Jira
(v8.20.1#820001)