[
https://issues.apache.org/jira/browse/ARROW-16100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Weston Pace updated ARROW-16100:
--------------------------------
Issue Type: Improvement (was: Bug)
> [C++] Add ability to generate TPC-H data that matches the reference data
> exactly
> --------------------------------------------------------------------------------
>
> Key: ARROW-16100
> URL: https://issues.apache.org/jira/browse/ARROW-16100
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Jonathan Keane
> Priority: Major
>
> An umbrella issue for a number of issues I've run into with our TPC-H
> generator.
> h2. We emit fixed_size_binary fields with nuls padding the strings.
> Ideally we would either emit these as utf8 strings like the others, or we
> would have a toggle to emit them as such (though see below about needing to
> strip the nuls).
> When I try to run a number of the TPC-H queries against these, I get seg
> faults or hangs.
> Additionally, even after converting these to utf8|string types, I also need
> to strip out the nuls in order to actually query against them:
> {code}
> library(arrow, warn.conflicts = FALSE)
> #> See arrow_info() for available features
> library(dplyr, warn.conflicts = FALSE)
> options(arrow.skip_nul = TRUE)
> tab <- read_parquet("data_arrow_raw/nation_1.parquet", as_data_frame = FALSE)
> tab
> #> Table
> #> 25 rows x 4 columns
> #> $N_NATIONKEY <int32>
> #> $N_NAME <fixed_size_binary[25]>
> #> $N_REGIONKEY <int32>
> #> $N_COMMENT <string>
> # This will not work (though this is how the TPC-H queries are structured)
> tab %>% filter(N_NAME == "JAPAN") %>% collect()
> #> # A tibble: 0 × 4
> #> # … with 4 variables: N_NATIONKEY <int>, N_NAME <fixed_size_binary<25>>,
> #> # N_REGIONKEY <int>, N_COMMENT <chr>
> # Instead, we need to create the nul padded string to do the comparison
> japan_raw <- as.raw(
>   c(0x4a, 0x41, 0x50, 0x41, 0x4e, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
>     0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
>     0x00)
> )
> # confirming this is the same thing as in the data
> japan_raw == as.vector(tab$N_NAME)[[13]]
> #> [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> #> [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> tab %>% filter(N_NAME == Scalar$create(japan_raw, type = fixed_size_binary(25))) %>% collect()
> #> # A tibble: 1 × 4
> #>   N_NATIONKEY
> #>         <int>
> #> 1          12
> #> # … with 3 more variables: N_NAME <fixed_size_binary<25>>, N_REGIONKEY <int>,
> #> #   N_COMMENT <chr>
> {code}
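> As an aside, the padded raw vector above doesn't have to be typed out by hand;
> here is a minimal sketch of building it with base R, padding to the same
> 25-byte width as the column:
> {code}
> japan_raw <- charToRaw("JAPAN")
> # pad with nul bytes out to the fixed_size_binary width of 25
> japan_raw <- c(japan_raw, as.raw(rep(0x00, 25 - length(japan_raw))))
> {code}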
> Here is the code I've been using to cast + strip these out after the fact:
> {code}
> library(arrow, warn.conflicts = FALSE)
> options(arrow.skip_nul = TRUE)
> options(arrow.use_altrep = FALSE)
> tables <- arrowbench:::tpch_tables
>
> for (table_name in tables) {
>   message("Working on ", table_name)
>   tab <- read_parquet(
>     glue::glue("./data_arrow_raw/{table_name}_1.parquet"),
>     as_data_frame = FALSE
>   )
>
>   for (col in tab$schema$fields) {
>     if (inherits(col$type, "FixedSizeBinary")) {
>       message("Rewriting ", col$name)
>       tab[[col$name]] <- Array$create(as.vector(tab[[col$name]]$cast(string())))
>     }
>   }
>
>   tab <- write_parquet(tab, glue::glue("./data/{table_name}_1.parquet"))
> }
> {code}
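> After that rewrite, the plain string comparison from the first snippet should
> work the way the TPC-H queries expect; a quick sanity check (dplyr loaded as
> before, paths matching the script above):
> {code}
> library(dplyr, warn.conflicts = FALSE)
> nation <- read_parquet("./data/nation_1.parquet", as_data_frame = FALSE)
> nation %>% filter(N_NAME == "JAPAN") %>% collect()
> {code}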
> h2. The data does not produce correct answers
> When checking these against the known-good answers for scale factor 1, all of
> the queries are off (two are close enough that they get past arrowbench's
> pretty loose validation).
> Looking at the TPC-H tools, the validation for dbgen is:
> bq. b. Base Data Validation
> bq. The base data set is produced using cmd_base_sf<n> where <n> is the scale
> bq. factor to be generated. The resulting files will be produced in the current
> bq. working directory. The generated files will be of the form <name>.tbl.<nnn>,
> bq. where <name> is the name of one of the tables in the TPCH schema, and
> bq. <nnn> identifies a particular data generation step.
> bq.
> bq. The file set produced by genbaserefdata.sh should match the <name>.tbl.<nnn>
> bq. files found in the reference data set for the same scale factor.
> And the data that this generator is producing does not conform to that. Even
> if I sort the data by the columns that the dbgen (or DuckDB) output appears to
> be sorted by, the data we get from our TPC-H generator does not match.
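> For the dbgen-produced .tbl files, the validation described above can be as
> simple as a byte-for-byte comparison against the reference data; a minimal
> sketch (file names follow the quote above and are illustrative):
> {code}
> # compare a generated table file against the corresponding reference file
> identical(
>   readLines("lineitem.tbl"),
>   readLines("lineitem.tbl.1")
> )
> {code}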
> We might want a mode where we produce random TPC-H-like data, but for
> benchmarking we need a way to produce actual TPC-H compliant data out of the
> box (we can deal with rows in a shuffled order if we need to, but the content
> of the data must be the same).
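> Since row order is the one thing we could tolerate differing, here is a sketch
> of the kind of order-insensitive content check that would be acceptable (the
> helper and the key columns in the example are just illustrative):
> {code}
> # sort both data frames on a set of key columns, then compare the contents
> same_content <- function(x, y, keys) {
>   x <- x[do.call(order, x[keys]), , drop = FALSE]
>   y <- y[do.call(order, y[keys]), , drop = FALSE]
>   isTRUE(all.equal(x, y, check.attributes = FALSE))
> }
> # e.g. same_content(ours, reference, keys = c("L_ORDERKEY", "L_LINENUMBER"))
> {code}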
> Maybe taking a look at [DuckDB's implementation of the random
> seeds|https://github.com/duckdb/duckdb/blob/master/extension/tpch/dbgen/dbgen.cpp#L25-L74]
> would help us figure out how to accomplish this?
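> For reference, here is a rough sketch of how data like the sample below can be
> generated from R with DuckDB's tpch extension (assuming the duckdb package and
> the extension are available; this is illustrative, not necessarily how I
> generated it):
> {code}
> library(DBI)
> con <- dbConnect(duckdb::duckdb())
> dbExecute(con, "INSTALL tpch")
> dbExecute(con, "LOAD tpch")
> # generate the scale factor 1 tables in-memory
> dbExecute(con, "CALL dbgen(sf = 1)")
> # grab the first ten rows of lineitem
> out <- dbGetQuery(con, "SELECT * FROM lineitem LIMIT 10")
> {code}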
> Here are the first ten rows of lineitem generated with DuckDB (and the output
> is the same each time I generate it):
> {code}
> > print(out, width = 500)
> l_orderkey l_partkey l_suppkey l_linenumber l_quantity l_extendedprice l_discount l_tax l_returnflag l_linestatus l_shipdate l_commitdate l_receiptdate l_shipinstruct l_shipmode l_comment
> 1 1 155190 7706 1 17 21168.23 0.04 0.02 N O 1996-03-13 1996-02-12 1996-03-22 DELIVER IN PERSON TRUCK egular courts above the
> 2 1 67310 7311 2 36 45983.16 0.09 0.06 N O 1996-04-12 1996-02-28 1996-04-20 TAKE BACK RETURN MAIL ly final dependencies: slyly bold
> 3 1 63700 3701 3 8 13309.60 0.10 0.02 N O 1996-01-29 1996-03-05 1996-01-31 TAKE BACK RETURN REG AIR riously. regular, express dep
> 4 1 2132 4633 4 28 28955.64 0.09 0.06 N O 1996-04-21 1996-03-30 1996-05-16 NONE AIR lites. fluffily even de
> 5 1 24027 1534 5 24 22824.48 0.10 0.04 N O 1996-03-30 1996-03-14 1996-04-01 NONE FOB pending foxes. slyly re
> 6 1 15635 638 6 32 49620.16 0.07 0.02 N O 1996-01-30 1996-02-07 1996-02-03 DELIVER IN PERSON MAIL arefully slyly ex
> 7 2 106170 1191 1 38 44694.46 0.00 0.05 N O 1997-01-28 1997-01-14 1997-02-02 TAKE BACK RETURN RAIL ven requests. deposits breach a
> 8 3 4297 1798 1 45 54058.05 0.06 0.00 R F 1994-02-02 1994-01-04 1994-02-23 NONE AIR ongside of the furiously brave acco
> 9 3 19036 6540 2 49 46796.47 0.10 0.00 R F 1993-11-09 1993-12-20 1993-11-24 TAKE BACK RETURN RAIL unusual accounts. eve
> 10 3 128449 3474 3 27 39890.88 0.06 0.07 A F 1994-01-16 1993-11-22 1994-01-23 DELIVER IN PERSON SHIP nal foxes wake.
> {code}
> And the first ten lines of lineitem that I generated with the official dbgen:
> {code}
> head lineitem.tbl
> 1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER IN PERSON|TRUCK|egular courts above the|
> 1|67310|7311|2|36|45983.16|0.09|0.06|N|O|1996-04-12|1996-02-28|1996-04-20|TAKE BACK RETURN|MAIL|ly final dependencies: slyly bold |
> 1|63700|3701|3|8|13309.60|0.10|0.02|N|O|1996-01-29|1996-03-05|1996-01-31|TAKE BACK RETURN|REG AIR|riously. regular, express dep|
> 1|2132|4633|4|28|28955.64|0.09|0.06|N|O|1996-04-21|1996-03-30|1996-05-16|NONE|AIR|lites. fluffily even de|
> 1|24027|1534|5|24|22824.48|0.10|0.04|N|O|1996-03-30|1996-03-14|1996-04-01|NONE|FOB| pending foxes. slyly re|
> 1|15635|638|6|32|49620.16|0.07|0.02|N|O|1996-01-30|1996-02-07|1996-02-03|DELIVER IN PERSON|MAIL|arefully slyly ex|
> 2|106170|1191|1|38|44694.46|0.00|0.05|N|O|1997-01-28|1997-01-14|1997-02-02|TAKE BACK RETURN|RAIL|ven requests. deposits breach a|
> 3|4297|1798|1|45|54058.05|0.06|0.00|R|F|1994-02-02|1994-01-04|1994-02-23|NONE|AIR|ongside of the furiously brave acco|
> 3|19036|6540|2|49|46796.47|0.10|0.00|R|F|1993-11-09|1993-12-20|1993-11-24|TAKE BACK RETURN|RAIL| unusual accounts. eve|
> 3|128449|3474|3|27|39890.88|0.06|0.07|A|F|1994-01-16|1993-11-22|1994-01-23|DELIVER IN PERSON|SHIP|nal foxes wake. |
> {code}
> And the first ten lines of the validation file from the TPC-H tools:
> {code}
> head lineitem.tbl.1
> 1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER IN PERSON|TRUCK|egular courts above the|
> 1|67310|7311|2|36|45983.16|0.09|0.06|N|O|1996-04-12|1996-02-28|1996-04-20|TAKE BACK RETURN|MAIL|ly final dependencies: slyly bold |
> 1|63700|3701|3|8|13309.60|0.10|0.02|N|O|1996-01-29|1996-03-05|1996-01-31|TAKE BACK RETURN|REG AIR|riously. regular, express dep|
> 1|2132|4633|4|28|28955.64|0.09|0.06|N|O|1996-04-21|1996-03-30|1996-05-16|NONE|AIR|lites. fluffily even de|
> 1|24027|1534|5|24|22824.48|0.10|0.04|N|O|1996-03-30|1996-03-14|1996-04-01|NONE|FOB| pending foxes. slyly re|
> 1|15635|638|6|32|49620.16|0.07|0.02|N|O|1996-01-30|1996-02-07|1996-02-03|DELIVER IN PERSON|MAIL|arefully slyly ex|
> 2|106170|1191|1|38|44694.46|0.00|0.05|N|O|1997-01-28|1997-01-14|1997-02-02|TAKE BACK RETURN|RAIL|ven requests. deposits breach a|
> 3|4297|1798|1|45|54058.05|0.06|0.00|R|F|1994-02-02|1994-01-04|1994-02-23|NONE|AIR|ongside of the furiously brave acco|
> 3|19036|6540|2|49|46796.47|0.10|0.00|R|F|1993-11-09|1993-12-20|1993-11-24|TAKE BACK RETURN|RAIL| unusual accounts. eve|
> 3|128449|3474|3|27|39890.88|0.06|0.07|A|F|1994-01-16|1993-11-22|1994-01-23|DELIVER IN PERSON|SHIP|nal foxes wake. |
> {code}
> Note: you can generate this TPC-H data from
> https://github.com/apache/arrow/pull/12769 with the following. The file
> shuffling is needed because we can only write datasets (not single files) from
> exec nodes without materializing everything into memory, so we move the files
> around as if they were single file writes:
> {code}
> path <- tpch_dbgen_write(1, "some/path")
> from_dataset_to_parquet <- function(path, scale_factor) {
>   ds_files <- list.files(path, recursive = TRUE, full.names = TRUE)
>   # we can only deal with single parquet files in each partition this way
>   if (!all(grepl("data-0.parquet$", ds_files))) {
>     stop("At least one partition has more than one file")
>   }
>
>   ds_files_to <- gsub(
>     "/data-0.parquet",
>     paste0("_", format(scale_factor, scientific = FALSE), ".parquet"),
>     ds_files
>   )
>   file.rename(ds_files, ds_files_to)
>
>   # cleanup empty folders, this might be a bit aggressive
>   folders_to_remove <- gsub("/data-0.parquet", "", ds_files)
>   unlink(folders_to_remove, recursive = TRUE)
> }
> from_dataset_to_parquet(path, 1)
> {code}
--
This message was sent by Atlassian Jira
(v8.20.1#820001)