[ 
https://issues.apache.org/jira/browse/ARROW-16100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-16100:
--------------------------------
    Issue Type: Improvement  (was: Bug)

> [C++] Add ability to generate TPC-H data that matches the reference data 
> exactly
> --------------------------------------------------------------------------------
>
>                 Key: ARROW-16100
>                 URL: https://issues.apache.org/jira/browse/ARROW-16100
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Jonathan Keane
>            Priority: Major
>
> An umbrella issue for a number of issues I've run into with our TPC-H 
> generator.
> h2. We emit fixed_size_binary fields with nuls padding the strings.
> Ideally we would either emit these as utf8 strings like the others, or we 
> would have a toggle to emit them as such (though see below about needing to 
> strip nuls)
> When I try and run these through the I get a number of seg faults or hangs 
> when running a number of the TPC-H queries.
> Additionally, even converting these to utf8|string types, I also need to 
> strip out the nuls in order to actually query against them:
> {code}
> library(arrow, warn.conflicts = FALSE)
> #> See arrow_info() for available features
> library(dplyr, warn.conflicts = FALSE)
> options(arrow.skip_nul = TRUE)
> tab <- read_parquet("data_arrow_raw/nation_1.parquet", as_data_frame = FALSE)
> tab
> #> Table
> #> 25 rows x 4 columns
> #> $N_NATIONKEY <int32>
> #> $N_NAME <fixed_size_binary[25]>
> #> $N_REGIONKEY <int32>
> #> $N_COMMENT <string>
> # This will not work (Though is how the TPC-H queries are structured)
> tab %>% filter(N_NAME == "JAPAN") %>% collect()
> #> # A tibble: 0 × 4
> #> # … with 4 variables: N_NATIONKEY <int>, N_NAME <fixed_size_binary<25>>,
> #> #   N_REGIONKEY <int>, N_COMMENT <chr>
> # Instead, we need to create the nul padded string to do the comparison
> japan_raw <- as.raw(
>   c(0x4a, 0x41, 0x50, 0x41, 0x4e, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 
> 0x00, 
>     0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
> )
> # confirming this is the same thing as in the data 
> japan_raw == as.vector(tab$N_NAME)[[13]]
> #>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE 
> TRUE
> #> [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> tab %>% filter(N_NAME == Scalar$create(japan_raw, type = 
> fixed_size_binary(25))) %>% collect()
> #> # A tibble: 1 × 4
> #>   N_NATIONKEY
> #>         <int>
> #> 1          12
> #> # … with 3 more variables: N_NAME <fixed_size_binary<25>>, N_REGIONKEY 
> <int>,
> #> #   N_COMMENT <chr>
> {code}
> Here is the code I've been using to cast + strip these out after the fact:
> {code}
> library(arrow, warn.conflicts = FALSE)
> options(arrow.skip_nul = TRUE)
> options(arrow.use_altrep = FALSE)
> tables <- arrowbench:::tpch_tables
>   
> for (table_name in tables) {
>   message("Working on ", table_name)
>   tab <- read_parquet(glue::glue("./data_arrow_raw/{table_name}_1.parquet"), 
> as_data_frame=FALSE)
>   
>   for (col in tab$schema$fields) {
>     if (inherits(col$type, "FixedSizeBinary")) {
>       message("Rewritting ", col$name)
>       tab[[col$name]] <- 
> Array$create(as.vector(tab[[col$name]]$cast(string())))
>     }
>   }
>   
>   tab <- write_parquet(tab, glue::glue("./data/{table_name}_1.parquet"))
> }
> {code}
> h2.  The data does not produce correct answers
> When checking these against the known-good answers for scale factor 1, all of 
> the queries are off (two are close enough that they get past arrowbench's 
> pretty loose validation).
> Looking at the TPC-H tools, the validation for dbgen is:
> bq. b. Base Data Validation
> bq.   The base data set is produced using cmd_base_sf<n> where <n> is the 
> scale 
> bq.   factor to be generated. The resulting files will be produced in the 
> current 
> bq.   working directory.  The generated files will be of the form 
> <name>.tbl.<nnn>,
> bq.   where <name> will is the name of one of the tables in the TPCH schema, 
> and 
> bq.   <nnn> identifies a particular data generation step.
> bq. 
> bq.   The file set produced by genbaserefdata.sh should match the 
> <name>.tbl.<nnn> 
> bq.   files found in the reference data set for the same scale factor.
> And the data that this generator is producing does not conform to that. Even 
> if I sort the data by columns that they appear to be in the dbgen (or duckdb) 
> produced data, the data we get from our TPC-H generator does not match. 
> We might want a mode where we produce random TPC-H like data. But for 
> benchmarking we need a way to produce actual TPC-H compliant data out of the 
> box (we can deal with rows in a shuffled order if we need to, but the content 
> of the data must be the same.
> Maybe taking a look at [DuckDB's implementation of the random 
> seeds|https://github.com/duckdb/duckdb/blob/master/extension/tpch/dbgen/dbgen.cpp#L25-L74]
>  might help with a way to accomplish this?
> Here's an example of generating data with duckdb (first ten lines of 
> lineitems — and it's the same each time I generate it):
> {code}
> > print(out, width = 500)
>    l_orderkey l_partkey l_suppkey l_linenumber l_quantity l_extendedprice 
> l_discount l_tax l_returnflag l_linestatus l_shipdate l_commitdate 
> l_receiptdate    l_shipinstruct l_shipmode                           l_comment
> 1           1    155190      7706            1         17        21168.23     
>   0.04  0.02            N            O 1996-03-13   1996-02-12    1996-03-22 
> DELIVER IN PERSON      TRUCK             egular courts above the
> 2           1     67310      7311            2         36        45983.16     
>   0.09  0.06            N            O 1996-04-12   1996-02-28    1996-04-20  
> TAKE BACK RETURN       MAIL  ly final dependencies: slyly bold 
> 3           1     63700      3701            3          8        13309.60     
>   0.10  0.02            N            O 1996-01-29   1996-03-05    1996-01-31  
> TAKE BACK RETURN    REG AIR       riously. regular, express dep
> 4           1      2132      4633            4         28        28955.64     
>   0.09  0.06            N            O 1996-04-21   1996-03-30    1996-05-16  
>             NONE        AIR             lites. fluffily even de
> 5           1     24027      1534            5         24        22824.48     
>   0.10  0.04            N            O 1996-03-30   1996-03-14    1996-04-01  
>             NONE        FOB             pending foxes. slyly re
> 6           1     15635       638            6         32        49620.16     
>   0.07  0.02            N            O 1996-01-30   1996-02-07    1996-02-03 
> DELIVER IN PERSON       MAIL                   arefully slyly ex
> 7           2    106170      1191            1         38        44694.46     
>   0.00  0.05            N            O 1997-01-28   1997-01-14    1997-02-02  
> TAKE BACK RETURN       RAIL     ven requests. deposits breach a
> 8           3      4297      1798            1         45        54058.05     
>   0.06  0.00            R            F 1994-02-02   1994-01-04    1994-02-23  
>             NONE        AIR ongside of the furiously brave acco
> 9           3     19036      6540            2         49        46796.47     
>   0.10  0.00            R            F 1993-11-09   1993-12-20    1993-11-24  
> TAKE BACK RETURN       RAIL               unusual accounts. eve
> 10          3    128449      3474            3         27        39890.88     
>   0.06  0.07            A            F 1994-01-16   1993-11-22    1994-01-23 
> DELIVER IN PERSON       SHIP                    nal foxes wake. 
> {code}
> And the first ten lines of lineitems I generated with the official dbgen:
> {code}
> head lineitem.tbl
> 1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER
>  IN PERSON|TRUCK|egular courts above the|
> 1|67310|7311|2|36|45983.16|0.09|0.06|N|O|1996-04-12|1996-02-28|1996-04-20|TAKE
>  BACK RETURN|MAIL|ly final dependencies: slyly bold |
> 1|63700|3701|3|8|13309.60|0.10|0.02|N|O|1996-01-29|1996-03-05|1996-01-31|TAKE 
> BACK RETURN|REG AIR|riously. regular, express dep|
> 1|2132|4633|4|28|28955.64|0.09|0.06|N|O|1996-04-21|1996-03-30|1996-05-16|NONE|AIR|lites.
>  fluffily even de|
> 1|24027|1534|5|24|22824.48|0.10|0.04|N|O|1996-03-30|1996-03-14|1996-04-01|NONE|FOB|
>  pending foxes. slyly re|
> 1|15635|638|6|32|49620.16|0.07|0.02|N|O|1996-01-30|1996-02-07|1996-02-03|DELIVER
>  IN PERSON|MAIL|arefully slyly ex|
> 2|106170|1191|1|38|44694.46|0.00|0.05|N|O|1997-01-28|1997-01-14|1997-02-02|TAKE
>  BACK RETURN|RAIL|ven requests. deposits breach a|
> 3|4297|1798|1|45|54058.05|0.06|0.00|R|F|1994-02-02|1994-01-04|1994-02-23|NONE|AIR|ongside
>  of the furiously brave acco|
> 3|19036|6540|2|49|46796.47|0.10|0.00|R|F|1993-11-09|1993-12-20|1993-11-24|TAKE
>  BACK RETURN|RAIL| unusual accounts. eve|
> 3|128449|3474|3|27|39890.88|0.06|0.07|A|F|1994-01-16|1993-11-22|1994-01-23|DELIVER
>  IN PERSON|SHIP|nal foxes wake. |
> {code}
> And the first ten lines of the validation file form the TPC-H tools:
> {code}
> head lineitem.tbl.1
> 1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER
>  IN PERSON|TRUCK|egular courts above the|
> 1|67310|7311|2|36|45983.16|0.09|0.06|N|O|1996-04-12|1996-02-28|1996-04-20|TAKE
>  BACK RETURN|MAIL|ly final dependencies: slyly bold |
> 1|63700|3701|3|8|13309.60|0.10|0.02|N|O|1996-01-29|1996-03-05|1996-01-31|TAKE 
> BACK RETURN|REG AIR|riously. regular, express dep|
> 1|2132|4633|4|28|28955.64|0.09|0.06|N|O|1996-04-21|1996-03-30|1996-05-16|NONE|AIR|lites.
>  fluffily even de|
> 1|24027|1534|5|24|22824.48|0.10|0.04|N|O|1996-03-30|1996-03-14|1996-04-01|NONE|FOB|
>  pending foxes. slyly re|
> 1|15635|638|6|32|49620.16|0.07|0.02|N|O|1996-01-30|1996-02-07|1996-02-03|DELIVER
>  IN PERSON|MAIL|arefully slyly ex|
> 2|106170|1191|1|38|44694.46|0.00|0.05|N|O|1997-01-28|1997-01-14|1997-02-02|TAKE
>  BACK RETURN|RAIL|ven requests. deposits breach a|
> 3|4297|1798|1|45|54058.05|0.06|0.00|R|F|1994-02-02|1994-01-04|1994-02-23|NONE|AIR|ongside
>  of the furiously brave acco|
> 3|19036|6540|2|49|46796.47|0.10|0.00|R|F|1993-11-09|1993-12-20|1993-11-24|TAKE
>  BACK RETURN|RAIL| unusual accounts. eve|
> 3|128449|3474|3|27|39890.88|0.06|0.07|A|F|1994-01-16|1993-11-22|1994-01-23|DELIVER
>  IN PERSON|SHIP|nal foxes wake. |
> {code}
> Note, you can generate this tpc-h data from 
> https://github.com/apache/arrow/pull/12769 with the following. This shuffling 
> is needed because we can only write datasets from execnodes (without 
> materializing into memory entirely), so we move the files around as if they 
> were single file writes:
> {code}
> path <- tpch_dbgen_write(1, "some/path")
> from_dataset_to_parquet <- function(path, scale_factor) {
>   ds_files <- list.files(path, recursive = TRUE, full.names = TRUE)
>   # we can only deal with single parquet files in each partition this way
>   if (!all(grepl("data-0.parquet$", ds_files))) {
>     stop("At least one partition has more than one file")
>   }
>   
>   ds_files_to <- gsub(
>     "/data-0.parquet", 
>     paste0("_", format(scale_factor, scientific = FALSE), ".parquet"), 
>     ds_files
>   )
>   file.rename(ds_files, ds_files_to)
>   
>   # cleanup empty folders, this might be a bit aggressive
>   folders_to_remove <- gsub("/data-0.parquet", "", ds_files)
>   unlink(folders_to_remove, recursive = TRUE)
> }
> from_dataset_to_parquet(path, 1)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to