lgaborini opened a new issue, #38804: URL: https://github.com/apache/arrow/issues/38804
### Describe the bug, including details regarding any error messages, version, and platform.

With {arrow} 14.0.0, I was able to import a large number of CSVs, merge the schemas, establish a partitioning, and write them with `arrow::write_dataset()`. The datasets can be manipulated in-memory and queried without issues, but most queries fail once the dataset is read from disk. The number of rows is also different.

While building the reprex, I was able to pin down the error thanks to a base R error mentioning the presence of embedded nuls in the partitioning variable. This is not always reproducible, though. Also, queries fail at random points: it looks like a race condition, but threads are disabled according to `arrow.use_threads`.

Reprex [here](https://github.com/lgaborini/arrow-reprex). If anyone has tips on how to trim down the dataset further, I can refine it.

## Example

### Reading

No issues here:

``` r
tbl_input <- readr::read_rds(fs::path_expand("~/temp/reprex/data_reprex_small.rds"))

# It prints correctly
tbl_input
#> # A tibble: 245 × 2
#>    var_1 var_2
#>    <chr> <chr>
#>  1 a     FIL
#>  2 a     FIL
#>  3 a     FIL
#>  4 a     FIL
#>  5 a     FIL
#>  6 a     FIL
#>  7 a     FIL
#>  8 a     FIL
#>  9 a     FIL
#> 10 a     FIL
#> # ℹ 235 more rows

tbl_input$var_1
#>   [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
#>  [19] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
#>  [37] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
#>  [55] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
#>  [73] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
#>  [91] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
#> [109] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
#> [127] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
#> [145] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
#> [163] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
#> [181] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
#> [199] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
#> [217] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
#> [235] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"

tbl_input$var_2
#>   [1] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"
#>  [11] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"
#>  [21] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"
#>  [31] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"
#>  [41] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"
#>  [51] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"
#>  [61] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"
#>  [71] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"
#>  [81] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"
#>  [91] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"
#> [101] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"
#> [111] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"
#> [121] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FILX"
#> [131] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
#> [141] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
#> [151] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
#> [161] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
#> [171] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
#> [181] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
#> [191] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
#> [201] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
#> [211] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
#> [221] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
#> [231] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
#> [241] "FILX" "FIL"  "FIL"  "FILx" "FILx"

# Input number of rows
nrow(tbl_input)
#> [1] 245

# Input queries: OK
tbl_input |>
  dplyr::filter(var_2 == "FIL") |>
  dplyr::collect()
#> # A tibble: 131 × 2
#>    var_1 var_2
#>    <chr> <chr>
#>  1 a     FIL
#>  2 a     FIL
#>  3 a     FIL
#>  4 a     FIL
#>  5 a     FIL
#>  6 a     FIL
#>  7 a     FIL
#>  8 a     FIL
#>  9 a     FIL
#> 10 a     FIL
#> # ℹ 121 more rows

tbl_input |>
  dplyr::count(var_1) |>
  dplyr::collect()
#> # A tibble: 1 × 2
#>   var_1     n
#>   <chr> <int>
#> 1 a       245
```

### Writing

```r
f_dataset_merged <- tempfile()

tbl_input |>
  arrow::write_dataset(
    path = f_dataset_merged,
    partitioning = "var_2"
  )
```

### Reading from disk

Usually `arrow::open_dataset()` is silent (I also used to get errors here). Subsequent queries might fail:

```r
tbl_written <- arrow::open_dataset(
  sources = f_dataset_merged,
  partitioning = arrow::hive_partition()
)

# Queries that may or may not fail
# Different number of rows
nrow(tbl_written)
#> Error: IOError: Could not open Parquet input source '/Temp/RtmpaKq6Pd/file3598288b2763/var_2=FILX/part-0.parquet': Couldn't deserialize thrift: TProtocolException: Invalid data

tbl_written |>
  dplyr::filter(var_2 == "FIL") |>
  dplyr::collect()
#> # A tibble: 131 × 2
#>    var_1 var_2
#>    <chr> <chr>
#>  1 a     FIL
#>  2 a     FIL
#>  3 a     FIL
#>  4 a     FIL
#>  5 a     FIL
#>  6 a     FIL
#>  7 a     FIL
#>  8 a     FIL
#>  9 a     FIL
#> 10 a     FIL
#> # ℹ 121 more rows

tbl_written |>
  dplyr::count(var_1) |>
  dplyr::collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! IOError: Could not open Parquet input source '/Temp/RtmpaKq6Pd/file3598288b2763/var_2=FILX/part-0.parquet': Couldn't deserialize thrift: TProtocolException: Invalid data
#> Backtrace:
#>     ▆
#> 1. ├─dplyr::collect(dplyr::count(tbl_written, var_1))
#> 2. └─arrow:::collect.arrow_dplyr_query(dplyr::count(tbl_written, var_1))
#> 3. └─arrow:::compute.arrow_dplyr_query(x)
#> 4. └─base::tryCatch(...)
#> 5.
└─base (local) tryCatchList(expr, classes, parentenv, handlers)
#> 6. └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#> 7. └─value[[3L]](cond)
#> 8. └─arrow:::augment_io_error_msg(e, call, schema = schema())
#> 9. └─rlang::abort(msg, call = call)
```

<sup>Created on 2023-11-20 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>

### Another failure mode

Just by relaunching the reprex, the reading part fails with another path. The row number can be computed (but it's different):

```r
# Queries that may or may not fail
# Different number of rows
nrow(tbl_written)
#> [1] 243

tbl_written |>
  dplyr::filter(var_2 == "FIL") |>
  dplyr::collect()
#> # A tibble: 131 × 2
#>    var_1 var_2
#>    <chr> <chr>
#>  1 a     FIL
#>  2 a     FIL
#>  3 a     FIL
#>  4 a     FIL
#>  5 a     FIL
#>  6 a     FIL
#>  7 a     FIL
#>  8 a     FIL
#>  9 a     FIL
#> 10 a     FIL
#> # ℹ 121 more rows

tbl_written |>
  dplyr::count(var_1) |>
  dplyr::collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! IOError: Couldn't deserialize thrift: No more data to read.
#> Deserializing page header failed.
#> Backtrace:
#>     ▆
#> 1. ├─dplyr::collect(dplyr::count(tbl_written, var_1))
#> 2. └─arrow:::collect.arrow_dplyr_query(dplyr::count(tbl_written, var_1))
#> 3. └─arrow:::compute.arrow_dplyr_query(x)
#> 4. └─base::tryCatch(...)
#> 5. └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#> 6. └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#> 7. └─value[[3L]](cond)
#> 8. └─arrow:::augment_io_error_msg(e, call, schema = schema())
#> 9. └─rlang::abort(msg, call = call)
```

This is `arrow::arrow_info()`:

``` r
arrow::arrow_info()
#> Arrow package version: 14.0.0
#>
#> Capabilities:
#>
#> acero      TRUE
#> dataset    TRUE
#> substrait FALSE
#> parquet    TRUE
#> json       TRUE
#> s3         TRUE
#> gcs        TRUE
#> utf8proc   TRUE
#> re2        TRUE
#> snappy     TRUE
#> gzip       TRUE
#> brotli     TRUE
#> zstd       TRUE
#> lz4        TRUE
#> lz4_frame  TRUE
#> lzo       FALSE
#> bz2        TRUE
#> jemalloc  FALSE
#> mimalloc   TRUE
#>
#> Arrow options():
#>
#> arrow.use_threads FALSE
#>
#> Memory:
#>
#> Allocator mimalloc
#> Current   0 bytes
#> Max       0 bytes
#>
#> Runtime:
#>
#> SIMD Level          avx2
#> Detected SIMD Level avx2
#>
#> Build:
#>
#> C++ Library Version 14.0.0
#> C++ Compiler        GNU
#> C++ Compiler Version 10.3.0
#> Git ID 2dcee3f82c6cf54b53a64729fd81840efa583244
```

### Component(s)

R
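Two diagnostics that may help narrow this down. This is only a sketch: `check_fragments()` and `has_embedded_nul()` are hypothetical helper names, and `f_dataset_merged` refers to the dataset directory from the reprex above. The first helper opens each written Parquet fragment on its own, which should isolate which partition file is corrupted; the second scans a file's raw bytes for embedded nuls, the symptom base R reported for the partitioning variable (demonstrated here on a throwaway file, since R character vectors cannot themselves hold nuls).

```r
# Hypothetical diagnostic: read each written Parquet fragment individually
# to find out which partition file fails to deserialize.
check_fragments <- function(dataset_dir) {
  files <- list.files(dataset_dir, pattern = "\\.parquet$",
                      recursive = TRUE, full.names = TRUE)
  vapply(files, function(f) {
    tryCatch({ nrow(arrow::read_parquet(f)); "ok" },
             error = function(e) conditionMessage(e))
  }, character(1))
}
# In the reprex session one would run: check_fragments(f_dataset_merged)

# Hypothetical diagnostic: scan a file's raw bytes for embedded nul bytes,
# e.g. on one of the original input CSVs before importing.
has_embedded_nul <- function(path) {
  raw_bytes <- readBin(path, what = "raw", n = file.size(path))
  any(raw_bytes == as.raw(0))
}

# Demo on a throwaway file that deliberately contains a nul byte:
f_demo <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("var_2\nFIL"), as.raw(0), charToRaw("X\n")), f_demo)
has_embedded_nul(f_demo)
#> [1] TRUE
```

If a nul byte sits inside a partitioning value, the hive-style directory name derived from it could plausibly explain both the wrong row count and the non-deterministic read failures.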