lgaborini opened a new issue, #38804:
   ### Describe the bug, including details regarding any error messages, version, and platform.
   With {arrow} 14.0.0, I was able to import a large number of CSVs, merge the 
schemas, establish a partitioning and write them with `arrow::write_dataset()`. 
   The datasets can be manipulated in-memory and queried without issues, but 
most queries fail once the dataset is read from disk.
   The number of rows is also different.
   While building the reprex, I was able to pin down the error thanks to a 
base R error mentioning the presence of embedded nuls in the partitioning 
   This is not always reproducible, though.
   Also, queries fail at random points: it sounds like a race condition, but 
threads are disabled according to `arrow.use_threads`.
   Reprex [here](https://github.com/lgaborini/arrow-reprex)
   If anyone has tips on how to trim the dataset down further, I can refine it.
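   Since `partitioning = "var_2"` turns each distinct value into a directory name (the errors refer to paths such as `var_2=FILX/part-0.parquet`), a quick sanity check on the partition values before writing may help narrow things down. The sketch below uses base R only. `find_case_collisions()` is a hypothetical helper, not part of {arrow}, and the idea that values differing only in letter case could end up in the same directory on a case-insensitive file system is an assumption that this report does not confirm.

``` r
# Hypothetical helper (not an {arrow} function): return groups of distinct
# partition values that differ only by letter case.
find_case_collisions <- function(values) {
  distinct <- unique(as.character(values))
  # Group the distinct values by their lower-cased form
  groups <- split(distinct, tolower(distinct))
  # Keep only the groups that hold more than one spelling
  unname(groups[lengths(groups) > 1])
}

# Toy vector mimicking the kind of values seen in `var_2` (assumed, for illustration)
var_2 <- c(rep("FIL", 3), rep("FILX", 2), "FILx")
collisions <- find_case_collisions(var_2)
print(collisions)
```

   An empty list means no two values differ only by case. Any group that is returned is a candidate to inspect or recode before calling `arrow::write_dataset()`.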
   ## Example 
   ### Reading
   No issues here:
   ``` r
   tbl_input <- 
   # It prints correctly
   #> # A tibble: 245 × 2
   #>    var_1 var_2
   #>    <chr> <chr>
   #>  1 a     FIL  
   #>  2 a     FIL  
   #>  3 a     FIL  
   #>  4 a     FIL  
   #>  5 a     FIL  
   #>  6 a     FIL  
   #>  7 a     FIL  
   #>  8 a     FIL  
   #>  9 a     FIL  
   #> 10 a     FIL  
   #> # ℹ 235 more rows
   #>   [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" 
   #>  [19] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" 
   #>  [37] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" 
   #>  [55] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" 
   #>  [73] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" 
   #>  [91] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" 
   #> [109] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" 
   #> [127] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" 
   #> [145] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" 
   #> [163] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" 
   #> [181] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" 
   #> [199] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" 
   #> [217] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" 
   #> [235] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
   #>   [1] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  
   #>  [11] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  
   #>  [21] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  
   #>  [31] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  
   #>  [41] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  
   #>  [51] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  
   #>  [61] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  
   #>  [71] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  
   #>  [81] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  
   #>  [91] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  
   #> [101] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  
   #> [111] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  
   #> [121] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  
   #> [131] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" 
   #> [141] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" 
   #> [151] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" 
   #> [161] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" 
   #> [171] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" 
   #> [181] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" 
   #> [191] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" 
   #> [201] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" 
   #> [211] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" 
   #> [221] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" 
   #> [231] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" 
   #> [241] "FILX" "FIL"  "FIL"  "FILx" "FILx"
   # Input number of rows
   #> [1] 245
   # Input queries: OK
   tbl_input |> dplyr::filter(var_2 == "FIL") |> dplyr::collect()
   #> # A tibble: 131 × 2
   #>    var_1 var_2
   #>    <chr> <chr>
   #>  1 a     FIL  
   #>  2 a     FIL  
   #>  3 a     FIL  
   #>  4 a     FIL  
   #>  5 a     FIL  
   #>  6 a     FIL  
   #>  7 a     FIL  
   #>  8 a     FIL  
   #>  9 a     FIL  
   #> 10 a     FIL  
   #> # ℹ 121 more rows
   tbl_input |> dplyr::count(var_1) |> dplyr::collect()
   #> # A tibble: 1 × 2
   #>   var_1     n
   #>   <chr> <int>
   #> 1 a       245
   ```
   ### Writing 
   ``` r
   f_dataset_merged <- tempfile()
   tbl_input |>
      arrow::write_dataset(
         path = f_dataset_merged,
         partitioning = "var_2"
      )
   ```
   ### Reading from disk
   Usually `arrow::open_dataset()` is silent (I also used to get errors 
   Subsequent queries might fail:
   ``` r
   tbl_written <- arrow::open_dataset(
      sources = f_dataset_merged,
      partitioning = arrow::hive_partition()
   )
   # Queries that may or may not fail
   # Different number of rows
   #> Error: IOError: Could not open Parquet input source '/Temp/RtmpaKq6Pd/file3598288b2763/var_2=FILX/part-0.parquet': Couldn't deserialize thrift: TProtocolException: Invalid data
   tbl_written |> dplyr::filter(var_2 == "FIL") |> dplyr::collect()
   #> # A tibble: 131 × 2
   #>    var_1 var_2
   #>    <chr> <chr>
   #>  1 a     FIL  
   #>  2 a     FIL  
   #>  3 a     FIL  
   #>  4 a     FIL  
   #>  5 a     FIL  
   #>  6 a     FIL  
   #>  7 a     FIL  
   #>  8 a     FIL  
   #>  9 a     FIL  
   #> 10 a     FIL  
   #> # ℹ 121 more rows
   tbl_written |> dplyr::count(var_1) |> dplyr::collect()
   #> Error in `compute.arrow_dplyr_query()`:
   #> ! IOError: Could not open Parquet input source '/Temp/RtmpaKq6Pd/file3598288b2763/var_2=FILX/part-0.parquet': Couldn't deserialize thrift: TProtocolException: Invalid data
   #> Backtrace:
   #>     ▆
   #>  1. ├─dplyr::collect(dplyr::count(tbl_written, var_1))
   #>  2. └─arrow:::collect.arrow_dplyr_query(dplyr::count(tbl_written, var_1))
   #>  3.   └─arrow:::compute.arrow_dplyr_query(x)
   #>  4.     └─base::tryCatch(...)
   #>  5.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
   #>  6.         └─base (local) tryCatchOne(expr, names, parentenv, 
   #>  7.           └─value[[3L]](cond)
   #>  8.             └─arrow:::augment_io_error_msg(e, call, schema = schema())
   #>  9.               └─rlang::abort(msg, call = call)
   ```
   <sup>Created on 2023-11-20 with [reprex 
   ### Another failure mode
   Just by relaunching the reprex, the reading part fails with a different error.
   The number of rows can be computed this time, but it differs from the input (243 instead of 245):
   ``` r
   # Queries that may or may not fail
   # Different number of rows
   #> [1] 243
   tbl_written |> dplyr::filter(var_2 == "FIL") |> dplyr::collect()
   #> # A tibble: 131 × 2
   #>    var_1 var_2
   #>    <chr> <chr>
   #>  1 a     FIL  
   #>  2 a     FIL  
   #>  3 a     FIL  
   #>  4 a     FIL  
   #>  5 a     FIL  
   #>  6 a     FIL  
   #>  7 a     FIL  
   #>  8 a     FIL  
   #>  9 a     FIL  
   #> 10 a     FIL  
   #> # ℹ 121 more rows
   tbl_written |> dplyr::count(var_1) |> dplyr::collect()
   #> Error in `compute.arrow_dplyr_query()`:
   #> ! IOError: Couldn't deserialize thrift: No more data to read.
   #> Deserializing page header failed.
   #> Backtrace:
   #>     ▆
   #>  1. ├─dplyr::collect(dplyr::count(tbl_written, var_1))
   #>  2. └─arrow:::collect.arrow_dplyr_query(dplyr::count(tbl_written, var_1))
   #>  3.   └─arrow:::compute.arrow_dplyr_query(x)
   #>  4.     └─base::tryCatch(...)
   #>  5.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
   #>  6.         └─base (local) tryCatchOne(expr, names, parentenv, 
   #>  7.           └─value[[3L]](cond)
   #>  8.             └─arrow:::augment_io_error_msg(e, call, schema = schema())
   #>  9.               └─rlang::abort(msg, call = call)
   ```
   This is `arrow::arrow_info()`:
   ``` r
   #> Arrow package version: 14.0.0
   #> Capabilities:
   #> acero      TRUE
   #> dataset    TRUE
   #> substrait FALSE
   #> parquet    TRUE
   #> json       TRUE
   #> s3         TRUE
   #> gcs        TRUE
   #> utf8proc   TRUE
   #> re2        TRUE
   #> snappy     TRUE
   #> gzip       TRUE
   #> brotli     TRUE
   #> zstd       TRUE
   #> lz4        TRUE
   #> lz4_frame  TRUE
   #> lzo       FALSE
   #> bz2        TRUE
   #> jemalloc  FALSE
   #> mimalloc   TRUE
   #> Arrow options():
   #> arrow.use_threads FALSE
   #> Memory:
   #> Allocator mimalloc
   #> Current    0 bytes
   #> Max        0 bytes
   #> Runtime:
   #> SIMD Level          avx2
   #> Detected SIMD Level avx2
   #> Build:
   #> C++ Library Version                                    14.0.0
   #> C++ Compiler                                              GNU
   #> C++ Compiler Version                                   10.3.0
   #> Git ID               2dcee3f82c6cf54b53a64729fd81840efa583244
   ```
