lgaborini opened a new issue, #38804:
URL: https://github.com/apache/arrow/issues/38804
### Describe the bug, including details regarding any error messages, version, and platform.
With {arrow} 14.0.0, I was able to import a large number of CSVs, merge their schemas, establish a partitioning and write everything out with `arrow::write_dataset()`. The dataset can be manipulated in memory and queried without issues, but most queries fail once it is read back from disk. The number of rows is also different.
While building the reprex, I was able to pin down the error thanks to a base R error mentioning the presence of embedded nuls in the partitioning variable. This is not always reproducible, though. Also, queries fail at random points: it looks like a race condition, but threads are disabled according to `arrow.use_threads`.
Reprex [here](https://github.com/lgaborini/arrow-reprex). If anyone has tips on how to further trim down the dataset, I can refine it.
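As a first diagnostic (my own sketch, not part of the linked reprex), one could scan the partitioning variable for control characters, since those end up in the Hive partition directory names; this uses the `tbl_input` object loaded in the example below:
```r
# Hypothetical check, not in the reprex: look for control characters in the
# partitioning variable, since its values become Hive partition directory names.
vals <- unique(tbl_input$var_2)
vals[grepl("[[:cntrl:]]", vals)]
```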
## Example
### Reading
No issues here:
``` r
tbl_input <-
readr::read_rds(fs::path_expand("~/temp/reprex/data_reprex_small.rds"))
# It prints correctly
tbl_input
#> # A tibble: 245 × 2
#> var_1 var_2
#> <chr> <chr>
#> 1 a FIL
#> 2 a FIL
#> 3 a FIL
#> 4 a FIL
#> 5 a FIL
#> 6 a FIL
#> 7 a FIL
#> 8 a FIL
#> 9 a FIL
#> 10 a FIL
#> # ℹ 235 more rows
tbl_input$var_1
#> [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
"a"
#> [19] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
"a"
#> [37] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
"a"
#> [55] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
"a"
#> [73] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
"a"
#> [91] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
"a"
#> [109] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
"a"
#> [127] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
"a"
#> [145] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
"a"
#> [163] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
"a"
#> [181] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
"a"
#> [199] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
"a"
#> [217] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
"a"
#> [235] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
tbl_input$var_2
#> [1] "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL"
"FIL"
#> [11] "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL"
"FIL"
#> [21] "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL"
"FIL"
#> [31] "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL"
"FIL"
#> [41] "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL"
"FIL"
#> [51] "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL"
"FIL"
#> [61] "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL"
"FIL"
#> [71] "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL"
"FIL"
#> [81] "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL"
"FIL"
#> [91] "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL"
"FIL"
#> [101] "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL"
"FIL"
#> [111] "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL"
"FIL"
#> [121] "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL" "FIL"
"FILX"
#> [131] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
"FILX"
#> [141] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
"FILX"
#> [151] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
"FILX"
#> [161] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
"FILX"
#> [171] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
"FILX"
#> [181] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
"FILX"
#> [191] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
"FILX"
#> [201] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
"FILX"
#> [211] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
"FILX"
#> [221] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
"FILX"
#> [231] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
"FILX"
#> [241] "FILX" "FIL" "FIL" "FILx" "FILx"
# Input number of rows
nrow(tbl_input)
#> [1] 245
# Input queries: OK
tbl_input |> dplyr::filter(var_2 == "FIL") |> dplyr::collect()
#> # A tibble: 131 × 2
#> var_1 var_2
#> <chr> <chr>
#> 1 a FIL
#> 2 a FIL
#> 3 a FIL
#> 4 a FIL
#> 5 a FIL
#> 6 a FIL
#> 7 a FIL
#> 8 a FIL
#> 9 a FIL
#> 10 a FIL
#> # ℹ 121 more rows
tbl_input |> dplyr::count(var_1) |> dplyr::collect()
#> # A tibble: 1 × 2
#> var_1 n
#> <chr> <int>
#> 1 a 245
```
### Writing
```r
f_dataset_merged <- tempfile()
tbl_input |>
arrow::write_dataset(
path = f_dataset_merged,
partitioning = "var_2"
)
```
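As a possible workaround (an assumption on my side, not a confirmed fix), stripping control characters from the partitioning column before writing would keep any suspect bytes out of the directory names:
```r
# Hypothetical workaround: sanitize the partitioning column before writing,
# so the partition directory names contain no control characters.
tbl_input |>
  dplyr::mutate(var_2 = gsub("[[:cntrl:]]", "", var_2)) |>
  arrow::write_dataset(
    path = tempfile(),
    partitioning = "var_2"
  )
```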
### Reading from disk
Usually `arrow::open_dataset()` is silent (I also used to get errors at this step).
Subsequent queries might fail:
```r
tbl_written <- arrow::open_dataset(
sources = f_dataset_merged,
partitioning = arrow::hive_partition()
)
# Queries that may or may not fail
# Different number of rows
nrow(tbl_written)
#> Error: IOError: Could not open Parquet input source '/Temp/RtmpaKq6Pd/file3598288b2763/var_2=FILX/part-0.parquet': Couldn't deserialize thrift: TProtocolException: Invalid data
tbl_written |> dplyr::filter(var_2 == "FIL") |> dplyr::collect()
#> # A tibble: 131 × 2
#> var_1 var_2
#> <chr> <chr>
#> 1 a FIL
#> 2 a FIL
#> 3 a FIL
#> 4 a FIL
#> 5 a FIL
#> 6 a FIL
#> 7 a FIL
#> 8 a FIL
#> 9 a FIL
#> 10 a FIL
#> # ℹ 121 more rows
tbl_written |> dplyr::count(var_1) |> dplyr::collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! IOError: Could not open Parquet input source '/Temp/RtmpaKq6Pd/file3598288b2763/var_2=FILX/part-0.parquet': Couldn't deserialize thrift: TProtocolException: Invalid data
#> Backtrace:
#> ▆
#> 1. ├─dplyr::collect(dplyr::count(tbl_written, var_1))
#> 2. └─arrow:::collect.arrow_dplyr_query(dplyr::count(tbl_written, var_1))
#> 3. └─arrow:::compute.arrow_dplyr_query(x)
#> 4. └─base::tryCatch(...)
#> 5. └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#> 6. └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#> 7. └─value[[3L]](cond)
#> 8. └─arrow:::augment_io_error_msg(e, call, schema = schema())
#> 9. └─rlang::abort(msg, call = call)
```
<sup>Created on 2023-11-20 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
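To isolate the corrupt file, a sketch like the following (my assumption: reading each fragment with plain `arrow::read_parquet()`) walks the written directory and reports which fragments fail:
```r
# Hypothetical isolation step: read each Parquet fragment on its own to
# find which file fails to deserialize.
files <- list.files(f_dataset_merged, pattern = "\\.parquet$",
                    recursive = TRUE, full.names = TRUE)
for (f in files) {
  res <- tryCatch(
    nrow(arrow::read_parquet(f)),
    error = function(e) conditionMessage(e)
  )
  cat(f, "->", format(res), "\n")
}
```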
### Another failure mode
Just by relaunching the reprex, the reading step fails at a different point. This time the row count can be computed, but it differs from the input:
```r
# Queries that may or may not fail
# Different number of rows
nrow(tbl_written)
#> [1] 243
tbl_written |> dplyr::filter(var_2 == "FIL") |> dplyr::collect()
#> # A tibble: 131 × 2
#> var_1 var_2
#> <chr> <chr>
#> 1 a FIL
#> 2 a FIL
#> 3 a FIL
#> 4 a FIL
#> 5 a FIL
#> 6 a FIL
#> 7 a FIL
#> 8 a FIL
#> 9 a FIL
#> 10 a FIL
#> # ℹ 121 more rows
tbl_written |> dplyr::count(var_1) |> dplyr::collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! IOError: Couldn't deserialize thrift: No more data to read.
#> Deserializing page header failed.
#> Backtrace:
#> ▆
#> 1. ├─dplyr::collect(dplyr::count(tbl_written, var_1))
#> 2. └─arrow:::collect.arrow_dplyr_query(dplyr::count(tbl_written, var_1))
#> 3. └─arrow:::compute.arrow_dplyr_query(x)
#> 4. └─base::tryCatch(...)
#> 5. └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#> 6. └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#> 7. └─value[[3L]](cond)
#> 8. └─arrow:::augment_io_error_msg(e, call, schema = schema())
#> 9. └─rlang::abort(msg, call = call)
```
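Since the failure moves between runs, it could help to record the fragments and their on-disk sizes after each write; a small sketch, assuming the opened dataset exposes its file list through `$files`:
```r
# Hypothetical bookkeeping between runs: list the resolved fragments and
# their sizes, to see whether the failing file changes from run to run.
frags <- tbl_written$files
data.frame(fragment = frags, bytes = file.size(frags))
```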
This is the output of `arrow::arrow_info()`:
``` r
arrow::arrow_info()
#> Arrow package version: 14.0.0
#>
#> Capabilities:
#>
#> acero TRUE
#> dataset TRUE
#> substrait FALSE
#> parquet TRUE
#> json TRUE
#> s3 TRUE
#> gcs TRUE
#> utf8proc TRUE
#> re2 TRUE
#> snappy TRUE
#> gzip TRUE
#> brotli TRUE
#> zstd TRUE
#> lz4 TRUE
#> lz4_frame TRUE
#> lzo FALSE
#> bz2 TRUE
#> jemalloc FALSE
#> mimalloc TRUE
#>
#> Arrow options():
#>
#> arrow.use_threads FALSE
#>
#> Memory:
#>
#> Allocator mimalloc
#> Current 0 bytes
#> Max 0 bytes
#>
#> Runtime:
#>
#> SIMD Level avx2
#> Detected SIMD Level avx2
#>
#> Build:
#>
#> C++ Library Version 14.0.0
#> C++ Compiler GNU
#> C++ Compiler Version 10.3.0
#> Git ID 2dcee3f82c6cf54b53a64729fd81840efa583244
```
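Because the symptoms resemble a race, it might also be worth pinning the Arrow C++ thread pool to a single thread, in addition to the R-level option (my assumption being that `arrow.use_threads` alone may not cover every code path):
```r
# Hypothetical extra step: force single-threaded execution at the C++ level
# as well, to rule out threading entirely.
arrow::set_cpu_count(1)
options(arrow.use_threads = FALSE)
```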
### Component(s)
R