lgaborini opened a new issue, #38804: URL: https://github.com/apache/arrow/issues/38804
### Describe the bug, including details regarding any error messages, version, and platform.

With {arrow} 14.0.0, I was able to import a large number of CSVs, merge the schemas, establish a partitioning, and write them with `arrow::write_dataset()`. The datasets can be manipulated in-memory and queried without issues, but most queries fail once the dataset is read from disk. The number of rows is also different.

While building the reprex, I was able to pin down the error thanks to a base R error mentioning the presence of embedded nuls in the partitioning variable. This is not always reproducible, though. Also, queries fail at random points: it looks like a race condition, but threads are disabled according to `arrow.use_threads`.

Reprex [here](https://github.com/lgaborini/arrow-reprex). If anyone has tips on how to trim down the dataset further, I can refine it.

## Example

### Reading

No issues here:

``` r
tbl_input <- readr::read_rds(fs::path_expand("~/temp/reprex/data_reprex_small.rds"))

# It prints correctly
tbl_input
#> # A tibble: 245 × 2
#>    var_1 var_2
#>    <chr> <chr>
#>  1 a     FIL
#>  2 a     FIL
#>  3 a     FIL
#>  4 a     FIL
#>  5 a     FIL
#>  6 a     FIL
#>  7 a     FIL
#>  8 a     FIL
#>  9 a     FIL
#> 10 a     FIL
#> # ℹ 235 more rows

tbl_input$var_1
#>   [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
#>  [19] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
#>  [37] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
#>  [55] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
#>  [73] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
#>  [91] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
#> [109] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
#> [127] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
#> [145] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
#> [163] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
#> [181] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
#> [199] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
#> [217] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
#> [235] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"

tbl_input$var_2
#>   [1] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"
#>  [11] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"
#>  [21] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"
#>  [31] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"
#>  [41] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"
#>  [51] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"
#>  [61] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"
#>  [71] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"
#>  [81] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"
#>  [91] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"
#> [101] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"
#> [111] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"
#> [121] "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FIL"  "FILX"
#> [131] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
#> [141] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
#> [151] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
#> [161] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
#> [171] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
#> [181] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
#> [191] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
#> [201] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
#> [211] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
#> [221] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
#> [231] "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX" "FILX"
#> [241] "FILX" "FIL"  "FIL"  "FILx" "FILx"

# Input number of rows
nrow(tbl_input)
#> [1] 245

# Input queries: OK
tbl_input |>
  dplyr::filter(var_2 == "FIL") |>
  dplyr::collect()
#> # A tibble: 131 × 2
#>    var_1 var_2
#>    <chr> <chr>
#>  1 a     FIL
#>  2 a     FIL
#>  3 a     FIL
#>  4 a     FIL
#>  5 a     FIL
#>  6 a     FIL
#>  7 a     FIL
#>  8 a     FIL
#>  9 a     FIL
#> 10 a     FIL
#> # ℹ 121 more rows

tbl_input |>
  dplyr::count(var_1) |>
  dplyr::collect()
#> # A tibble: 1 × 2
#>   var_1     n
#>   <chr> <int>
#> 1 a       245
```

### Writing

```r
f_dataset_merged <- tempfile()

tbl_input |>
  arrow::write_dataset(
    path = f_dataset_merged,
    partitioning = "var_2"
  )
```

### Reading from disk

Usually `arrow::open_dataset()` is silent (I also used to get errors here). Subsequent queries might fail:

```r
tbl_written <- arrow::open_dataset(
  sources = f_dataset_merged,
  partitioning = arrow::hive_partition()
)

# Queries that may or may not fail
# Different number of rows
nrow(tbl_written)
#> Error: IOError: Could not open Parquet input source '/Temp/RtmpaKq6Pd/file3598288b2763/var_2=FILX/part-0.parquet': Couldn't deserialize thrift: TProtocolException: Invalid data

tbl_written |>
  dplyr::filter(var_2 == "FIL") |>
  dplyr::collect()
#> # A tibble: 131 × 2
#>    var_1 var_2
#>    <chr> <chr>
#>  1 a     FIL
#>  2 a     FIL
#>  3 a     FIL
#>  4 a     FIL
#>  5 a     FIL
#>  6 a     FIL
#>  7 a     FIL
#>  8 a     FIL
#>  9 a     FIL
#> 10 a     FIL
#> # ℹ 121 more rows

tbl_written |>
  dplyr::count(var_1) |>
  dplyr::collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! IOError: Could not open Parquet input source '/Temp/RtmpaKq6Pd/file3598288b2763/var_2=FILX/part-0.parquet': Couldn't deserialize thrift: TProtocolException: Invalid data
#> Backtrace:
#>     ▆
#> 1. ├─dplyr::collect(dplyr::count(tbl_written, var_1))
#> 2. └─arrow:::collect.arrow_dplyr_query(dplyr::count(tbl_written, var_1))
#> 3. └─arrow:::compute.arrow_dplyr_query(x)
#> 4. └─base::tryCatch(...)
#> 5.
└─base (local) tryCatchList(expr, classes, parentenv, handlers)
#> 6. └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#> 7. └─value[[3L]](cond)
#> 8. └─arrow:::augment_io_error_msg(e, call, schema = schema())
#> 9. └─rlang::abort(msg, call = call)
```

<sup>Created on 2023-11-20 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>

### Another failure mode

Just by relaunching the reprex, the reading part fails with another path. The row number can be computed (but it's different):

```r
# Queries that may or may not fail
# Different number of rows
nrow(tbl_written)
#> [1] 243

tbl_written |>
  dplyr::filter(var_2 == "FIL") |>
  dplyr::collect()
#> # A tibble: 131 × 2
#>    var_1 var_2
#>    <chr> <chr>
#>  1 a     FIL
#>  2 a     FIL
#>  3 a     FIL
#>  4 a     FIL
#>  5 a     FIL
#>  6 a     FIL
#>  7 a     FIL
#>  8 a     FIL
#>  9 a     FIL
#> 10 a     FIL
#> # ℹ 121 more rows

tbl_written |>
  dplyr::count(var_1) |>
  dplyr::collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! IOError: Couldn't deserialize thrift: No more data to read.
#> Deserializing page header failed.
#> Backtrace:
#>     ▆
#> 1. ├─dplyr::collect(dplyr::count(tbl_written, var_1))
#> 2. └─arrow:::collect.arrow_dplyr_query(dplyr::count(tbl_written, var_1))
#> 3. └─arrow:::compute.arrow_dplyr_query(x)
#> 4. └─base::tryCatch(...)
#> 5. └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#> 6. └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#> 7. └─value[[3L]](cond)
#> 8. └─arrow:::augment_io_error_msg(e, call, schema = schema())
#> 9. └─rlang::abort(msg, call = call)
```

This is `arrow::arrow_info()`:

``` r
arrow::arrow_info()
#> Arrow package version: 14.0.0
#>
#> Capabilities:
#>
#> acero      TRUE
#> dataset    TRUE
#> substrait FALSE
#> parquet    TRUE
#> json       TRUE
#> s3         TRUE
#> gcs        TRUE
#> utf8proc   TRUE
#> re2        TRUE
#> snappy     TRUE
#> gzip       TRUE
#> brotli     TRUE
#> zstd       TRUE
#> lz4        TRUE
#> lz4_frame  TRUE
#> lzo       FALSE
#> bz2        TRUE
#> jemalloc  FALSE
#> mimalloc   TRUE
#>
#> Arrow options():
#>
#> arrow.use_threads FALSE
#>
#> Memory:
#>
#> Allocator mimalloc
#> Current   0 bytes
#> Max       0 bytes
#>
#> Runtime:
#>
#> SIMD Level          avx2
#> Detected SIMD Level avx2
#>
#> Build:
#>
#> C++ Library Version 14.0.0
#> C++ Compiler        GNU
#> C++ Compiler Version 10.3.0
#> Git ID 2dcee3f82c6cf54b53a64729fd81840efa583244
```

### Component(s)

R
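Two diagnostics that may help narrow this down. This is only a sketch: `check_fragments()` and `has_embedded_nul()` are hypothetical helper names, and `f_dataset_merged` refers to the dataset directory from the reprex above. The first helper opens each written Parquet fragment on its own, which should isolate which partition file is corrupted; the second scans a file's raw bytes for embedded nuls, the symptom base R reported for the partitioning variable (demonstrated here on a throwaway file, since R character vectors cannot themselves hold nuls).

```r
# Hypothetical diagnostic: read each written Parquet fragment individually
# to find out which partition file fails to deserialize.
check_fragments <- function(dataset_dir) {
  files <- list.files(dataset_dir, pattern = "\\.parquet$",
                      recursive = TRUE, full.names = TRUE)
  vapply(files, function(f) {
    tryCatch({ nrow(arrow::read_parquet(f)); "ok" },
             error = function(e) conditionMessage(e))
  }, character(1))
}
# In the reprex session one would run: check_fragments(f_dataset_merged)

# Hypothetical diagnostic: scan a file's raw bytes for embedded nul bytes,
# e.g. on one of the original input CSVs before importing.
has_embedded_nul <- function(path) {
  raw_bytes <- readBin(path, what = "raw", n = file.size(path))
  any(raw_bytes == as.raw(0))
}

# Demo on a throwaway file that deliberately contains a nul byte:
f_demo <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("var_2\nFIL"), as.raw(0), charToRaw("X\n")), f_demo)
has_embedded_nul(f_demo)
#> [1] TRUE
```

If a nul byte sits inside a partitioning value, the hive-style directory name derived from it could plausibly explain both the wrong row count and the non-deterministic read failures.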