Yannaubineau opened a new issue, #40050: URL: https://github.com/apache/arrow/issues/40050
### Describe the bug, including details regarding any error messages, version, and platform.

Saving a `data.frame` with a large attribute (such as an index, commonly used by the `data.table` package) makes the resulting Parquet file unreadable and produces this error:

```
Error in `open_dataset()`:
! IOError: Error creating dataset. Could not read schema from 'path/example.parquet'. Is this a 'parquet' file?: Could not open Parquet input source 'path/example.parquet': Couldn't deserialize thrift: TProtocolException: Exceeded size limit
```

As understood from this [Stack Overflow question](https://stackoverflow.com/questions/77982801/error-package-arrowr-read-parquet-open-dataset-couldnt-deserialize-thrift-t): the compact binary storage that Parquet affords to data is not applied to R attributes, so a large attribute (such as a `data.table` index) can break the format by exceeding the Thrift deserialization size limit on read.

### Reprex

```r
library(arrow)
library(data.table)

# Seed
set.seed(1L)

# A big enough data.table
dt = data.table(x = sample(1e5L, 1e7L, TRUE), y = runif(100L))

# Save in Parquet format
write_parquet(dt, "example_ok.parquet")

# Readable
dt_ok <- open_dataset("example_ok.parquet")

# A simple filter, which creates a large index on `x` as a side effect
dt[x == 989L]

# Save in Parquet format again, now carrying the index attribute
write_parquet(dt, "example_error.parquet")

# Error
dt_error <- open_dataset("example_error.parquet")
```

### Component(s)

Parquet, R
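As a possible workaround (not a fix for the underlying issue), dropping the `data.table` index before writing should keep the attribute out of the file's metadata. A minimal sketch, assuming the oversized index attribute is indeed what overflows the Thrift size limit:

```r
library(arrow)
library(data.table)

set.seed(1L)
dt <- data.table(x = sample(1e5L, 1e7L, TRUE), y = runif(100L))

# This filter creates a large secondary index on `x` as a side effect
dt[x == 989L]

# Remove all secondary indices before writing, so the large attribute
# is not embedded in the Parquet metadata
setindex(dt, NULL)

write_parquet(dt, "example_fixed.parquet")
open_dataset("example_fixed.parquet")  # should now be readable
```

Alternatively, stripping `data.table`-specific attributes via `as.data.frame(dt)` before `write_parquet()` should have the same effect.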
