Yannaubineau opened a new issue, #40050: URL: https://github.com/apache/arrow/issues/40050
### Describe the bug, including details regarding any error messages, version, and platform.

Saving a `data.frame` with a large attribute (such as an index, commonly used by the `data.table` package) makes the resulting Parquet file unreadable and produces this error:

```
Error in `open_dataset()`:
! IOError: Error creating dataset. Could not read schema from 'path/example.parquet'. Is this a 'parquet' file?: Could not open Parquet input source 'path/example.parquet': Couldn't deserialize thrift: TProtocolException: Exceeded size limit
```

As understood from this [Stack Overflow question](https://stackoverflow.com/questions/77982801/error-package-arrowr-read-parquet-open-dataset-couldnt-deserialize-thrift-t): the compact binary storage that Parquet affords to data is not applied to R attributes, so a large attribute (such as a `data.table` index) can break the format by exceeding the Thrift deserialization size limit on read.

### Reprex

```r
library(arrow)
library(data.table)

# Seed
set.seed(1L)

# A big enough data.table
dt = data.table(x = sample(1e5L, 1e7L, TRUE), y = runif(100L))

# Save in Parquet format
write_parquet(dt, "example_ok.parquet")

# Readable
dt_ok <- open_dataset("example_ok.parquet")

# A simple filter, which creates a large index on `x` as a side effect
dt[x == 989L]

# Save in Parquet format again, now carrying the index attribute
write_parquet(dt, "example_error.parquet")

# Error
dt_error <- open_dataset("example_error.parquet")
```

### Component(s)

Parquet, R
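As a possible workaround (not a fix for the underlying issue), dropping the `data.table` index before writing should keep the attribute out of the file's metadata. A minimal sketch, assuming the oversized index attribute is indeed what overflows the Thrift size limit:

```r
library(arrow)
library(data.table)

set.seed(1L)
dt <- data.table(x = sample(1e5L, 1e7L, TRUE), y = runif(100L))

# This filter creates a large secondary index on `x` as a side effect
dt[x == 989L]

# Remove all secondary indices before writing, so the large attribute
# is not embedded in the Parquet metadata
setindex(dt, NULL)

write_parquet(dt, "example_fixed.parquet")
open_dataset("example_fixed.parquet")  # should now be readable
```

Alternatively, stripping `data.table`-specific attributes via `as.data.frame(dt)` before `write_parquet()` should have the same effect.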
