michaelgaunt404 opened a new issue, #40255: URL: https://github.com/apache/arrow/issues/40255
### Describe the usage question you have. Please include as many useful details as possible.

Is there a tidyr::unnest equivalent for Arrow datasets spanning multiple Parquet files? I need to handle close to a hundred terabytes of Parquet files. Each file has an attribute containing nested tables, and within those tables there is another attribute holding OpenStreetMap IDs that require filtering. I need to cross-reference these IDs with attributes from another index. If it were a flat file or a long "tidy" data frame, this wouldn't be an issue, but the nested structure complicates matters with the Arrow dataset object.

Currently, I use an iterative approach, loading individual Parquet files into memory for filtering and saving (in fact, I do this in parallel across the available cores on my machine). However, I've come across Arrow datasets, and the ability to lazily define operations before loading the data could greatly improve speed. See the images below for reference on the data I'm working with.

### Component(s)

R

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
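For context, a minimal sketch of the iterative per-file workflow described above, written with the arrow, tidyr, and purrr R packages. The paths and the column names `nested_col` and `osm_id` are placeholders (assumptions, not from the original data); the nested column is unnested only after each file is read into memory, since this is the pattern the question says is currently in use:

```r
library(arrow)
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical input/output paths -- adjust to the real dataset layout.
files <- list.files("data/parquet", pattern = "\\.parquet$", full.names = TRUE)

# Hypothetical index of OpenStreetMap IDs to keep, loaded once up front.
keep_ids <- read_parquet("data/index.parquet")$osm_id

walk(files, function(f) {
  read_parquet(f) |>                 # materialize one file in memory
    unnest(nested_col) |>            # flatten the nested-table column
    filter(osm_id %in% keep_ids) |>  # cross-reference against the index
    write_parquet(file.path("data/filtered", basename(f)))
})
```

The `walk()` loop could be swapped for `furrr::future_walk()` to parallelize across cores, matching the parallel approach mentioned in the question.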
