JosephWagner opened a new issue, #39873:
URL: https://github.com/apache/arrow/issues/39873
### Describe the usage question you have. Please include as many useful
details as possible.
Let's say I have a very large partitioned parquet dataset. I've written out
a `_metadata` file, and it is over 2 GB. So when I try to run
`ds.parquet_dataset("parquet_dataset_partitioned/_metadata", partitioning="hive")`,
I get a `Couldn't deserialize thrift: TProtocolException: Exceeded size limit`
error.
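For completeness, the full reproduction as I run it looks roughly like this
(the path is a placeholder for my actual dataset):

```python
import pyarrow.dataset as ds

# The _metadata footer for this dataset is over 2 GB, so deserializing it
# blows past the default thrift size limit and raises:
#   Couldn't deserialize thrift: TProtocolException: Exceeded size limit
dataset = ds.parquet_dataset(
    "parquet_dataset_partitioned/_metadata",
    partitioning="hive",
)
```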
I see that I can use
`pq.ParquetDataset("parquet_dataset_partitioned/_metadata", thrift_container_size_limit=1000000000)`,
but the returned object isn't as useful as a `FileSystemDataset`. For example,
I can't filter on partitioning columns (because they are not part of the info
in the `_metadata` file).
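As a sketch of what I'm after: I would have hoped that raising the thrift
limit through the dataset API keeps the `FileSystemDataset`, something like
the snippet below. I haven't confirmed whether the limit from
`ParquetFragmentScanOptions` is consulted at all when `parquet_dataset`
deserializes the `_metadata` footer, or only later when scanning individual
fragments, so treat this as a guess rather than a known workaround.

```python
import pyarrow.dataset as ds

# Guess: raise the thrift container size limit on the format's scan options
# and pass the format to parquet_dataset. It is unclear (to me) whether this
# limit is applied to the _metadata footer itself or only to per-file
# footers during scanning.
parquet_format = ds.ParquetFileFormat(
    default_fragment_scan_options=ds.ParquetFragmentScanOptions(
        thrift_container_size_limit=1_000_000_000,
    )
)

dataset = ds.parquet_dataset(
    "parquet_dataset_partitioned/_metadata",
    format=parquet_format,
    partitioning="hive",
)
```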
For context, I am trying to minimize the latency of instantiating a
`FileSystemDataset` so that it can be used interactively (e.g. imagine a CLI
or an endpoint). I had assumed that concatenating all the individual file
metadata into a `_metadata` footer would let me bypass 1) running `stat` on a
bunch of folders during file discovery and 2) performing a bunch of smaller IO
reads for each parquet file footer.

I can pass a list of files to `ds.parquet_dataset`, but it still takes a
little over a second to instantiate; a rough timing sketch is below. I assume
(but haven't verified) that using one big `_metadata` file would be faster.
Now that I think about it, maybe that's a silly assumption. Perhaps decoding
thrift would take most of the time either way.
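Concretely, the file-list variant I'm timing looks roughly like this (sketched
with `ds.dataset`, which accepts a list of paths; the file names are made up,
and in my case the list comes from a cached listing):

```python
import time
import pyarrow.dataset as ds

# Hypothetical file list; in practice there are thousands of entries.
files = [
    "parquet_dataset_partitioned/year=2023/month=1/part-0.parquet",
    "parquet_dataset_partitioned/year=2023/month=2/part-0.parquet",
    # ...
]

start = time.perf_counter()
# partitioning="hive" parses key=value path segments into partition columns.
dataset = ds.dataset(files, format="parquet", partitioning="hive")
print(f"instantiated in {time.perf_counter() - start:.2f}s")
```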
### Component(s)
Parquet