JosephWagner opened a new issue, #39873:
URL: https://github.com/apache/arrow/issues/39873
### Describe the usage question you have. Please include as many useful
details as possible.
Let's say I have a very large partitioned parquet dataset. I've written out
a `_metadata` file, and it is over 2 GB. So when I try to run
`ds.parquet_dataset("parquet_dataset_partitioned/_metadata", partitioning="hive")`,
I get a `Couldn't deserialize thrift: TProtocolException: Exceeded size limit`
error.
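For completeness, the full reproduction as I run it looks roughly like this
(the path is a placeholder for my actual dataset):

```python
import pyarrow.dataset as ds

# The _metadata footer for this dataset is over 2 GB, so deserializing it
# blows past the default thrift size limit and raises:
#   Couldn't deserialize thrift: TProtocolException: Exceeded size limit
dataset = ds.parquet_dataset(
    "parquet_dataset_partitioned/_metadata",
    partitioning="hive",
)
```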
I see that I can use
`pq.ParquetDataset("parquet_dataset_partitioned/_metadata", thrift_container_size_limit=1000000000)`,
but the returned object isn't as useful as a `FileSystemDataset`. For example,
I can't filter on partitioning columns (because they are not part of the info
in the `_metadata` file).
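As a sketch of what I'm after: I would have hoped that raising the thrift
limit through the dataset API keeps the `FileSystemDataset`, something like
the snippet below. I haven't confirmed whether the limit from
`ParquetFragmentScanOptions` is consulted at all when `parquet_dataset`
deserializes the `_metadata` footer, or only later when scanning individual
fragments, so treat this as a guess rather than a known workaround.

```python
import pyarrow.dataset as ds

# Guess: raise the thrift container size limit on the format's scan options
# and pass the format to parquet_dataset. It is unclear (to me) whether this
# limit is applied to the _metadata footer itself or only to per-file
# footers during scanning.
parquet_format = ds.ParquetFileFormat(
    default_fragment_scan_options=ds.ParquetFragmentScanOptions(
        thrift_container_size_limit=1_000_000_000,
    )
)

dataset = ds.parquet_dataset(
    "parquet_dataset_partitioned/_metadata",
    format=parquet_format,
    partitioning="hive",
)
```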
For context, I am trying to minimize the latency of instantiating a
`FileSystemDataset` so that it can be used interactively (e.g. imagine a CLI
or an endpoint). I had assumed that concatenating all the individual file
metadata into a `_metadata` footer would let me bypass 1) running `stat` on a
bunch of folders during file discovery and 2) performing a bunch of smaller IO
reads for each parquet file footer.

I can pass a list of files to `ds.parquet_dataset`, but it still takes a
little over a second to instantiate; a rough timing sketch is below. I assume
(but haven't verified) that using one big `_metadata` file would be faster.
Now that I think about it, maybe that's a silly assumption. Perhaps decoding
thrift would take most of the time either way.
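Concretely, the file-list variant I'm timing looks roughly like this (sketched
with `ds.dataset`, which accepts a list of paths; the file names are made up,
and in my case the list comes from a cached listing):

```python
import time
import pyarrow.dataset as ds

# Hypothetical file list; in practice there are thousands of entries.
files = [
    "parquet_dataset_partitioned/year=2023/month=1/part-0.parquet",
    "parquet_dataset_partitioned/year=2023/month=2/part-0.parquet",
    # ...
]

start = time.perf_counter()
# partitioning="hive" parses key=value path segments into partition columns.
dataset = ds.dataset(files, format="parquet", partitioning="hive")
print(f"instantiated in {time.perf_counter() - start:.2f}s")
```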
### Component(s)
Parquet