thinkharderdev commented on issue #5770: URL: https://github.com/apache/arrow-rs/issues/5770#issuecomment-2117816010
> > the files sizes about 1TB
>
> Err... Is this a reasonable size for a single parquet file? I'm more accustomed to seeing parquet files on the order of 100MB to single digit GB, with a separate catalog combining multiple files together for query

1TB does seem quite large, but using smaller files requires reading more metadata (with the associated IO). In our system we write files in the range of 50-300MB, and reading metadata was so expensive on large queries (we measured it at ~30% of total query processing time in some cases) that we built an entire separate system, similar to https://github.com/G-Research/PalletJack, to deal with it.
