thinkharderdev commented on issue #5770: URL: https://github.com/apache/arrow-rs/issues/5770#issuecomment-2117816010
> > the files sizes about 1TB
>
> Err... Is this a reasonable size for a single parquet file? I'm more accustomed to seeing parquet files on the order of 100MB to single digit GB, with a separate catalog combining multiple files together for query

1TB does seem quite large, but using smaller files requires reading more metadata (with the associated IO). In our system we write files in the range of 50-300MB, and reading metadata was so expensive on large queries (we measured it at ~30% of total query processing time in some cases) that we built an entire separate system, similar to https://github.com/G-Research/PalletJack, to deal with it.
