marcin-krystianc commented on issue #5770:
URL: https://github.com/apache/arrow-rs/issues/5770#issuecomment-2119993882

   > > the files sizes about 1TB
   > 
   > Err... Is this a reasonable size for a single parquet file? I'm more accustomed to seeing parquet files on the order of 100MB to single digit GB, with a separate catalog combining multiple files together for query
   
   If the entire dataset is tens of TBs, then you need to use larger files, or you end up with a very large number of files (many thousands), which is hard to manage. Also, thinking about the future of ML, it seems clear to me that a requirement for even larger datasets is not unrealistic.
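   To illustrate with assumed numbers (a hypothetical 20 TB dataset, not figures from this issue): at a typical 128 MB per file that is 20 TB / 128 MB ≈ 160,000 files, whereas at 1 TB per file it is only 20 files.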
    

