tustvold commented on issue #5770: URL: https://github.com/apache/arrow-rs/issues/5770#issuecomment-2117079521
> we use parquet files with 100 row groups and 50k columns (and this is after the dataset has been split into many individual parquet files). What is worse, our use case is reading individual row groups and only subset of columns.

How big are these files? At 1MB per page (the recommended size) this would come to 5TB?!

For extremely sparse data, such as feature stores, I can't help wondering if MapArray or something similar would be a better way to encode this. You would lose the ability to do projection pushdown in the traditional sense, but maybe this is ok?
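Not part of the original comment, but to illustrate the idea: a minimal sketch of what a MapArray-based encoding could look like in arrow-rs, assuming string feature names as keys and `f64` values (the column/builder names here are hypothetical). Each row carries only the entries that are actually present, instead of 50k mostly-null columns.

```rust
use arrow::array::{Array, Float64Builder, MapBuilder, StringBuilder};

fn main() {
    // One map column instead of tens of thousands of sparse columns:
    // each row stores only the (feature name, value) pairs it actually has.
    let mut builder = MapBuilder::new(None, StringBuilder::new(), Float64Builder::new());

    // Row 0: two features present.
    builder.keys().append_value("feature_00017");
    builder.values().append_value(1.5);
    builder.keys().append_value("feature_49001");
    builder.values().append_value(-0.25);
    builder.append(true).unwrap();

    // Row 1: no features present (empty map).
    builder.append(true).unwrap();

    let map_array = builder.finish();
    assert_eq!(map_array.len(), 2);
}
```

The trade-off is the one noted above: readers can no longer prune individual features via column projection pushdown and must instead filter map entries after decoding, which may be acceptable when the data is extremely sparse.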
