marcin-krystianc commented on issue #5770: URL: https://github.com/apache/arrow-rs/issues/5770#issuecomment-2117032859
> 10K columns by 10 row groups by 1M rows is 100B values (400GB with int32). I don't think anyone has data like that (this is presumptuous, I am probably wrong).
>
> My experience has been either:
>
> * The files they make are much smaller (and thus not enough or undersized row groups), e.g. financial data where the ticker is a column.
> * The columns are very sparse (and thus a need for better sparse encoding), e.g. feature stores.

Hi, we use parquet files with 100 row groups and 50k columns (and this is after the dataset has been split into many individual parquet files). What is worse, our use case is reading individual row groups and only a subset of columns. That makes the cost of reading the entire metadata footer even higher than the cost of reading the actual data (because we read the entire footer but then only a tiny subset of the actual data). To deal with the problem we've implemented a tool that stores index information in a separate file, which allows reading only the necessary subset of the metadata (https://github.com/G-Research/PalletJack).
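For context, the access pattern is roughly what the following sketch does with the `parquet` crate (the file name `wide.parquet` and the column indices are placeholders, not our actual schema): even though only one row group and three columns are requested, `ParquetRecordBatchReaderBuilder::try_new` still parses the footer metadata for every column chunk in the file, which is where the cost we describe comes from.

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Opening the builder decodes the entire Thrift footer up front;
    // with ~50k columns and 100 row groups this dominates the read.
    let file = File::open("wide.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Project only a handful of leaf columns (indices are illustrative).
    let mask = ProjectionMask::leaves(builder.parquet_schema(), [0, 1, 2]);

    // Read a single row group with that projection.
    let reader = builder
        .with_projection(mask)
        .with_row_groups(vec![0])
        .build()?;

    for batch in reader {
        let batch = batch?;
        println!("rows read: {}", batch.num_rows());
    }
    Ok(())
}
```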
