alamb commented on issue #5770: URL: https://github.com/apache/arrow-rs/issues/5770#issuecomment-2135574125
@XiangpengHao and @tustvold and @wiedld spoke a little about this, and here was my understanding of the plan: we will run experiments (perhaps based on @tustvold's tool from https://github.com/apache/arrow-rs/issues/5770#issuecomment-2116370344). Specifically, create these scenarios:

1. Create schemas with 1000 columns, 5k columns, 10k columns, 20k columns (maybe 100k columns)
2. Schemas of all floats (to model a machine learning use case)
3. Write a parquet file with 10M rows, with 1M-row row groups
4. Try three different writer settings (see the first sketch at the end of this comment):
   1. Default (use the Rust writer defaults)
   2. Minimum statistics (turn off all statistics, etc., to minimize the size of the metadata)
   3. Maximum statistics (turn on full statistics, including page-level statistics, don't truncate statistics lengths, etc.)

Then for each scenario, measure (see the second sketch at the end of this comment):

1. Size of the metadata footer (in bytes)
2. Time to decode (time to get the `ParquetMetaData` from the file)

This should result in a table like the following, for each of "metadata size in bytes" and "decode performance":

| Writer Properties | 1k columns | 2k columns | 5k columns | 10k columns | 20k columns |
|--------|--------|--------|--------|--------|--------|
| Default Properties | X | X | X | X | X |
| Minimum Statistics | X | X | X | X | X |
| Maximum Statistics | X | X | X | X | X |

Other potential experiments to run:

* Add string/binary columns to the schema (maybe 10% of the columns) -- I expect the metadata to be much larger
* Try other writer settings (e.g. show the effect of metadata truncation)
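Not from the original discussion, just for concreteness: a minimal sketch of how the three writer configurations might be expressed with the `parquet` crate's `WriterProperties` builder. The builder methods named here (`set_statistics_enabled`, `set_statistics_truncate_length`) are my reading of the crate's API and should be double-checked against the version used for the experiments:

```rust
use parquet::file::properties::{EnabledStatistics, WriterProperties};

/// Sketch: the three writer configurations described above.
fn writer_configs() -> Vec<(&'static str, WriterProperties)> {
    vec![
        // 1. Default: whatever the Rust writer defaults to
        ("Default Properties", WriterProperties::builder().build()),
        // 2. Minimum: disable all statistics to shrink the metadata
        (
            "Minimum Statistics",
            WriterProperties::builder()
                .set_statistics_enabled(EnabledStatistics::None)
                .build(),
        ),
        // 3. Maximum: full statistics, including page-level statistics,
        //    with no truncation of statistics values
        (
            "Maximum Statistics",
            WriterProperties::builder()
                .set_statistics_enabled(EnabledStatistics::Page)
                .set_statistics_truncate_length(None)
                .build(),
        ),
    ]
}
```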
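And a similarly hedged sketch of the measurement itself: the footer size can be read from the 4-byte length field at the end of the file, and the decode time by timing `parse_metadata`. The file path is a placeholder, and newer crate versions may prefer other entry points for reading the footer:

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
use std::time::Instant;

use parquet::file::footer::parse_metadata;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical path to one of the generated test files
    let mut file = File::open("floats_10k_cols.parquet")?;

    // Metadata size: the last 8 bytes of a parquet file are a 4-byte
    // little-endian footer length followed by the "PAR1" magic, so the
    // length field reports the size of the serialized metadata.
    file.seek(SeekFrom::End(-8))?;
    let mut trailer = [0u8; 8];
    file.read_exact(&mut trailer)?;
    let metadata_len = u32::from_le_bytes(trailer[0..4].try_into()?);
    println!("metadata footer: {metadata_len} bytes");

    // Decode time: parse the footer into a `ParquetMetaData`
    let start = Instant::now();
    let metadata = parse_metadata(&file)?;
    println!(
        "decoded metadata for {} row groups in {:?}",
        metadata.num_row_groups(),
        start.elapsed(),
    );
    Ok(())
}
```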
