alamb commented on issue #5770: URL: https://github.com/apache/arrow-rs/issues/5770#issuecomment-2135574125
@XiangpengHao and @tustvold and @wiedld spoke a little about this, and here was my understanding of the plan: we will run experiments (perhaps based on @tustvold's tool from https://github.com/apache/arrow-rs/issues/5770#issuecomment-2116370344). Specifically, create these scenarios:

1. Create schemas with 1000 columns, 5k columns, 10k columns, 20k columns (maybe 100k columns)
2. Schemas of all floats (to model a machine learning use case)
3. Write a parquet file with 10M rows, with 1M-row row groups
4. Try three different writer settings (see the first sketch at the end of this comment):
   1. Default (use the Rust writer defaults)
   2. Minimum statistics (turn off all statistics, etc., to minimize the size of the metadata)
   3. Maximum statistics (turn on full statistics, including page-level statistics, don't truncate statistics lengths, etc.)

Then for each scenario, measure (see the second sketch at the end of this comment):

1. Size of the metadata footer (in bytes)
2. Time to decode (time to get the `ParquetMetaData` from the file)

This should result in a table like the following, for each of "metadata size in bytes" and "decode performance":

| Writer Properties | 1k columns | 2k columns | 5k columns | 10k columns | 20k columns |
|--------|--------|--------|--------|--------|--------|
| Default Properties | X | X | X | X | X |
| Minimum Statistics | X | X | X | X | X |
| Maximum Statistics | X | X | X | X | X |

Other potential experiments to run:

* Add string/binary columns to the schema (maybe 10% of the columns) -- I expect the metadata to be much larger
* Try other writer settings (e.g. show the effect of metadata truncation)
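Not from the original discussion, just for concreteness: a minimal sketch of how the three writer configurations might be expressed with the `parquet` crate's `WriterProperties` builder. The builder methods named here (`set_statistics_enabled`, `set_statistics_truncate_length`) are my reading of the crate's API and should be double-checked against the version used for the experiments:

```rust
use parquet::file::properties::{EnabledStatistics, WriterProperties};

/// Sketch: the three writer configurations described above.
fn writer_configs() -> Vec<(&'static str, WriterProperties)> {
    vec![
        // 1. Default: whatever the Rust writer defaults to
        ("Default Properties", WriterProperties::builder().build()),
        // 2. Minimum: disable all statistics to shrink the metadata
        (
            "Minimum Statistics",
            WriterProperties::builder()
                .set_statistics_enabled(EnabledStatistics::None)
                .build(),
        ),
        // 3. Maximum: full statistics, including page-level statistics,
        //    with no truncation of statistics values
        (
            "Maximum Statistics",
            WriterProperties::builder()
                .set_statistics_enabled(EnabledStatistics::Page)
                .set_statistics_truncate_length(None)
                .build(),
        ),
    ]
}
```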
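And a similarly hedged sketch of the measurement itself: the footer size can be read from the 4-byte length field at the end of the file, and the decode time by timing `parse_metadata`. The file path is a placeholder, and newer crate versions may prefer other entry points for reading the footer:

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
use std::time::Instant;

use parquet::file::footer::parse_metadata;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical path to one of the generated test files
    let mut file = File::open("floats_10k_cols.parquet")?;

    // Metadata size: the last 8 bytes of a parquet file are a 4-byte
    // little-endian footer length followed by the "PAR1" magic, so the
    // length field reports the size of the serialized metadata.
    file.seek(SeekFrom::End(-8))?;
    let mut trailer = [0u8; 8];
    file.read_exact(&mut trailer)?;
    let metadata_len = u32::from_le_bytes(trailer[0..4].try_into()?);
    println!("metadata footer: {metadata_len} bytes");

    // Decode time: parse the footer into a `ParquetMetaData`
    let start = Instant::now();
    let metadata = parse_metadata(&file)?;
    println!(
        "decoded metadata for {} row groups in {:?}",
        metadata.num_row_groups(),
        start.elapsed(),
    );
    Ok(())
}
```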
