icexelloss commented on issue #45287: URL: https://github.com/apache/arrow/issues/45287#issuecomment-2605797576
> A quick back of the envelope calculation says that this is roughly 2 kB per column per file.

I was expecting the metadata memory usage to be more like O(C), where C = number of columns, rather than O(C * F), where F = number of files. Once a Parquet file is loaded into a pyarrow Table, we shouldn't need to keep the metadata around (all files have the same schema), but perhaps I am misunderstanding how reading Parquet works.

> Interesting data point. That would be 4 kB per column per file, so quite a bit of additional overhead just for 128 additional characters...

Yeah, it certainly feels like there are multiple copies of the column-name strings, even though every file/partition has the same schema.

> I would stress that "a single row and 10 kB columns" is never going to be a good use case for Parquet

Yeah, this is an extreme case just to show the repro. In practice the files have a couple thousand rows each.
