pitrou commented on issue #45287: URL: https://github.com/apache/arrow/issues/45287#issuecomment-2606567048
> I was expecting the metadata memory usage to be more of O(C) where C=number_columns instead of O(C * F) where C=number_columns and F=number_files? Since once a parquet file is loaded to pyarrow Table, we don't need to keep the metadata around (all files have the same scheme), but perhaps I am misunderstanding how read parquet works.

Hmm, this needs clarifying a bit then :) What do the memory usage numbers you posted represent? Is it peak memory usage? Is it memory usage after loading the dataset as an Arrow table? Is the dataset object still alive at that point?

> It would be great to reduce metadata memory usage when the files being read all have the same schema since this is a quite common case I think

Definitely.
