jorisvandenbossche commented on issue #13320: URL: https://github.com/apache/arrow/issues/13320#issuecomment-1352639266
I don't think this is using "too much" memory; this is expected for this kind of data at the moment.

A quick rough calculation: you mention the data has 4.2 million rows and 50 columns. For the int64 and float64 columns, that is 8 bytes per element. But most of your columns seem to be list arrays. Those get converted to a column of numpy arrays in pandas, and even an empty numpy array already consumes 104 bytes (`sys.getsizeof(np.array([]))`). Assuming a dataframe of this size with just empty arrays in it:

```
>>> (4200000 * 50 * 104) / 1024**3
20.34008502960205
```

that already gives 20 GB. Assuming there is some actual data in those columns, it will be more, and thus in the order of the 28 GB you mention.

So while this is currently "expected", it is of course not efficient. That is because pandas does not have a built-in list data type (like Arrow has), and so the conversion from pyarrow -> pandas has to do something with those lists. Currently pyarrow converts each list column to an array of (numpy) arrays, which is unfortunately not memory-efficient when you have many tiny arrays.
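To make the estimate above concrete, here is a minimal sketch (the column name and sizes are illustrative, not from the original report) that reproduces the per-array overhead and shows that each list cell becomes its own numpy array after conversion:

```python
import sys
import numpy as np
import pyarrow as pa

# Even an empty numpy array carries ~100 bytes of object overhead on 64-bit CPython
per_array_overhead = sys.getsizeof(np.array([]))
print(per_array_overhead)  # typically 104

# A tiny list<int64> column: to_pandas() turns every row into a separate numpy array
table = pa.table({"values": pa.array([[1, 2], [], [3]], type=pa.list_(pa.int64()))})
df = table.to_pandas()
print(type(df["values"][0]))  # <class 'numpy.ndarray'>

# Back-of-envelope estimate for 4.2M rows x 50 list columns of (near-)empty arrays
rows, cols = 4_200_000, 50
print(rows * cols * per_array_overhead / 1024**3, "GiB")  # ~20 GiB before any actual data
```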
