jorisvandenbossche commented on issue #13320:
URL: https://github.com/apache/arrow/issues/13320#issuecomment-1352639266

   I don't think this is a case of using "too much" memory; rather, this is expected behaviour for this kind of data at the moment.
   
   A quick rough calculation: you mention the data has 4.2 million rows and 50 columns. For the int64 and float64 columns, that is 8 bytes per element. But most of your columns seem to be list arrays. Those get converted to a column of numpy arrays in pandas, and even an empty numpy array already consumes 104 bytes (`sys.getsizeof(np.array([]))`), so this is unfortunately not very efficient.
   Assuming a dataframe of this size filled with just empty arrays:
   
   ```
   >>> (4200000 * 50 * 104) / 1024**3
   20.34008502960205
   ```
   
   already gives 20GB. Since there is actually some data in those columns, the total will be higher, which gets you into the range of the 28GB you mention.
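   
   For a rough empirical check of that per-array overhead, here is a minimal sketch (the exact byte counts can vary slightly with the platform and numpy version):
   
   ```
   import sys
   import numpy as np
   import pandas as pd
   
   # an object-dtype column holding many tiny numpy arrays,
   # similar to what the list-array -> pandas conversion produces
   s = pd.Series([np.array([], dtype=np.int64) for _ in range(1_000)], dtype=object)
   
   print(sys.getsizeof(np.array([])))   # ~104 bytes for a single empty array
   print(s.memory_usage(deep=True))     # ~8 bytes per object pointer + ~104 bytes per array
   ```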
   
   So while this is currently "expected", it's of course not efficient. That is because pandas does not have a built-in list data type (as Arrow does), so the conversion from pyarrow -> pandas has to represent those lists somehow. Currently pyarrow converts such a column to an array of numpy arrays (see the sketch below), which is unfortunately inefficient when you have many tiny arrays.
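   
   To make that concrete, a minimal sketch of what the conversion does for a list column (toy data, just for illustration):
   
   ```
   import sys
   import pyarrow as pa
   
   # a list<int64> column; pyarrow stores all the values contiguously in one buffer
   arr = pa.array([[1, 2], [], [3]], type=pa.list_(pa.int64()))
   
   # converting to pandas materializes a separate numpy array per row
   series = arr.to_pandas()
   print(type(series.iloc[0]))           # <class 'numpy.ndarray'>
   print(sys.getsizeof(series.iloc[1]))  # ~104 bytes even for the empty row
   ```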

