jorisvandenbossche commented on a change in pull request #9730:
URL: https://github.com/apache/arrow/pull/9730#discussion_r599578642



##########
File path: docs/source/python/pandas.rst
##########
@@ -293,3 +293,19 @@ Used together, the call
 
 will yield significantly lower memory usage in some scenarios. Without these
 options, ``to_pandas`` will always double memory.
+
+Note that ``self_destruct=True`` is not guaranteed to save memory. Since the
+conversion happens column by column, memory is also freed column by column. But
+if multiple columns share an underlying allocation, then no memory will be
+freed until all of those columns are converted. In particular, data that comes
+from IPC or Flight is prone to this, as memory will be laid out as follows::
+
+  Record Batch 0: Allocation 0: array 0 chunk 0, array 1 chunk 0, ...
+  Record Batch 1: Allocation 1: array 0 chunk 1, array 1 chunk 1, ...
+  ...
+
+In this case, no memory can be freed until the entire table is converted, even
+with ``self_destruct=True``.
+
+Additionally, even if memory is freed by Arrow, depending on the allocator in
+use, the memory may not be returned to the operating system immediately.

Review comment:
Personally, I think it's still useful to mention it here: it is indeed a
commonly misunderstood point, and users will read this section specifically
when they are trying to optimize memory use or to understand why they don't
see reduced memory usage.
(The sentence could still add "as is true for all deallocations" to make it
clear that this isn't specific to this option.)
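
To illustrate the shared-allocation behaviour described in the added docs text, here is a toy model in plain Python. It is only a sketch under stated assumptions: it does not use or reproduce Arrow's actual memory management, and the function name `freed_after_each_column`, the layouts, and the 100-byte chunk size are all hypothetical. It simply counts how many bytes could be released after each column is converted, assuming an allocation is released once nothing references it anymore.

```python
from collections import Counter


def freed_after_each_column(layout, alloc_sizes):
    """Toy model (not Arrow's implementation).

    layout[c] lists the allocation id backing each chunk of column c.
    Returns the cumulative bytes released after converting each column.
    """
    # Count how many (column, chunk) references point into each allocation.
    refs = Counter(a for col in layout for a in col)
    freed = 0
    progress = []
    for col in layout:  # conversion proceeds column by column
        for a in col:
            refs[a] -= 1
            if refs[a] == 0:  # last reference dropped: allocation is released
                freed += alloc_sizes[a]
        progress.append(freed)
    return progress


# Hypothetical layouts: 3 columns, 2 chunks each, 100 bytes per chunk.
# (a) Every chunk owns its own allocation.
own = [[0, 1], [2, 3], [4, 5]]
own_sizes = {i: 100 for i in range(6)}

# (b) IPC/Flight-like: one allocation per record batch, shared by all columns.
shared = [[0, 1], [0, 1], [0, 1]]
shared_sizes = {0: 300, 1: 300}

own_progress = freed_after_each_column(own, own_sizes)
shared_progress = freed_after_each_column(shared, shared_sizes)
print(own_progress)     # [200, 400, 600]
print(shared_progress)  # [0, 0, 600]
```

In the model, layout (a) releases memory steadily, while layout (b) releases nothing until the last column is converted, which matches the situation the docs describe for ``self_destruct=True``. Whether released memory then goes back to the operating system is a separate question that this model does not cover.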





