lidavidm commented on a change in pull request #9730:
URL: https://github.com/apache/arrow/pull/9730#discussion_r600682362



##########
File path: docs/source/python/pandas.rst
##########
@@ -293,3 +293,19 @@ Used together, the call
 
 will yield significantly lower memory usage in some scenarios. Without these
 options, ``to_pandas`` will always double memory.
+
+Note that ``self_destruct=True`` is not guaranteed to save memory. Since the
+conversion happens column by column, memory is also freed column by column. But
+if multiple columns share an underlying allocation, then no memory will be
+freed until all of those columns are converted. In particular, data that comes
+from IPC or Flight is prone to this, as memory will be laid out as follows::
+
+  Record Batch 0: Allocation 0: array 0 chunk 0, array 1 chunk 0, ...
+  Record Batch 1: Allocation 1: array 0 chunk 1, array 1 chunk 1, ...
+  ...
+
+In this case, no memory can be freed until the entire table is converted, even
+with ``self_destruct=True``.
+
+Additionally, even if memory is freed by Arrow, depending on the allocator in
+use, the memory may not be returned to the operating system immediately.

Review comment:
       Maybe we could make it part of a dedicated section on (Py)Arrow 
memory/allocator behavior? For instance, it could document the different 
allocators and their tradeoffs, plus allocator-specific settings like 
JE_ARROW_MALLOC_CONF.
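As a rough illustration of what such a section could cover, the memory-pool APIs already exposed in PyArrow look like this (a hedged sketch; jemalloc/mimalloc availability depends on how PyArrow was built, and JE_ARROW_MALLOC_CONF is a jemalloc-specific knob not shown here)::

  import pyarrow as pa

  # Which allocator is backing the default memory pool (jemalloc, mimalloc,
  # or the system allocator, depending on build and platform)?
  pool = pa.default_memory_pool()
  print(pool.backend_name, pool.bytes_allocated(), pool.max_memory())

  # The global pool can be switched; jemalloc_memory_pool() and
  # mimalloc_memory_pool() raise if PyArrow was not built with that allocator.
  pa.set_memory_pool(pa.system_memory_pool())

  # jemalloc-specific: how long jemalloc waits before returning freed pages
  # to the OS (0 = return eagerly). This is one reason memory freed by Arrow
  # may not show up as freed at the OS level right away.
  try:
      pa.jemalloc_set_decay_ms(0)
  except NotImplementedError:
      pass  # not built with jemalloc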




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
