[GitHub] [arrow] lidavidm commented on a change in pull request #9730: ARROW-9878: [Python] Document caveats of to_pandas(self_destruct=True)

GitBox Wed, 17 Mar 2021 07:42:28 -0700


lidavidm commented on a change in pull request #9730:
URL: https://github.com/apache/arrow/pull/9730#discussion_r596083089




##########
File path: docs/source/python/pandas.rst
##########
@@ -293,3 +293,19 @@ Used together, the call
 
 will yield significantly lower memory usage in some scenarios. Without these
 options, ``to_pandas`` will always double memory.
+
+Note that ``self_destruct=True`` is not guaranteed to save memory. Since the
+conversion happens column by column, memory is also freed column by column. But
+if multiple columns share an underlying allocation, then no memory will be

Review comment:
       Yes, multiple arrays may share the same buffer. In IPC, for instance, we
read a record batch's worth of data from the file at a time, so all arrays in
that batch share the same buffer. In Flight, similarly, we receive a record
batch's worth of data from gRPC and (for implementation reasons) concatenate it
into a single buffer, so we end up in the same situation. (I don't think this
applies to, say, Parquet or CSV, since those formats involve actual decoding,
but I haven't tested it.)
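
To illustrate why freeing column by column does not help in that situation, here is a minimal pure-Python model. It is only a sketch and does not use Arrow: `Allocation`, `Column`, `read_batch` and `drop_columns_one_by_one` are hypothetical names invented for this example, and it relies on CPython's reference counting to stand in for buffer lifetime management.

```python
import gc
import weakref


class Allocation:
    """Stand-in for one contiguous memory allocation (hypothetical, not Arrow)."""

    def __init__(self, size):
        self.data = bytearray(size)


class Column:
    """Stand-in for an array that views a slice of a shared allocation."""

    def __init__(self, allocation, offset, length):
        # The strong reference keeps the whole allocation alive,
        # even though this column only uses one slice of it.
        self.allocation = allocation
        self.view = memoryview(allocation.data)[offset:offset + length]


def read_batch(num_columns, bytes_per_column):
    """Model an IPC-style read: one allocation per batch, sliced into columns."""
    allocation = Allocation(num_columns * bytes_per_column)
    columns = [
        Column(allocation, i * bytes_per_column, bytes_per_column)
        for i in range(num_columns)
    ]
    # Return only a weak reference so the columns are the sole owners.
    return columns, weakref.ref(allocation)


def drop_columns_one_by_one(columns, allocation_ref):
    """Release columns in order and record whether the allocation survives."""
    alive_after_each = []
    while columns:
        columns.pop(0)  # drop one column, as a column-by-column conversion would
        gc.collect()
        alive_after_each.append(allocation_ref() is not None)
    return alive_after_each


columns, allocation_ref = read_batch(num_columns=3, bytes_per_column=1024)
history = drop_columns_one_by_one(columns, allocation_ref)
print(history)
```

In this model the shared allocation stays alive until the last column referencing it is dropped, so peak memory is not reduced while earlier columns are being released.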




