lidavidm opened a new pull request #29818:
URL: https://github.com/apache/spark/pull/29818
This is a proof-of-concept that I'd like to put up for discussion. I
originally put this on the [mailing
list](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Reducing-memory-usage-of-toPandas-with-Arrow-quot-self-destruct-quot-option-td30149.html).
### What changes were proposed in this pull request?
Creating a Pandas dataframe via Apache Arrow can currently use twice as much
memory as the final result, because both Pandas and Arrow retain a copy of
the data during the conversion. Arrow now has a "self-destruct" mode (Arrow >=
0.16) that avoids this by freeing each column as soon as it has been
converted. This PR integrates support for this in toPandas and handles a
couple of edge cases:
self_destruct has no effect unless the memory is allocated appropriately,
which is handled in the Arrow serializer here. Essentially, the issue is that
self_destruct frees memory column-wise, but Arrow record batches are allocated
row-wise:
```
Record batch 0: allocation 0: column 0 chunk 0, column 1 chunk 0, ...
Record batch 1: allocation 1: column 0 chunk 1, column 1 chunk 1, ...
```
In this scenario, Arrow will drop its references to all of column 0's chunks,
but no memory will actually be freed, because those chunks are just slices of
a still-live underlying allocation. The PR therefore copies each column into
its own allocation, so that memory is instead arranged like so:
```
Record batch 0: allocation 0 column 0 chunk 0, allocation 1 column 1 chunk
0, ...
Record batch 1: allocation 2 column 0 chunk 1, allocation 3 column 1 chunk
1, ...
```
This also adds a user-facing option to disable the optimization. I'd
appreciate feedback on the API here (e.g. should we instead have a global
option?). We can't apply this optimization unconditionally: it is more likely
to produce a dataframe with immutable buffers, which Pandas doesn't always
handle well, and it is slower overall (since it converts one column at a
time instead of converting columns in parallel).
### Why are the changes needed?
This lets us load larger datasets. In particular, with N bytes of memory, we
previously could never load a dataset bigger than N/2 bytes; with this change
the limit is closer to N/1.25 bytes or so.
### Does this PR introduce _any_ user-facing change?
Yes - it adds a new option, for which I haven't added docs yet.
### How was this patch tested?
See the [mailing
list](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Reducing-memory-usage-of-toPandas-with-Arrow-quot-self-destruct-quot-option-td30149.html)
- the change was tested with the Python memory_profiler package. I will also
add unit tests to verify correctness, but testing memory usage itself is
unreliable.