Pear0 opened a new pull request, #47860:
URL: https://github.com/apache/arrow/pull/47860
### Rationale for this change
While Arrow is designed for tall tables, I sometimes end up with a very wide
table, and the `to_pandas()` code seems to use O(N^2) memory, where N is the
number of extension array columns.
For example, the following snippet peaks at about 7GB of memory on my
machine:
```python
import pyarrow as pa
import pandas as pd
t = pa.table({f'col_{i}': pa.array([], type=pa.int64()) for i in range(10000)})
# with just this line, the process uses 7GB of memory at peak
t.to_pandas(types_mapper={pa.int64(): pd.ArrowDtype(pa.int64())}.get)
# with just this line, the process uses 118MB of memory at peak
t.to_pandas()
```
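One way to reproduce peak-memory numbers like these is to run each variant in a fresh interpreter and read that process's own peak resident set size. The helper below is only a sketch of that approach, not how the figures above were obtained; `peak_rss_kib` is a hypothetical name, and it assumes a Unix platform where the standard `resource` module is available.

```python
import subprocess
import sys


def peak_rss_kib(snippet: str) -> int:
    """Run `snippet` in a fresh interpreter and return its peak RSS in KiB.

    Hypothetical helper for illustration; Unix only (uses `resource`).
    """
    # Appended to the snippet so the child reports its own high-water mark.
    footer = (
        "\nimport resource, sys\n"
        "peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss\n"
        "# ru_maxrss is reported in KiB on Linux but in bytes on macOS\n"
        "print(peak // 1024 if sys.platform == 'darwin' else peak)\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", snippet + footer],
        capture_output=True,
        text=True,
        check=True,
    )
    # The last line of stdout is the number printed by the footer.
    return int(result.stdout.strip().splitlines()[-1])
```

Passing the repro above as a string, once with only the `types_mapper` call and once with only the plain `to_pandas()` call, gives two independent peaks to compare, since each run starts from a clean process.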
With this change, the extension array variant takes ~192MB of memory.
From what I can tell, this is because the `PandasOptions` struct is copied
around frequently. For example, there seems to be an `ExtensionWriter` for
each extension column, and each `ExtensionWriter` holds a copy of
`PandasOptions`, which contains a set of all extension columns. I haven't
fully traced the `PandasOptions` structure, but it seems to get copied and
modified in some code paths, so I decided to put the column sets into a
`std::shared_ptr` rather than pass around a `shared_ptr<PandasOptions>`.
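The quadratic growth described above can be illustrated with a toy model in Python. This is not the Arrow C++ code: `OptionsByValue`, `OptionsShared`, `build_writers`, and `total_entries` are hypothetical stand-ins, and the model counts only set entries (in C++ each copied set would also copy its strings).

```python
import copy
from dataclasses import dataclass, field


@dataclass
class OptionsByValue:
    """Hypothetical options struct that owns its own column set."""
    extension_columns: set = field(default_factory=set)


@dataclass
class OptionsShared:
    """Hypothetical options struct that points at a shared, read-only set."""
    extension_columns: frozenset = frozenset()


def build_writers(options, n_columns, copier):
    # One "writer" per extension column, each holding its own options object.
    return [copier(options) for _ in range(n_columns)]


def total_entries(writers):
    # Sum the sizes of the distinct set objects reachable from the writers.
    distinct = {id(w.extension_columns): w.extension_columns for w in writers}
    return sum(len(s) for s in distinct.values())


n = 300
names = {f"col_{i}" for i in range(n)}

# Deep copy models copying a struct that contains the set by value.
by_value = build_writers(OptionsByValue(set(names)), n, copy.deepcopy)
# Shallow copy models copying a struct that only holds a shared pointer.
shared = build_writers(OptionsShared(frozenset(names)), n, copy.copy)

entries_by_value = total_entries(by_value)  # n sets of n names each
entries_shared = total_entries(shared)      # one set of n names, referenced n times
print(entries_by_value, entries_shared)     # prints "90000 300"
```

In the by-value model the entry count grows as N^2, while in the shared model every copy of the options object refers to the same set, so the count stays at N.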
### What changes are included in this PR?
The `PandasOptions` column sets have been swapped from
`std::unordered_set<std::string>` to `std::shared_ptr<const
std::unordered_set<std::string>>` and usages have been updated.
### Are these changes tested?
Yes. Also tested memory usage by hand.
### Are there any user-facing changes?
- `PandasOptions`: the types of `categorical_columns` and `extension_columns`
have changed. Helper accessor functions have been added.
**This PR contains a "Critical Fix".**
I suppose it is technically a critical fix because you can trivially OOM
your process using the repro above and a larger number of columns.