aoli-al opened a new issue, #49612:
URL: https://github.com/apache/arrow/issues/49612
### Describe the bug, including details regarding any error messages, version, and platform.
`pd.options.mode.string_storage = "pyarrow"` causes a large slowdown when
repeatedly growing a string-typed `DataFrame` with `loc` row assignment.
The performance issue largely goes away if I switch to:
```python
pd.options.mode.string_storage = "python"
```
## Versions
```text
pandas=3.0.1
pyarrow=23.0.1
python=3.12
platform=Linux
```
## Minimal reproducer
```python
import time

import pandas as pd
import pyarrow as pa


def bench(storage: str, rows: int = 1000, cols: int = 20) -> float:
    # Select the string storage backend before any frames are built.
    pd.options.mode.string_storage = storage
    source = pd.DataFrame(
        [[f"v{j % 10}" for j in range(cols)] for _ in range(rows)]
    ).astype(str)
    out = pd.DataFrame(columns=source.columns).astype(str)

    # Time only the row-by-row growth of `out` through `loc` assignment.
    start = time.perf_counter()
    for i, row in enumerate(source.itertuples(index=False)):
        out.loc[i] = row
    return time.perf_counter() - start


print(f"pandas={pd.__version__} pyarrow={pa.__version__}")
for storage in ("python", "pyarrow"):
    elapsed = bench(storage)
    print(storage, elapsed)
```
## Output on my machine
```text
pandas=3.0.1 pyarrow=23.0.1 rows=1000 cols=20
storage=python array=StringArray seconds=0.420
storage=pyarrow array=ArrowStringArray seconds=3.316
slowdown(pyarrow/python)=7.89x
```
I also see the same pattern at other sizes (rows x cols), for example:
```text
500x10: python=0.147s pyarrow=0.508s
500x20: python=0.200s pyarrow=0.930s
1000x10: python=0.292s pyarrow=1.759s
1000x20: python=0.411s pyarrow=3.358s
1500x20: python=0.624s pyarrow=7.174s
```
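
For reference, the slowdown ratios implied by these timings can be computed with a short script. This is only a sketch: the numbers are copied by hand from the table above, and the helper name `slowdown` is illustrative rather than part of pandas or pyarrow.

```python
# Timings copied from the table above: (rows, cols) -> (python_s, pyarrow_s).
timings = {
    (500, 10): (0.147, 0.508),
    (500, 20): (0.200, 0.930),
    (1000, 10): (0.292, 1.759),
    (1000, 20): (0.411, 3.358),
    (1500, 20): (0.624, 7.174),
}


def slowdown(python_s: float, pyarrow_s: float) -> float:
    """Return how many times slower the pyarrow run was (hypothetical helper)."""
    return pyarrow_s / python_s


ratios = {shape: slowdown(*pair) for shape, pair in timings.items()}

for (rows, cols), ratio in ratios.items():
    print(f"{rows}x{cols}: slowdown(pyarrow/python)={ratio:.2f}x")
```

On these numbers the ratio is not constant: it rises from roughly 3.5x at 500x10 to roughly 11.5x at 1500x20, so the gap widens as the frame grows.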
### Component(s)
Python