wjones127 commented on PR #15210:
URL: https://github.com/apache/arrow/pull/15210#issuecomment-1372870801
I re-ran the original reproduction, and it seems memory usage no longer grows
quadratically with the number of rows:
| Num rows | Memory usage in bytes (10.0.1) | Memory usage in bytes (after) |
| ---: | ---: | ---: |
| 256k | 2,153,767,662 | 1,102,736,461 |
| 512k | 8,496,047,798 | 2,185,596,364 |
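
As a quick sanity check on the numbers above, here is a minimal sketch of the arithmetic: if doubling the row count roughly quadruples memory, growth is about quadratic; if it roughly doubles, growth is about linear. The dictionary and function names are just illustrative, and two data points only suggest a trend.

```python
import math

# Traced memory in bytes, copied from the table above.
before = {"256k": 2_153_767_662, "512k": 8_496_047_798}  # 10.0.1
after = {"256k": 1_102_736_461, "512k": 2_185_596_364}   # with this PR

def growth_ratio(usage):
    """Factor by which memory grows when the row count doubles."""
    return usage["512k"] / usage["256k"]

ratio_before = growth_ratio(before)
ratio_after = growth_ratio(after)

# log2 of the ratio estimates the exponent k in memory ~ rows**k.
print(f"10.0.1: x{ratio_before:.2f}, exponent ~{math.log2(ratio_before):.2f}")
print(f"after:  x{ratio_after:.2f}, exponent ~{math.log2(ratio_after):.2f}")
# 10.0.1: x3.94, exponent ~1.98
# after:  x1.98, exponent ~0.99
```
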
<details>
<summary>Code for test</summary>
Write test file:
```python
import random
import string

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

_characters = string.ascii_uppercase + string.digits + string.punctuation


def make_random_string(N=10):
    return ''.join(random.choice(_characters) for _ in range(N))


nrows = 256_000
filename = 'nested_pandas.parquet'
arr_len = 10

# Each row holds an array of `arr_len` structs with fields a, b and c.
nested_col = []
for i in range(nrows):
    nested_col.append(np.array(
        [{
            # Note: the comprehension's `i` shadows the outer loop's `i`.
            'a': None if i % 1000 == 0 else np.random.choice(10000, size=3).astype(np.int64),
            'b': None if i % 100 == 0 else random.choice(range(100)),
            'c': None if i % 10 == 0 else make_random_string(5)
        } for i in range(arr_len)]
    ))
table = pa.table({'c1': nested_col})

# table = pa.table({
#     'c1': pa.array([list(range(random.randint(1, 20))) for _ in range(nrows)])
# })

# Writing to .parquet and loading it into arrow again
pq.write_table(table, filename)
```
Then measure:
```python
import tracemalloc
import pyarrow.parquet as pq
filename = '/Users/willjones/Documents/arrows/arrow/python/nested_pandas.parquet'
tracemalloc.start()
table_from_parquet = pq.read_table(filename)
out = table_from_parquet.to_pandas()
print(tracemalloc.get_traced_memory())
```
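To repeat the measurement across several row counts, the snippet above could be wrapped in a small helper. This is only a sketch: `traced_peak_bytes` is a hypothetical name, and the list-building lambda is a stand-in workload so the example runs without a Parquet file. Note that `tracemalloc` only reports memory allocated through Python's allocators, and that the helper returns the peak value rather than printing the `(current, peak)` tuple.

```python
import tracemalloc


def traced_peak_bytes(fn, *args, **kwargs):
    """Run fn and return (result, peak bytes traced by tracemalloc)."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        # Stop tracing so repeated calls start from a clean slate.
        tracemalloc.stop()
    return result, peak


# Stand-in workload: allocate 1000 byte strings of 1000 bytes each.
data, peak = traced_peak_bytes(lambda n: [bytes(1000) for _ in range(n)], 1000)
print(peak)
```

For the reproduction itself, the workload would instead be a function that calls `pq.read_table(filename).to_pandas()`.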
</details>