fjetter commented on issue #38389:
URL: https://github.com/apache/arrow/issues/38389#issuecomment-1774813362
FWIW, I slightly modified the above script to run each operation N times,
since I noticed quite a bit of variance on my machine (2020 M1 MacBook).
<details>
```python
# Create dataset
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np
import pandas as pd
import time
import io
x = np.random.randint(0, 100000, size=(1000000, 100))
df = pd.DataFrame(x)
t = pa.Table.from_pandas(df)
niterations = 20
# Write to local parquet file
pq.write_table(t, "foo.parquet")
# Time Disk speeds
start = time.perf_counter()
for _ in range(niterations):
    with open("foo.parquet", mode="rb") as f:
        bytes = f.read()
        nbytes = len(bytes)
stop = time.perf_counter()
print("Disk Bandwidth:", int(nbytes / ((stop - start) / niterations) / 2**20), "MiB/s")
# Time Arrow Parquet Speeds
start = time.perf_counter()
for _ in range(niterations):
    _ = pq.read_table("foo.parquet")
stop = time.perf_counter()
print("PyArrow Read Bandwidth:", int(nbytes / ((stop - start) / niterations) / 2**20), "MiB/s")
# Time In-Memory Read Speeds
start = time.perf_counter()
for _ in range(niterations):
    pq.read_table(io.BytesIO(bytes))
stop = time.perf_counter()
print("PyArrow In-Memory Bandwidth:", int(nbytes / ((stop - start) / niterations) / 2**20), "MiB/s")
# Time In-Memory Read + to_pandas Speeds
start = time.perf_counter()
for _ in range(niterations):
    pq.read_table(io.BytesIO(bytes)).to_pandas()
stop = time.perf_counter()
print("PyArrow (to_pandas) Bandwidth:", int(nbytes / ((stop - start) / niterations) / 2**20), "MiB/s")
```
</details>
and I get
```
Disk Bandwidth: 5154 MiB/s
PyArrow Read Bandwidth: 2294 MiB/s
PyArrow In-Memory Bandwidth: 2439 MiB/s
PyArrow (to_pandas) Bandwidth: 1142 MiB/s
```