jorisvandenbossche commented on issue #38389:
URL: https://github.com/apache/arrow/issues/38389#issuecomment-1777242835
Re-running the benchmarks with a slightly adapted version of the script above (single-threaded, different compressions), and making sure no other applications are running at the same time, I actually get quite decent single-threaded performance:
<details>
<summary>Code</summary>
```python
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np
import pandas as pd
import time
import io

# Create the datasets: one table written with default compression (snappy),
# with lz4, and uncompressed
x = np.random.randint(0, 100000, size=(1000000, 100))
df = pd.DataFrame(x)
t = pa.Table.from_pandas(df)
pq.write_table(t, "foo.parquet")
pq.write_table(t, "foo-lz4.parquet", compression="lz4")
pq.write_table(t, "foo-uncompressed.parquet", compression="none")


def run_benchmark(fname):
    niterations = 20

    # Time raw disk read speed
    start = time.perf_counter()
    for _ in range(niterations):
        with open(fname, mode="rb") as f:
            data = f.read()
        nbytes = len(data)
    stop = time.perf_counter()
    print("Disk Bandwidth:", int(nbytes / ((stop - start) / niterations) / 2**20), "MiB/s")

    # Time Parquet read speed from disk
    start = time.perf_counter()
    for _ in range(niterations):
        pq.read_table(fname, use_threads=False)
    stop = time.perf_counter()
    print("PyArrow Read Bandwidth:", int(nbytes / ((stop - start) / niterations) / 2**20), "MiB/s")

    # Time Parquet read speed from memory (no disk I/O)
    start = time.perf_counter()
    for _ in range(niterations):
        pq.read_table(io.BytesIO(data), use_threads=False)
    stop = time.perf_counter()
    print("PyArrow In-Memory Bandwidth:", int(nbytes / ((stop - start) / niterations) / 2**20), "MiB/s")

    # Time in-memory read plus conversion to pandas
    start = time.perf_counter()
    for _ in range(niterations):
        pq.read_table(io.BytesIO(data), use_threads=False).to_pandas(use_threads=False)
    stop = time.perf_counter()
    print("PyArrow (to_pandas) Bandwidth:", int(nbytes / ((stop - start) / niterations) / 2**20), "MiB/s")
```
</details>
```
In [3]: run_benchmark("foo.parquet")
Disk Bandwidth: 2052 MiB/s
PyArrow Read Bandwidth: 436 MiB/s
PyArrow In-Memory Bandwidth: 459 MiB/s
PyArrow (to_pandas) Bandwidth: 280 MiB/s

In [4]: run_benchmark("foo-lz4.parquet")
Disk Bandwidth: 2100 MiB/s
PyArrow Read Bandwidth: 516 MiB/s
PyArrow In-Memory Bandwidth: 569 MiB/s
PyArrow (to_pandas) Bandwidth: 323 MiB/s

In [5]: run_benchmark("foo-uncompressed.parquet")
Disk Bandwidth: 2092 MiB/s
PyArrow Read Bandwidth: 667 MiB/s
PyArrow In-Memory Bandwidth: 730 MiB/s
PyArrow (to_pandas) Bandwidth: 409 MiB/s
```
And the file sizes are 258, 255, and 293 MB, respectively. So the actual speedup for the uncompressed file is a bit lower than the bandwidth numbers above suggest, because each iteration reads more bytes. But it is still the fastest in terms of wall-clock seconds per read.
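
To make that concrete, here is a rough back-of-the-envelope conversion of the measured read bandwidths into per-read wall-clock times, using the file sizes above. Note this mixes MB file sizes with MiB/s bandwidths, so the numbers are only approximate, and the "snappy" label assumes `foo.parquet` uses PyArrow's default compression:

```python
# Approximate seconds per pq.read_table call, derived from the numbers above:
# time per read = file size / measured read bandwidth.
# Sizes are in MB and bandwidths in MiB/s, so treat these as rough estimates.
sizes_mb = {"snappy": 258, "lz4": 255, "uncompressed": 293}
read_bandwidth_mibs = {"snappy": 436, "lz4": 516, "uncompressed": 667}

for codec, size in sizes_mb.items():
    seconds = size / read_bandwidth_mibs[codec]
    print(f"{codec}: ~{seconds:.2f} s per read")

# snappy: ~0.59 s per read
# lz4: ~0.49 s per read
# uncompressed: ~0.44 s per read
```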