fjetter commented on issue #38389: URL: https://github.com/apache/arrow/issues/38389#issuecomment-1774845268
Ok, fun experiment. I wrapped the above script in a function `run_benchmark` and ran this on my machine...

Looks like the simple fact that we're running this in the dask environment is slowing us down quite a bit. This also biases most/all Coiled-based cloud benchmarks.

<details>
<Summary>Code</Summary>

```python
# Create dataset
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np
import pandas as pd
import time
import io


def run_benchmark():
    # Use distributed's print so output from a worker is forwarded to the client
    from distributed.worker import print

    x = np.random.randint(0, 100000, size=(1000000, 100))
    df = pd.DataFrame(x)
    t = pa.Table.from_pandas(df)
    niterations = 20

    # Write to local parquet file
    pq.write_table(t, "foo.parquet")

    # Time Disk speeds
    start = time.perf_counter()
    for _ in range(niterations):
        with open("foo.parquet", mode="rb") as f:
            bytes = f.read()
            nbytes = len(bytes)
    stop = time.perf_counter()
    print("Disk Bandwidth:", int(nbytes / ((stop - start) / niterations) / 2**20), "MiB/s")

    # Time Arrow Parquet Speeds
    start = time.perf_counter()
    for _ in range(niterations):
        pq.read_table("foo.parquet")
    stop = time.perf_counter()
    print("PyArrow Read Bandwidth:", int(nbytes / ((stop - start) / niterations) / 2**20), "MiB/s")

    # Time In-Memory Read Speeds
    start = time.perf_counter()
    for _ in range(niterations):
        pq.read_table(io.BytesIO(bytes))
    stop = time.perf_counter()
    print("PyArrow In-Memory Bandwidth:", int(nbytes / ((stop - start) / niterations) / 2**20), "MiB/s")

    # Time In-Memory Read Speeds, including conversion to pandas
    start = time.perf_counter()
    for _ in range(niterations):
        pq.read_table(io.BytesIO(bytes)).to_pandas()
    stop = time.perf_counter()
    print("PyArrow (to_pandas) Bandwidth:", int(nbytes / ((stop - start) / niterations) / 2**20), "MiB/s")


# Run once in the local process ...
run_benchmark()

# ... and once on a dask worker
from distributed import Client

client = Client()
client.submit(run_benchmark).result()
```

</details>
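Every figure printed by the script comes from the same calculation: bytes read, divided by the mean wall-clock time per iteration, scaled to MiB. As a minimal sketch, that pattern could be factored into one helper so the local run and the worker run are guaranteed to measure the same thing. The name `measure_bandwidth` and its parameters are hypothetical and not part of the original script; the demo uses only the standard library, with an in-memory copy standing in for the Parquet reads.

```python
import io
import time


def measure_bandwidth(fn, nbytes, niterations=20):
    """Call fn() niterations times and return the mean throughput in MiB/s.

    Hypothetical helper (not in the original script). It reproduces the
    formula used there: nbytes / ((stop - start) / niterations) / 2**20.
    """
    start = time.perf_counter()
    for _ in range(niterations):
        fn()
    stop = time.perf_counter()
    seconds_per_iteration = (stop - start) / niterations
    return nbytes / seconds_per_iteration / 2**20


# Stand-in workload: copy an 8 MiB buffer out of a BytesIO object.
payload = bytes(8 * 2**20)
mib_per_s = measure_bandwidth(
    lambda: io.BytesIO(payload).read(), len(payload), niterations=5
)
print("In-memory copy bandwidth:", int(mib_per_s), "MiB/s")
```

With pyarrow installed, the same helper would presumably be called as `measure_bandwidth(lambda: pq.read_table("foo.parquet"), nbytes)`, once in the local process and once inside a function submitted to the dask worker, so that any difference between the two numbers comes from the environment rather than from the timing code.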
