fjetter commented on issue #38389:
URL: https://github.com/apache/arrow/issues/38389#issuecomment-1774845268

   Ok, fun experiment. I wrapped the above script in a function `run_benchmark` 
and ran this on my machine...
   
   
![image](https://github.com/apache/arrow/assets/8629629/a69c8e8b-d005-4b94-a224-75f01200cf4a)
   
   
   Looks like the simple fact that we're running this in the dask environment 
is slowing us down quite a bit. This also biases most/all Coiled-based cloud 
benchmarks
   
   <details>
   <Summary>Code</Summary>
   
   ```python
   # Create dataset
   import pyarrow as pa
   import pyarrow.parquet as pq
   import numpy as np
   import pandas as pd
   import time
   import io
   def run_benchmark():
       from distributed.worker import print
       x = np.random.randint(0, 100000, size=(1000000, 100))
       df = pd.DataFrame(x)
       t = pa.Table.from_pandas(df)
       niterations = 20
       
       # Write to local parquet file
       
       pq.write_table(t, "foo.parquet")
       
       
       # Time Disk speeds
       
       start = time.perf_counter()
       for _ in range(niterations):
           with open("foo.parquet", mode="rb") as f:
               bytes = f.read()
               nbytes = len(bytes)
       stop = time.perf_counter()
       print("Disk Bandwidth:", int(nbytes / ((stop - start) / niterations) / 
2**20), "MiB/s")
       # Time Arrow Parquet Speeds
       
       start = time.perf_counter()
       for _ in range(niterations):
           pq.read_table("foo.parquet")
       stop = time.perf_counter()
       print("PyArrow Read Bandwidth:", int(nbytes / ((stop - start) / 
niterations) / 2**20), "MiB/s")
       
       # Time In-Memory Read Speeds
       
       start = time.perf_counter()
       for _ in range(niterations):
           pq.read_table(io.BytesIO(bytes))
       stop = time.perf_counter()
       
       print("PyArrow In-Memory Bandwidth:", int(nbytes / ((stop - start) / 
niterations) / 2**20), "MiB/s")
       
       # Time In-Memory Read Speeds
       
       start = time.perf_counter()
       for _ in range(niterations):
           pq.read_table(io.BytesIO(bytes)).to_pandas()
       stop = time.perf_counter()
       
       print("PyArrow (to_pandas) Bandwidth:", int(nbytes / ((stop - start) / 
niterations) / 2**20), "MiB/s")
   
   run_benchmark()
   
   from distributed import Client
   client = Client()
   
   client.submit(run_benchmark).result()
   ```
   
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to