fjetter commented on issue #38389:
URL: https://github.com/apache/arrow/issues/38389#issuecomment-1774813362

   FWIW, I slightly modified the above script to run each operation N times, since I noticed quite a bit of variance on my machine (M1 2020 MacBook).
   
   <details>
   
   ```python
   # Create dataset
   import pyarrow as pa
   import pyarrow.parquet as pq
   import numpy as np
   import pandas as pd
   import time
   import io
   
   x = np.random.randint(0, 100000, size=(1000000, 100))
   df = pd.DataFrame(x)
   t = pa.Table.from_pandas(df)
   niterations = 20
   
   # Write to local parquet file
   
   pq.write_table(t, "foo.parquet")
   
   
   # Time Disk speeds
   
   start = time.perf_counter()
   for _ in range(niterations):
       with open("foo.parquet", mode="rb") as f:
           bytes = f.read()
           nbytes = len(bytes)
   stop = time.perf_counter()
   print("Disk Bandwidth:", int(nbytes / ((stop - start) / niterations) / 2**20), "MiB/s")
   
   
   # Time Arrow Parquet Speeds
   
   start = time.perf_counter()
   for _ in range(niterations):
       _ = pq.read_table("foo.parquet")
   stop = time.perf_counter()
   print("PyArrow Read Bandwidth:", int(nbytes / ((stop - start) / niterations) / 2**20), "MiB/s")
   
   
   # Time In-Memory Read Speeds
   
   start = time.perf_counter()
   for _ in range(niterations):
       pq.read_table(io.BytesIO(bytes))
   stop = time.perf_counter()
   
   print("PyArrow In-Memory Bandwidth:", int(nbytes / ((stop - start) / niterations) / 2**20), "MiB/s")
   
   # Time In-Memory Read + to_pandas Conversion Speeds
   
   start = time.perf_counter()
   for _ in range(niterations):
       pq.read_table(io.BytesIO(bytes)).to_pandas()
   stop = time.perf_counter()
   
   print("PyArrow (to_pandas) Bandwidth:", int(nbytes / ((stop - start) / niterations) / 2**20), "MiB/s")
   ```
   
   </details>
   
   and I get 
   
   ```
   Disk Bandwidth: 5154 MiB/s
   PyArrow Read Bandwidth: 2294 MiB/s
   PyArrow In-Memory Bandwidth: 2439 MiB/s
   PyArrow (to_pandas) Bandwidth: 1142 MiB/s
   ```
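
The numbers above are means over all iterations, so they hide the run-to-run variance that motivated the change. A minimal sketch of how each iteration could be timed separately and summarized by median and range is shown here. The helper names `bandwidth_samples` and `summarize` are hypothetical and not part of the original script, and the stand-in workload just copies a buffer so the example runs without PyArrow.

```python
import statistics
import time


def bandwidth_samples(fn, nbytes, niterations=20):
    """Time each call to fn separately; return one MiB/s value per iteration."""
    samples = []
    for _ in range(niterations):
        start = time.perf_counter()
        fn()
        elapsed = time.perf_counter() - start
        samples.append(nbytes / elapsed / 2**20)
    return samples


def summarize(samples):
    """Return (median, min, max) of the per-iteration bandwidths."""
    return statistics.median(samples), min(samples), max(samples)


if __name__ == "__main__":
    # Stand-in workload (an assumption for illustration): copy an 8 MiB buffer.
    payload = bytes(8 * 2**20)
    samples = bandwidth_samples(lambda: bytearray(payload), len(payload), niterations=5)
    median, lowest, highest = summarize(samples)
    print(f"median {median:.0f} MiB/s (min {lowest:.0f}, max {highest:.0f})")
```

To apply this to the script above, one would presumably pass something like `lambda: pq.read_table("foo.parquet")` as `fn`, together with the `nbytes` value already computed there.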

