jorisvandenbossche commented on issue #38389:
URL: https://github.com/apache/arrow/issues/38389#issuecomment-1777242835

   Re-running the benchmarks with a slightly adapted version of the script above (single-threaded, different compression codecs), and making sure no other applications were running at the same time, I actually get quite decent single-threaded performance:
   
   <details>
   <summary>Code</summary>
   
    ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   import numpy as np
   import pandas as pd
   import time
   import io
   
   # Create datasets
   x = np.random.randint(0, 100000, size=(1000000, 100))
   df = pd.DataFrame(x)
   t = pa.Table.from_pandas(df)
   pq.write_table(t, "foo.parquet")
   pq.write_table(t, "foo-lz4.parquet", compression="lz4")
   pq.write_table(t, "foo-uncompressed.parquet", compression="none")
   
    def run_benchmark(fname):
        niterations = 20
   
        # Time raw disk read speed

        start = time.perf_counter()
        for _ in range(niterations):
            with open(fname, mode="rb") as f:
                data = f.read()  # read the whole file into memory
                nbytes = len(data)
        stop = time.perf_counter()
        print("Disk Bandwidth:", int(nbytes / ((stop - start) / niterations) / 2**20), "MiB/s")

        # Time PyArrow Parquet read speed (from disk)

        start = time.perf_counter()
        for _ in range(niterations):
            pq.read_table(fname, use_threads=False)
        stop = time.perf_counter()
        print("PyArrow Read Bandwidth:", int(nbytes / ((stop - start) / niterations) / 2**20), "MiB/s")
       
        # Time in-memory Parquet read speed (no disk I/O)

        start = time.perf_counter()
        for _ in range(niterations):
            pq.read_table(io.BytesIO(data), use_threads=False)
        stop = time.perf_counter()
        print("PyArrow In-Memory Bandwidth:", int(nbytes / ((stop - start) / niterations) / 2**20), "MiB/s")
       
        # Time in-memory read plus conversion to pandas

        start = time.perf_counter()
        for _ in range(niterations):
            pq.read_table(io.BytesIO(data), use_threads=False).to_pandas(use_threads=False)
        stop = time.perf_counter()
        print("PyArrow (to_pandas) Bandwidth:", int(nbytes / ((stop - start) / niterations) / 2**20), "MiB/s")
   ```
   
   </details>
   
   ```
   In [3]: run_benchmark("foo.parquet")
   Disk Bandwidth: 2052 MiB/s
   PyArrow Read Bandwidth: 436 MiB/s
   PyArrow In-Memory Bandwidth: 459 MiB/s
   PyArrow (to_pandas) Bandwidth: 280 MiB/s
   
   In [4]: run_benchmark("foo-lz4.parquet")
   Disk Bandwidth: 2100 MiB/s
   PyArrow Read Bandwidth: 516 MiB/s
   PyArrow In-Memory Bandwidth: 569 MiB/s
   PyArrow (to_pandas) Bandwidth: 323 MiB/s
   
   In [5]: run_benchmark("foo-uncompressed.parquet")
   Disk Bandwidth: 2092 MiB/s
   PyArrow Read Bandwidth: 667 MiB/s
   PyArrow In-Memory Bandwidth: 730 MiB/s
   PyArrow (to_pandas) Bandwidth: 409 MiB/s
   ```
   
   And the file sizes are 258, 255 and 293 MB, respectively (so the actual speedup for the uncompressed file is a bit lower than the bandwidth numbers above suggest, because it's reading more MBs, but it's still faster in terms of seconds per read).
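
   To make that concrete, here is a quick back-of-the-envelope check (a sketch based on the file sizes and the "PyArrow Read Bandwidth" numbers reported above; note the sizes are in MB while the bandwidths are in MiB/s, so the unit conversion matters):

    ```python
    # Implied wall time per read, derived from the numbers reported above.
    # File sizes in MB (10**6 bytes), bandwidths in MiB/s (2**20 bytes/s).
    sizes_mb = {"snappy": 258, "lz4": 255, "uncompressed": 293}
    read_bandwidth_mib_s = {"snappy": 436, "lz4": 516, "uncompressed": 667}

    for name, size_mb in sizes_mb.items():
        size_mib = size_mb * 10**6 / 2**20
        seconds = size_mib / read_bandwidth_mib_s[name]
        print(f"{name}: ~{seconds:.2f} s per read")

    # snappy: ~0.56 s per read
    # lz4: ~0.47 s per read
    # uncompressed: ~0.42 s per read
    ```

   So in wall-clock terms the uncompressed file reads about 1.35x faster than snappy, rather than the ~1.5x the raw bandwidth ratio would imply.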

