fjetter commented on issue #38389:
URL: https://github.com/apache/arrow/issues/38389#issuecomment-1776908762

   > I found that x.copy() ran in 2 GB/s and pq.read_table(io.BytesIO(bytes)) 
ran in 180 MB/s.
   
   I'm not sure if this comparison is actually fair and valid. Parquet -> Arrow 
has to do a nontrivial amount of work. Even your random data is encoded and 
compressed. (See `pq.ParquetFile("foo.parquet").metadata.to_dict()` to inspect 
the metadata)
   
   
![image](https://github.com/apache/arrow/assets/8629629/daafe8d9-b8de-43c9-b908-c153124e9b1b)
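
   For reference, here is a minimal, self-contained sketch of that kind of inspection. The file name, column name, and data below are placeholders (not the original benchmark), and the exact encodings you see will depend on the writer defaults:
   
    ```python
    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Placeholder data/file name: write one random float column and inspect
    # how the first column chunk of the first row group is stored.
    table = pa.table({"a": np.random.random(1_000_000)})
    pq.write_table(table, "foo.parquet")

    meta = pq.ParquetFile("foo.parquet").metadata
    col = meta.row_group(0).column(0)
    print(col.encodings)    # e.g. ('PLAIN', 'RLE', ...) depending on writer defaults
    print(col.compression)  # pyarrow's default write compression is SNAPPY
    ```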
   
   I also ran this on Colab and got something like this from your benchmark output:
   
   ```
   Disk Bandwidth: 1636 MiB/s
   PyArrow Read Bandwidth: 231 MiB/s
   PyArrow In-Memory Bandwidth: 220 MiB/s
   ```
   
   I then went ahead and ran
   
   ```python
    import pickle
    import time

    import pyarrow as pa

    # `x` is the table/DataFrame from the original benchmark above.
    pickled_df = pickle.dumps(x)
    compressedb = pa.compress(pickled_df, "SNAPPY")
    nbytes = len(compressedb)

    start = time.time()
    pa.decompress(compressedb, decompressed_size=len(pickled_df), codec="SNAPPY")
    stop = time.time()

    print("SNAPPY Decompress Bandwidth:", int(nbytes / (stop - start) / 2**20), "MiB/s")
   ```
   
   which gives me
   
   `SNAPPY Decompress Bandwidth: 199 MiB/s`
   
   so SNAPPY decompression alone already runs in the same vicinity as the Parquet read bandwidth.
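
   One way to see how much of that cost is the decompression itself would be to write the same kind of data with `compression="NONE"` and compare read bandwidths. A rough sketch (placeholder data, not the original benchmark script):
   
    ```python
    import io
    import time

    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Rough sketch: compare read bandwidth with and without SNAPPY compression
    # to separate decompression cost from the Parquet decoding itself.
    table = pa.table({"a": np.random.random(10_000_000)})

    for compression in ["SNAPPY", "NONE"]:
        buf = io.BytesIO()
        pq.write_table(table, buf, compression=compression)
        data = buf.getvalue()

        start = time.time()
        pq.read_table(io.BytesIO(data))
        stop = time.time()
        print(compression, "read bandwidth:",
              int(len(data) / (stop - start) / 2**20), "MiB/s")
    ```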
   

