Re: [I] Parquet deserialization speeds slower on Linux [arrow]

via GitHub Tue, 24 Oct 2023 06:43:23 -0700


jorisvandenbossche commented on issue #38389:
URL: https://github.com/apache/arrow/issues/38389#issuecomment-1777234366


   > > But for linux you mentioned above a similar number with threads. So that 
means you see hardly any perf improvement with threads on linux
   > 
   > Yes. That's correct. To be clear though, I'm currently more confused about 
only getting 150-200 MB/s deserializing integers on a single thread. That seems 
very strange to me.
   
   Yes, I understand (and dask uses `use_threads=False` anyway, so it mostly 
depends on this single-threaded performance). But to avoid mixing too many 
different issues at once, it might be better to focus the various timings in 
this issue on single-threaded performance.
   
   > Parquet -> Arrow has to do a nontrivial amount of work
   
   Parquet is indeed a complex file format. In addition to the decompression, 
there is also the decoding. The file here uses dictionary encoding, though, 
which I would expect to be quite fast. I also quickly tested the plain and 
delta_binary_packed encodings, and those actually give slower reads than the 
default in this case.
   
   I was also wondering if we could get an idea of which bandwidth one can 
expect for just the decompression, to have some point of comparison. The snappy 
readme (https://github.com/google/snappy) itself mentions decompression at 
500MB/s for Intel Core i7. Running Florian's snippet above, I actually only get 
around 100MB/s for the SNAPPY decompression.
   
   > Arrow uses SNAPPY compression by default
   
   Quickly testing with another compression (`pq.write_table(t, 
"foo_lz4.parquet", compression="lz4")`), I get consistently faster reads with 
LZ4 compared to SNAPPY for this dataset, but only around 5-10% faster. That is 
not a huge difference, but in general one can always tweak the encoding and 
compression settings for a specific dataset to achieve optimal read 
performance. 
   
   Using no compression at all (`compression="none"`) also gives some speed-up 
(but of course that trades storage size for read speed, and on e.g. S3 it might 
not even be beneficial).

