jorisvandenbossche commented on issue #38389: URL: https://github.com/apache/arrow/issues/38389#issuecomment-1777234366
> > But for linux you mentioned above a similar number with threads. So that means you see hardly any perf improvement with threads on linux
>
> Yes. That's correct. To be clear though, I'm currently more confused about only getting 150-200 MB/s deserializing integers on a single thread. That seems very strange to me.

Yes, I understand (and dask uses `use_threads=False` anyway, so it mostly depends on this single-threaded performance). But then, to not mix too many different issues at once, it might be better to focus the various timings in this issue on single-threaded performance.

> Parquet -> Arrow has to do a nontrivial amount of work

Parquet is indeed a complex file format. In addition to the decompression, there is also the decoding. The file here will use dictionary encoding, though, and I would expect that to be quite fast. I also quickly tested the plain and delta_binary_packed encodings, and those actually give slower reads than the default in this case.

I was also wondering if we could get an idea of which bandwidth one can expect for just the decompression, to have some point of comparison. The snappy readme (https://github.com/google/snappy) itself mentions decompression at 500MB/s for an Intel Core i7. Running the snippet of Florian above, I actually only get around 100MB/s for the SNAPPY decompression.

> Arrow uses SNAPPY compression by default

Quickly testing with another compression (`pq.write_table(t, "foo_lz4.parquet", compression="lz4")`), I get consistently faster reads with LZ4 compared to SNAPPY for this dataset, but only around 5-10% faster. Not a huge difference, but in general one can always tweak the encoding and compression settings for a specific dataset to achieve optimal read performance.

Using no compression at all (`compression="none"`) also gives some speed-up (but of course that trades storage size for read speed, and on e.g. S3 that might not even be beneficial).
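To make the "point of comparison" idea concrete, here is a minimal sketch of timing a codec in isolation and reporting MB/s of decompressed output. It is not the snippet referred to above: `measure_mb_per_s` is a hypothetical helper, the data is a synthetic low-cardinality int64 column, and zlib from the standard library stands in for Snappy (which is not in the stdlib), so the absolute numbers are not comparable to the ones quoted in this thread.

```python
import random
import struct
import time
import zlib


def measure_mb_per_s(func, nbytes, repeats=5):
    """Best-of-`repeats` throughput in MB/s, where `func` produces `nbytes` bytes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return nbytes / best / 1e6


# Synthetic stand-in for an integer column: 200k int64 values drawn from
# 1000 distinct values (an assumption, not the dataset from this issue).
rng = random.Random(0)
values = [rng.randrange(1000) for _ in range(200_000)]
raw = struct.pack(f"<{len(values)}q", *values)

# zlib level 1 is used only as a stdlib stand-in for a fast codec.
compressed = zlib.compress(raw, 1)

rate = measure_mb_per_s(lambda: zlib.decompress(compressed), len(raw))
print(f"raw: {len(raw) / 1e6:.1f} MB, "
      f"ratio: {len(raw) / len(compressed):.1f}x, "
      f"decompress: {rate:.0f} MB/s")
```

The same helper could wrap a single-threaded Parquet read, e.g. `lambda: pq.read_table(path, use_threads=False)` with `nbytes` set to the in-memory size of the resulting table, which would put the file-read and raw-decompression numbers on the same scale.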
