svilupp opened a new pull request, #399:
URL: https://github.com/apache/arrow-julia/pull/399
This PR aims to showcase several strategies for improving the performance of Arrow.jl and to allow for easier testing. The intention is for it to be broken up into separate, modular PRs for actual contributions (upon interest).

TL;DR: Arrow.jl beats everyone except in one case (loading large strings in Task 1, where uncompressed Polars+PyArrow is so fast -- 5ms -- that it clearly uses some lazy strategy and skips materializing the strings).

Changes:
- Project deps: added a direct dependency on InlineStrings
- `Arrow.Table`: added kwarg `useinlinestrings` to load strings as InlineStrings whenever possible (defaults to `true`)
- `Arrow.write`: added chunking/partitioning as a default (same as [PyArrow](https://arrow.apache.org/docs/python/generated/pyarrow.feather.write_feather.html)); the kwarg `chunksize` defaults to `64000` like PyArrow, and could be changed to consider the number of available threads
- Compression: added a thread-safe implementation with locks for decompression, and added locks for compression (see https://github.com/apache/arrow-julia/pull/397)
- TranscodingStreams (pirates involved): added an in-place, mutating `transcode` to avoid unnecessary output-buffer resizing

Future:
- Separate parsing the structure from materializing the buffers within RecordBatches. Currently, both are done together (`VectorIterator`), but separating them would enable better multithreading (it seems to be the secret ingredient of Polars/PyArrow when data is not partitioned)
- Allow native serialization/deserialization of InlineStrings (they wouldn't be readable in other languages / would look like Ints)

**Timings (copied from the original thread for comparison):**

**Task 1:** 10x count non-missing elements in the first column of a table

Data: 2 columns of 5K-character-long strings each, 10% of data missing, 10K rows
Timings (ordered by Uncompressed, LZ4, ZSTD):
- Pandas: 1.2s, 1.5s, 1.6s
- Polars: 5ms, 1.5s, 2.05s
- Polars+PyArrow: 4.8ms, 0.26s, 0.42s
- Arrow+32Threads: 0.17s, 2.3s, 1.6s
- Arrow+1Thread: 0.2s, 2.25s, 1.9s

Data: 32 partitions (!), 2 columns of 5K-character-long strings each, 10% of data missing, 10K rows
Timings (ordered by Uncompressed, LZ4, ZSTD):
- Pandas: 1.2s, 1.0s, 1.2s
- Polars: 9ms, 2.1s, 2.8s
- Polars+PyArrow: 1.1s, 1.3s, 1.5s
- Arrow+32Threads: 0.22s, 0.44s, 0.4s (the Arrow.jl timing also benefits from a quick fix to TranscodingStreams)

**NEW:** partitioned using the new defaults/keywords, `write_out_compressions(df, fn; chunksize = cld(nrow(df), Threads.nthreads()));`
- Arrow+32Threads: 0.21s, 0.26s, 0.25s
- Arrow+1Thread: 0.15s, 0.45s, 1.18s

**Task 2:** 10x mean value of the Int column in the first column of a table

Data: 10 columns, Int64, 10M rows
Timings (ordered by Uncompressed, LZ4, ZSTD):
- Pandas: 5.4s, 5.9s, 5.84s
- Polars: 0.23s, 8s, 8.1s
- Polars+PyArrow: 0.2s, 0.7s, 0.6s
- Arrow+32Threads: 0.1s, 17.2s, 6.1s
- Arrow+1Thread: 0.1s, 16.3s, 6.3s

Data: 32 partitions (!), 10 columns, Int64, 10M rows
Timings (ordered by Uncompressed, LZ4, ZSTD):
- Pandas: 5.6s, 2.8s, 2.6s
- Polars: 0.23s, 12.8s, 12.6s
- Polars+PyArrow: 6.5s, 6.5s, 6.4s
- Arrow+32Threads: 0.1s, 1.2s, 0.7s (the Arrow.jl timing also benefits from a quick fix to TranscodingStreams)

**NEW:** partitioned using the new defaults/keywords, `write_out_compressions(df, fn; chunksize = min(64000, cld(nrow(df), Threads.nthreads())));`
- Arrow+32Threads: 0.23s, 2.0s, 1.8s (clearly, bigger chunks are better here -- see the case above)
- Arrow+1Thread: 0.24s, 4.3s, 4.05s

Added a new task to test out the automatic _string inlining_:

**Task 3:** 10x count non-missing elements in the first column of a table

Data: 2 columns of strings 10 codeunits long each, 10% of data missing, 1M rows, partitioned using the new defaults/keywords, `write_out_compressions(df, fn; chunksize = min(64000, cld(nrow(df), Threads.nthreads())));`
Timings (ordered by Uncompressed, LZ4, ZSTD):
- Arrow+32Threads (`useinlinestrings=false`): 0.38s, 0.44s, 0.44s
- Arrow+1Thread (`useinlinestrings=false`): 0.45s, 0.48s, 0.74s
- Arrow+32Threads (**`useinlinestrings=true`**): 0.1s, 0.15s, 0.2s
- Pandas: 4.8s, 5s, 5s
- Polars: 0.47s, 0.64s, 0.87s
- Polars+PyArrow: 0.32s, 0.36s, 0.4s

(For comparison, I also ran a string-length check on Polars+PyArrow: 0.32s, 0.36s, 0.41s.)

And the best part? On my MacBook, I can get timings for Task 3 of around 40ms (uncompressed) and 90ms (compressed), which is better than I could have hoped for :)

**Setup:**
- Machine: m5dn.8xl, local NVMe drive, 32 vCPUs, 96GB RAM
- Julia 1.8.5 / Arrow 2.4.3 + this PR
- Python 3.10

Benchmarking code is provided in https://github.com/apache/arrow-julia/issues/393

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use
the URL above to go to the specific comment.
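For reference, the chunk-size heuristic used in the benchmark calls above can be sketched in plain Base Julia (`write_out_compressions` is the benchmark's own helper from #393, and the `chunksize` kwarg on `Arrow.write` is what this PR adds; `chunk_size` and `row_chunks` below are hypothetical names for illustration):

```julia
# Sketch of the chunking heuristic: cap chunks at 64_000 rows
# (PyArrow's default), but never create more chunks than needed to
# give each thread one chunk. `nrows` stands in for nrow(df).
chunk_size(nrows, nthreads) = min(64_000, cld(nrows, nthreads))

# Split a row range into chunks of that size using only Base.
function row_chunks(nrows, nthreads)
    cs = chunk_size(nrows, nthreads)
    collect(Iterators.partition(1:nrows, cs))
end

# 10M rows on 32 threads: the 64_000-row cap binds.
# 10K rows on 32 threads: cld(10_000, 32) == 313 rows per chunk.
```

On released Arrow.jl, a similar effect can already be had by partitioning explicitly, e.g. `Arrow.write(file, Tables.partitioner(chunks))`; the PR's contribution is making a sensible chunking the default.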
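The thread-safe compression/decompression change boils down to guarding shared codec objects, whose internal buffers are mutable, with locks. A minimal sketch of the pattern (`SafeCodec` and `safe_transcode` are hypothetical stand-ins, not the PR's actual code):

```julia
# Locking pattern for making a shared codec safe to use from
# multiple tasks: every use of the codec's mutable internals goes
# through a per-codec ReentrantLock.
struct SafeCodec{C}
    codec::C
    lock::ReentrantLock
end
SafeCodec(codec) = SafeCodec(codec, ReentrantLock())

# Run `f(codec, data)` (e.g. a transcode call) while holding the
# lock, so concurrent tasks cannot corrupt the codec's buffers.
function safe_transcode(f, s::SafeCodec, data)
    lock(s.lock) do
        f(s.codec, data)
    end
end
```

A per-codec lock (rather than one global lock) keeps independent codec instances, e.g. one per thread in a pool, free to run in parallel; only concurrent use of the *same* codec serializes.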
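For context on why `useinlinestrings` helps so much in Task 3: InlineStrings.jl stores short strings in fixed-size bits types instead of heap-allocated `String`s, so a vector of them is one flat memory block with no per-element pointers or GC pressure. A self-contained toy version of the idea (`Str7` is a simplified stand-in for `InlineStrings.String7`, not the real implementation):

```julia
# Toy inline string: up to 7 codeunits stored directly in the
# struct, plus a length byte -- 8 bytes total, no heap allocation.
struct Str7
    data::NTuple{7,UInt8}
    len::UInt8
end

function Str7(s::String)
    n = ncodeunits(s)
    n <= 7 || throw(ArgumentError("string too long for Str7"))
    buf = ntuple(i -> i <= n ? codeunit(s, i) : 0x00, 7)
    return Str7(buf, UInt8(n))
end

# Convert back to a regular String for display/interop.
Base.String(s::Str7) = String(UInt8[s.data[i] for i in 1:s.len])

# Because Str7 is a bits type, Vector{Str7} is stored inline,
# which is what makes scans (like counting non-missing) so fast.
```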
