svilupp opened a new pull request, #399:
URL: https://github.com/apache/arrow-julia/pull/399

   This PR aims to showcase various strategies to improve the performance 
of Arrow.jl and allow for easier testing. The intention is to break it into 
separate, modular PRs for actual contributions (upon interest).
   
   TL;DR Arrow.jl beats everyone except in one case (loading large strings in 
Task 1, where uncompressed Polars+PyArrow is so fast, 5ms, that it clearly uses 
some lazy strategy and skips materializing the strings).
   
   Changes:
   - Project deps: added a direct dependency on InlineStrings
   - Arrow.Table: added a kwarg `useinlinestrings` to load strings as 
InlineStrings whenever possible (defaults to `true`)
   - Arrow.write: added chunking/partitioning as a default (same as 
[PyArrow](https://arrow.apache.org/docs/python/generated/pyarrow.feather.write_feather.html)),
 via a kwarg `chunksize` (defaults to `64000` like PyArrow; could be changed to 
take available threads into account)
   - Compression: added a thread-safe implementation with locks for 
decompression, and locks for compression (see 
https://github.com/apache/arrow-julia/pull/397)
   - TranscodingStreams (type piracy involved): added an in-place mutating 
transcode to avoid unnecessary output-buffer resizing
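   A hedged sketch of how the two new keywords could be used together (both 
`useinlinestrings` and `chunksize` are introduced by this PR, not released 
Arrow.jl; the file name and data are illustrative):

   ```julia
   using Arrow, DataFrames

   # Illustrative table: one numeric column and one short-string column with missings
   df = DataFrame(id = 1:10_000,
                  name = [isodd(i) ? "row$(i)" : missing for i in 1:10_000])

   # New default in this PR: the table is written in chunks (record batches) of
   # `chunksize` rows, so readers can decompress/materialize batches in parallel.
   Arrow.write("data.arrow", df; chunksize = 64_000)

   # New kwarg in this PR: short strings are loaded as fixed-width InlineStrings
   # (no per-string heap allocation) whenever possible.
   tbl = Arrow.Table("data.arrow"; useinlinestrings = true)
   ```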
   
   Future:
   - separate parsing the structure from materializing the buffers within 
RecordBatches. Currently both are done together (VectorIterator); separating 
them would enable better multithreading (it seems to be the secret ingredient 
of Polars/PyArrow when data is not partitioned)
   - allow native serialization/deserialization of InlineStrings (wouldn't be 
readable in other languages / would look like Ints)
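   On the last point: InlineStrings are primitive (bits) types of a fixed byte 
width, so writing them natively would expose their raw integer payload instead 
of Arrow's variable-length UTF-8 layout. A minimal illustration, assuming 
InlineStrings' `String7` type:

   ```julia
   using InlineStrings

   s = String7("abc")       # up to 7 codeunits packed into a 64-bit primitive type
   sizeof(s)                # 8 bytes, i.e. the same width as an Int64
   reinterpret(UInt64, s)   # the raw bits another language would see as an integer
   ```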
   
   **Timings (copied from the original thread for comparison):**
   **Task 1:** 10x count nonmissing elements in the first column of a table
   Data: 2 columns of 5K-long strings each, 10% of data missing, 10K rows
   Timings: (ordered by Uncompressed, LZ4, ZSTD)
   - Pandas: 1.2s, 1.5s, 1.6s
   - Polars: 5ms, 1.5s, 2.05s
   - Polars+PyArrow: 4.8ms, 0.26s, 0.42s
   - Arrow+32Threads: 0.17s, 2.3s, 1.6s
   - Arrow+1Thread: 0.2s, 2.25s, 1.9s
   
   Data: 32 partitions (!), 2 columns of 5K-long strings each, 10% of data 
missing, 10K rows
   Timings: (ordered by Uncompressed, LZ4, ZSTD)
   - Pandas: 1.2s, 1.0s, 1.2s
   - Polars: 9ms, 2.1s, 2.8s
   - Polars+PyArrow: 1.1s, 1.3s, 1.5s
   - Arrow+32Threads: 0.22s, 0.44s, 0.4s
   (Arrow.jl timing also benefits from a quick fix to TranscodingStreams)
   
   **NEW:** partitioned using the new defaults+keywords 
`write_out_compressions(df, fn; chunksize = cld(nrow(df), Threads.nthreads()));`
   - Arrow+32Threads: 0.21s, 0.26s, 0.25s
   - Arrow+1Thread: 0.15s, 0.45s, 1.18s
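   The `chunksize` heuristic above is just ceiling division (`cld`): split the 
table into roughly one chunk per available thread, so every thread gets a whole 
record batch to decompress. For example, with the 10K-row table from Task 1:

   ```julia
   nrows, nthreads = 10_000, 32
   chunksize = cld(nrows, nthreads)   # ceiling division: 313 rows per chunk
   # 32 chunks of (at most) 313 rows cover all 10_000 rows:
   @assert chunksize * nthreads >= nrows > chunksize * (nthreads - 1)
   ```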
   
   **Task 2:** 10x compute the mean of the first column (Int) of a table
   Data: 10 columns, Int64, 10M rows
   Timings: (ordered by Uncompressed, LZ4, ZSTD)
   - Pandas: 5.4s, 5.9s, 5.84s
   - Polars: 0.23s, 8s, 8.1s
   - Polars+PyArrow: 0.2s, 0.7s, 0.6s
   - Arrow+32Threads: 0.1s, 17.2s, 6.1s
   - Arrow+1Thread: 0.1s, 16.3s, 6.3s
   
   Data: 32 partitions (!), 10 columns, Int64, 10M rows
   Timings: (ordered by Uncompressed, LZ4, ZSTD)
   - Pandas: 5.6s, 2.8s, 2.6s
   - Polars: 0.23s, 12.8s, 12.6s
   - Polars+PyArrow: 6.5s, 6.5s, 6.4s
   - Arrow+32Threads: 0.1s, 1.2s, 0.7s
   (Arrow.jl timing also benefits from a quick fix to TranscodingStreams)
   
   **NEW:** partitioned using the new defaults+keywords 
`write_out_compressions(df, fn; chunksize = min(64000, cld(nrow(df), Threads.nthreads())));`
   - Arrow+32Threads: 0.23s, 2.0s, 1.8s (clearly larger chunks are better here; 
see the case above)
   - Arrow+1Thread: 0.24s, 4.3s, 4.05s
   
   Added a new task to test out the automatic _string inlining_
   **Task 3:** 10x count nonmissing elements in the first column of a table
   Data: 2 columns of 10 codeunits long strings each, 10% of data missing, 1M 
rows
   partitioned using the new defaults+keywords 
`write_out_compressions(df, fn; chunksize = min(64000, cld(nrow(df), Threads.nthreads())));`
   
   Timings: (ordered by Uncompressed, LZ4, ZSTD)
   - Arrow+32Threads (useinlinestrings=false): 0.38s, 0.44s, 0.44s
   - Arrow+1Thread (useinlinestrings=false): 0.45s, 0.48s, 0.74s
   - Arrow+32Threads (**useinlinestrings=true**): 0.1s, 0.15s, 0.2s
   - Pandas: 4.8s, 5s, 5s
   - Polars: 0.47s, 0.64s, 0.87s
   - Polars+PyArrow: 0.32s, 0.36s, 0.4s
   (for comparison, I also ran a string-length check on Polars+PyArrow: 0.32s, 
0.36s, 0.41s)
   
   And the best part? On my MacBook, I can get timings for Task 3 of around 
40ms (uncompressed) and 90ms (compressed), which is better than I could have 
hoped for :)
   
   **Setup:**
   - Machine: m5dn.8xl, NVME local drive, 32 vCPU, 96GB RAM
   - Julia 1.8.5 / Arrow 2.4.3 + this PR
   - Python 3.10
   
   Benchmarking code is provided in 
https://github.com/apache/arrow-julia/issues/393


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
