svilupp commented on issue #393: URL: https://github.com/apache/arrow-julia/issues/393#issuecomment-1465241253
TL;DR: The world makes sense again! Arrow.jl is now the fastest reader (except for one case). It took leveraging threads, skipping unnecessary buffer resizing, some codec initialization, and adding support for InlineStrings (stack-allocated strings). Details and the implementation used for testing are in [this PR](https://github.com/apache/arrow-julia/pull/399).

Here are some learnings for those of you seeking Arrow.jl performance:

- **Partition your data** (biggest benefit!) - Arrow.jl cannot leverage multiple threads unless the data it reads/writes is split into multiple "chunks" (record batches) with the same types/columns. Split your table `tbl` with `Iterators.partition(Tables.rows(tbl), chunksize)`; DataFrames.jl 1.5.0 now supports `Iterators.partition(df)` directly! See the associated [PR and additional resources](https://github.com/apache/arrow-julia/pull/400), and there is a sketch at the end of this comment.
- **Inline your strings** - Creating and accessing strings is expensive. If your data has a lot of short strings (length < 255), you can leverage the `InlineStrings.jl` library (read @quinnj's [introductory blog post](https://quinnj.hashnode.dev/inlinestringsjl-fun-with-primitive-types-and-llvm-in-julia)). The simplest use is to call `inlinestrings()` on your vector of strings, but that requires materializing the normal strings first (see the sketch at the end of this comment). Alternatively, you can avoid this materialization by including [the following file](https://github.com/svilupp/arrow-julia/blob/arrow-turbo/src/inlinestrings.jl) from the PR mentioned above and running `_inlinestrings()` over your table columns, e.g., for table `t`, I would run `Arrow.columns(t) .= _inlinestrings.(Arrow.columns(t))`. (I prefixed it with `_` because it has weird semantics and I didn't want to hijack the original `inlinestrings`: it simply swaps the String type of the data wrapper `Arrow.List` without materializing any strings.)

The rest is probably not suitable for most users, as it involves changing the package internals:

- **Initialize your codecs** - When using compression (LZ4 or ZSTD), the codecs need to be initialized first. For repeated compression/decompression, TranscodingStreams.jl, the wrapper package for all compression work, therefore advises you to [pre-initialize your codecs and re-use them](https://juliaio.github.io/TranscodingStreams.jl/latest/examples/#Transcode-lots-of-strings-1). This was done for [compressors](https://github.com/apache/arrow-julia/blob/9b36c8b1ec9efbdc63009d1b8cd72ee705fc1711/src/Arrow.jl#L80) but not for decompressors; you can do the same for decompressors (a PR has been opened). There is a sketch of the pattern at the end of this comment.
- **Don't resize buffers when you don't need to** - During decompression, TranscodingStreams.jl's `transcode()` function keeps resizing your output buffer in fixed increments until it's done. In my benchmarks above, we spent as much time resizing buffers as decompressing the data. Fortunately, the Arrow IPC file format specification requires the size of each column to be saved in the metadata, so we actually know the size of the output buffer we need! I've upstreamed the [PR here](https://github.com/JuliaIO/TranscodingStreams.jl/pull/134).
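Here is a minimal sketch of the partitioning advice above. The table, file name, and chunk size are made up for illustration; start Julia with multiple threads (e.g. `julia -t auto`) to see the benefit.

```julia
using Arrow, Tables

# Hypothetical example table; any Tables.jl-compatible source works.
tbl = (a = rand(1_000_000), b = rand(1:100, 1_000_000))

# Writing the rows in chunks produces one record batch per chunk,
# which is what lets Arrow.jl spread the reading/writing work across threads.
chunksize = 100_000
Arrow.write("data.arrow", Iterators.partition(Tables.rows(tbl), chunksize))

# Each record batch can then be decompressed/processed on a separate thread.
t = Arrow.Table("data.arrow")
```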
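And a sketch of the simple `inlinestrings()` route (the DataFrame and its column names are made up):

```julia
using Arrow, DataFrames, InlineStrings

# Hypothetical table with a column of short strings.
df = DataFrame(id = 1:3, city = ["Oslo", "Lima", "Pune"])

# inlinestrings picks the smallest InlineString type that fits every element,
# so subsequent work on the column avoids per-string heap allocations.
df.city = inlinestrings(df.city)

Arrow.write("cities.arrow", df)
```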
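For the codec-initialization point, this is roughly the re-use pattern from the TranscodingStreams.jl example linked above, applied to ZSTD (the payloads are made up):

```julia
using CodecZstd, TranscodingStreams

# Hypothetical payloads to round-trip.
payloads = [rand(UInt8, 10_000) for _ in 1:100]

compressor = ZstdCompressor()
decompressor = ZstdDecompressor()

# Pay the codec setup cost once, then reuse the codec objects for every call
# instead of letting transcode allocate and free them each time.
TranscodingStreams.initialize(compressor)
TranscodingStreams.initialize(decompressor)

compressed = [transcode(compressor, p) for p in payloads]
restored = [transcode(decompressor, c) for c in compressed]

TranscodingStreams.finalize(compressor)
TranscodingStreams.finalize(decompressor)

@assert restored == payloads
```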
