svilupp commented on issue #393: URL: https://github.com/apache/arrow-julia/issues/393#issuecomment-1465241253
TL;DR: The world makes sense again! Arrow.jl is now the fastest reader (except for one case). It took leveraging threads, skipping unnecessary buffer resizing, some codec initialization, and adding support for InlineStrings (stack-allocated strings). Details and the implementation used for testing are in [this PR](https://github.com/apache/arrow-julia/pull/399).

Here are some learnings for those of you seeking Arrow.jl performance:

- **Partition your data** (biggest benefit!) - Arrow.jl cannot leverage multiple threads unless the data it reads/writes is split into multiple "chunks" (record batches) with the same types/columns. Split your table `tbl` with `Iterators.partition(Tables.rows(tbl), chunksize)`; DataFrames.jl 1.5.0 now supports `Iterators.partition(df)` directly! See the associated [PR and additional resources](https://github.com/apache/arrow-julia/pull/400), and there is a sketch at the end of this comment.
- **Inline your strings** - Creating and accessing strings is expensive. If your data has a lot of short strings (length < 255), you can leverage the `InlineStrings.jl` library (read @quinnj's [introductory blog post](https://quinnj.hashnode.dev/inlinestringsjl-fun-with-primitive-types-and-llvm-in-julia)). The simplest use is to call `inlinestrings()` on your vector of strings, but that requires materializing the normal strings first (see the sketch at the end of this comment). Alternatively, you can avoid this materialization by including [the following file](https://github.com/svilupp/arrow-julia/blob/arrow-turbo/src/inlinestrings.jl) from the PR mentioned above and running `_inlinestrings()` over your table columns, e.g., for table `t`, I would run `Arrow.columns(t) .= _inlinestrings.(Arrow.columns(t))`. (I prefixed it with `_` because it has weird semantics and I didn't want to hijack the original `inlinestrings`: it simply swaps the String type of the data wrapper `Arrow.List` without materializing any strings.)

The rest is probably not suitable for most users, as it involves changing the package internals:

- **Initialize your codecs** - When using compression (LZ4 or ZSTD), the codecs need to be initialized first. For repeated compression/decompression, TranscodingStreams.jl, the wrapper package for all compression work, therefore advises you to [pre-initialize your codecs and re-use them](https://juliaio.github.io/TranscodingStreams.jl/latest/examples/#Transcode-lots-of-strings-1). This was done for [compressors](https://github.com/apache/arrow-julia/blob/9b36c8b1ec9efbdc63009d1b8cd72ee705fc1711/src/Arrow.jl#L80) but not for decompressors; you can do the same for decompressors (a PR has been opened). There is a sketch of the pattern at the end of this comment.
- **Don't resize buffers when you don't need to** - During decompression, TranscodingStreams.jl's `transcode()` function keeps resizing your output buffer in fixed increments until it's done. In my benchmarks above, we spent as much time resizing buffers as decompressing the data. Fortunately, the Arrow IPC file format specification requires the size of each column to be saved in the metadata, so we actually know the size of the output buffer we need! I've upstreamed the [PR here](https://github.com/JuliaIO/TranscodingStreams.jl/pull/134).
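Here is a minimal sketch of the partitioning advice above. The table, file name, and chunk size are made up for illustration; start Julia with multiple threads (e.g. `julia -t auto`) to see the benefit.

```julia
using Arrow, Tables

# Hypothetical example table; any Tables.jl-compatible source works.
tbl = (a = rand(1_000_000), b = rand(1:100, 1_000_000))

# Writing the rows in chunks produces one record batch per chunk,
# which is what lets Arrow.jl spread the reading/writing work across threads.
chunksize = 100_000
Arrow.write("data.arrow", Iterators.partition(Tables.rows(tbl), chunksize))

# Each record batch can then be decompressed/processed on a separate thread.
t = Arrow.Table("data.arrow")
```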
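And a sketch of the simple `inlinestrings()` route (the DataFrame and its column names are made up):

```julia
using Arrow, DataFrames, InlineStrings

# Hypothetical table with a column of short strings.
df = DataFrame(id = 1:3, city = ["Oslo", "Lima", "Pune"])

# inlinestrings picks the smallest InlineString type that fits every element,
# so subsequent work on the column avoids per-string heap allocations.
df.city = inlinestrings(df.city)

Arrow.write("cities.arrow", df)
```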
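For the codec-initialization point, this is roughly the re-use pattern from the TranscodingStreams.jl example linked above, applied to ZSTD (the payloads are made up):

```julia
using CodecZstd, TranscodingStreams

# Hypothetical payloads to round-trip.
payloads = [rand(UInt8, 10_000) for _ in 1:100]

compressor = ZstdCompressor()
decompressor = ZstdDecompressor()

# Pay the codec setup cost once, then reuse the codec objects for every call
# instead of letting transcode allocate and free them each time.
TranscodingStreams.initialize(compressor)
TranscodingStreams.initialize(decompressor)

compressed = [transcode(compressor, p) for p in payloads]
restored = [transcode(decompressor, c) for c in compressed]

TranscodingStreams.finalize(compressor)
TranscodingStreams.finalize(decompressor)

@assert restored == payloads
```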
