svilupp opened a new pull request, #397: URL: https://github.com/apache/arrow-julia/pull/397
Fixes https://github.com/apache/arrow-julia/issues/396

This PR proposes to:
- fix the issue with indexing by `Threads.threadid()` by locking the compressor/decompressor (see https://github.com/apache/arrow-julia/issues/396)
- initialize an array of decompressors, similar to what was previously done for compressors (e.g., https://github.com/apache/arrow-julia/blob/9b36c8b1ec9efbdc63009d1b8cd72ee705fc1711/src/Arrow.jl#L80)

Acquiring locks introduces some additional overhead (I was not confident we could use `SpinLock`s, given that we traverse several packages and call into C code, so I opted for a slightly slower `ReentrantLock`). I created a minimalistic workload (100 partitions with 1 row each) so that mostly the lock overhead is visible. I could not detect any slowdown in writing speed. Since we now initialize decompressors upfront, reading is c. 30% faster (there is no need to initialize a decompressor for each record batch):

```julia
using Arrow, Tables, Random, BenchmarkTools

# Speed test
N = 100
len = 100
t = Tables.rowtable((; x1 = map(x -> randstring(len), 1:N)));
fn = "test_partitioned.arrow"

## Before
# Write:
@btime Arrow.write($fn, Iterators.partition($t, 1); compress = :lz4)
# 2.067 ms (12912 allocations: 13.24 MiB)
# Read:
@btime t = Arrow.Table($fn);
# 718.458 μs (9305 allocations: 533.62 KiB)

## After
# Write:
@btime Arrow.write($fn, Iterators.partition($t, 1); compress = :lz4)
# 2.021 ms (12987 allocations: 13.24 MiB)
# Read:
@btime t = Arrow.Table($fn);
# 461.083 μs (9712 allocations: 543.52 KiB)
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
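For illustration, a minimal sketch of the lock-per-resource pattern described above: instead of indexing a shared pool by `Threads.threadid()` (which is unsafe under task migration), each slot is paired with a `ReentrantLock` that callers hold while using it. The names `LockableResource` and `with_resource` are hypothetical and not part of the Arrow.jl API; the "resource" here is a plain counter standing in for a (de)compressor.

```julia
# Hypothetical sketch, not Arrow.jl code: pair each pooled resource
# with a ReentrantLock and acquire it around every use.
struct LockableResource{T}
    lock::ReentrantLock
    resource::T
end
LockableResource(r) = LockableResource(ReentrantLock(), r)

# Run `f` on the resource while holding its lock.
function with_resource(f, lr::LockableResource)
    lock(lr.lock) do
        f(lr.resource)
    end
end

# Pool initialized upfront, analogous to eagerly creating decompressors.
pool = [LockableResource(Ref(0)) for _ in 1:Threads.nthreads()]

# Tasks pick a slot by index; the lock makes concurrent use safe even if
# several tasks land on the same slot.
Threads.@threads for i in 1:100
    slot = pool[mod1(i, length(pool))]
    with_resource(r -> r[] += 1, slot)
end

total = sum(lr.resource[] for lr in pool)
```

`ReentrantLock` is used rather than `Threads.SpinLock` for the reason stated above: the guarded section may cross package boundaries into C code, where spinning is risky.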
