svilupp opened a new pull request, #397:
URL: https://github.com/apache/arrow-julia/pull/397

   Fixes https://github.com/apache/arrow-julia/issues/396
   
   This PR proposes to:
   - fix the thread-safety issue with indexing by `Threads.threadid()` by locking the compressor/decompressor (see https://github.com/apache/arrow-julia/issues/396)
   - initialize an array of decompressors upfront, mirroring what was previously done for compressors (e.g., https://github.com/apache/arrow-julia/blob/9b36c8b1ec9efbdc63009d1b8cd72ee705fc1711/src/Arrow.jl#L80)
   
   Acquiring locks introduces some additional overhead. I was not confident we could use `SpinLock`s, given that the code traverses several packages and calls into C code, so I opted for the slightly slower `ReentrantLock`.
   
   I created a minimal workload (100 partitions with 1 row each) to expose mostly the lock overhead.
   I couldn't detect any slowdown in writing speed.
   Since decompressors are now initialized upfront, reading is c. 30% faster (there is no need to initialize one for each record batch):
   
   ```
   using Arrow, Tables, Random
   using BenchmarkTools: @btime  # needed for the @btime calls below
   # Speed test
   N = 100
   len = 100
   t = Tables.rowtable((; x1 = map(x -> randstring(len), 1:N)));
   fn = "test_partitioned.arrow"
   
   ## Before
   # Write:
   @btime Arrow.write($fn, Iterators.partition($t, 1); compress = :lz4)
   # 2.067 ms (12912 allocations: 13.24 MiB)
   
   # Read:
   @btime t=Arrow.Table($fn);
   # 718.458 μs (9305 allocations: 533.62 KiB)
   
   ## After
   # Write:
   @btime Arrow.write($fn, Iterators.partition($t, 1); compress = :lz4)
   # 2.021 ms (12987 allocations: 13.24 MiB)
   
   # Read:
   @btime t=Arrow.Table($fn);
   # 461.083 μs (9712 allocations: 543.52 KiB)
   ```
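   
   The upfront initialization mentioned above could look roughly like the following sketch (placeholder constructor and names, not the actual Arrow.jl code):
   
   ```julia
   # Sketch: build the lock-guarded decompressor pool once at module load,
   # so reads don't pay the construction cost per record batch.
   # `make_decompressor` is a stand-in for a real codec constructor
   # (e.g. from CodecLz4/CodecZstd); here it just returns a dummy object.
   make_decompressor() = IOBuffer()
   
   const DECOMPRESSORS = Vector{Tuple{ReentrantLock,typeof(make_decompressor())}}()
   
   # In a package this would live in the module's __init__, matching how
   # the compressor arrays are already populated.
   function __init__()
       for _ in 1:Threads.nthreads()
           push!(DECOMPRESSORS, (ReentrantLock(), make_decompressor()))
       end
   end
   ```
   
   Each entry pairs a decompressor with its own `ReentrantLock`, so a reader takes the lock for the duration of a decompression call instead of assuming exclusive per-thread ownership.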
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
