emkornfield opened a new pull request #7143:
URL: https://github.com/apache/arrow/pull/7143


   Adds two implementations of a BitRunReader, which returns runs of
   consecutive bits: whether the bits are set or not set, and how many
   bits are in the run.
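
To make the idea concrete, here is a minimal sketch of a run reader over a bitmap. It is not either of the two implementations in this PR: the names `BitRun` and `SimpleBitRunReader` are hypothetical, and the sketch assumes LSB-first bit order within each byte and walks one bit at a time.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical types for illustration only; not the API added by this PR.
struct BitRun {
  int64_t length;  // number of consecutive bits that share the same value
  bool set;        // the value shared by those bits
};

class SimpleBitRunReader {
 public:
  // Assumes LSB-first bit order within each byte.
  SimpleBitRunReader(const uint8_t* bitmap, int64_t start_offset, int64_t length)
      : bitmap_(bitmap), position_(start_offset), end_(start_offset + length) {
    assert(bitmap != nullptr || length == 0);
  }

  // Returns the next run of identical bits. A run of length 0 means the
  // reader is exhausted.
  BitRun NextRun() {
    if (position_ >= end_) return {0, false};
    const bool value = GetBit(position_);
    const int64_t start = position_;
    // Naive bit-at-a-time scan; a tuned reader would process whole words.
    while (position_ < end_ && GetBit(position_) == value) ++position_;
    return {position_ - start, value};
  }

 private:
  bool GetBit(int64_t i) const { return ((bitmap_[i / 8] >> (i % 8)) & 1) != 0; }

  const uint8_t* bitmap_;
  int64_t position_;
  int64_t end_;
};
```

For the single byte `0x38`, this sketch yields three runs: 3 bits not set, 3 bits set, then 2 bits not set.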
   
   - Adds benchmarks comparing the two implementations under different
   distributions.
   
   - Adds the reader for use in ParquetWriter (there is a second
   location, on the Nullable terminal node, that I left unchanged because
   it showed a performance drop of 30%. I think this is due to the issue
   described in the next bullet point, or because BitVisitor is getting
   vectorized to something more efficient).
   
   - Refactors GetBatchedSpaced and GetBatchedSpacedWithDict:
     1.  Uses a single templated method, with an added template
         parameter, so that the two code paths can share it.
     2.  Does all checking for out of bounds indices in one go instead
         of on each pass through the literal (this is a slight behavior
         change as the index returned will be different).
     3.  Makes use of the BitRunReader.
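
As an illustration of checking indices "in one go", the sketch below computes the minimum and maximum of a batch of dictionary indices and compares them against the dictionary length once. The function name `IndicesInBounds` and its signature are hypothetical, and this is only one way such a check could be written.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper for illustration; not taken from the PR.
// Returns true if every index in [indices, indices + n) lies in
// [0, dict_length). The bounds comparison happens once per batch.
bool IndicesInBounds(const int32_t* indices, int64_t n, int32_t dict_length) {
  assert(n >= 0);
  if (n == 0) return true;
  int32_t min_index = std::numeric_limits<int32_t>::max();
  int32_t max_index = std::numeric_limits<int32_t>::min();
  for (int64_t i = 0; i < n; ++i) {
    if (indices[i] < min_index) min_index = indices[i];
    if (indices[i] > max_index) max_index = indices[i];
  }
  return min_index >= 0 && max_index < dict_length;
}
```

A check of this shape only reports that some index in the batch is invalid. It does not stop at the first offending element the way a per-element check does.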
   
   Based on the BM_ColumnRead benchmarks, this seems to make performance
   worse by 50%.  This was surprising to me, and my current working theory
   is that the nullable benchmarks present the worst-case scenario: every
   other element is null, and therefore the overhead of invoking the call
   relative to the existing code is high (in the perf profile, calls to
   NextRun() jump to the top after this change).  Before making a decision
   on reverting the use of BitRunReader here, I'd like to implement a more
   comprehensive set of benchmarks to test this hypothesis.
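
The hypothesis can be illustrated by counting how many runs a run-based reader would have to emit for different null distributions. The helpers below are hypothetical and are not part of the PR or its benchmarks. They show only that an alternating pattern produces one run per element, and they do not measure actual cost.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: number of runs of identical values in `validity`,
// i.e. how many times a run-based reader would have to return a run.
int64_t CountRuns(const std::vector<bool>& validity) {
  int64_t runs = 0;
  for (std::size_t i = 0; i < validity.size(); ++i) {
    if (i == 0 || validity[i] != validity[i - 1]) ++runs;
  }
  return runs;
}

// Every other element is null: valid, null, valid, null, ...
std::vector<bool> AlternatingValidity(std::size_t n) {
  std::vector<bool> v(n);
  for (std::size_t i = 0; i < n; ++i) v[i] = (i % 2 == 0);
  return v;
}

// Valid and null elements alternate in blocks of `block` elements.
std::vector<bool> ClusteredValidity(std::size_t n, std::size_t block) {
  assert(block > 0);
  std::vector<bool> v(n);
  for (std::size_t i = 0; i < n; ++i) v[i] = ((i / block) % 2 == 0);
  return v;
}
```

With 1024 elements, the alternating pattern gives 1024 runs, while blocks of 128 give 8 runs, so the per-run overhead is amortized over far fewer elements in the first case.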
   
   Other TODOs
   - [ ] Need to revert change to testing submodule hash
   - [ ] Add attribution to Wikipedia for InvertRemainingBits
   - [ ] Fix some unintelligible comments.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

