[jira] [Created] (ORC-1142) [C++] Unroll loops in BooleanRleDecoderImpl::next()

Quanlong Huang (Jira) Mon, 28 Mar 2022 18:37:06 -0700

Quanlong Huang created ORC-1142:
-----------------------------------

             Summary: [C++] Unroll loops in BooleanRleDecoderImpl::next()
                 Key: ORC-1142
                 URL: https://issues.apache.org/jira/browse/ORC-1142
             Project: ORC
          Issue Type: Improvement
          Components: C++
            Reporter: Quanlong Huang
         Attachments: bool-rle-opt.patch


BooleanRleDecoderImpl::next() takes some portion of the time when reading 
nullable columns. Here is the histogram of running orc-scan on a tpcds 
store_sales data file:
{code:java}
Samples: 7K of event 'cycles:ppp', Event count (approx.): 8731129738
Overhead  Command   Shared Object        Symbol
  31.71%  orc-scan  orc-scan             [.] orc::Decimal64ColumnReader::next
   7.44%  orc-scan  orc-scan             [.] 
orc::RleDecoderV2::copyDataFromBuffer
   7.29%  orc-scan  orc-scan             [.] 
snappy::SnappyDecompressor::DecompressAllTags<snappy::SnappyArrayWriter>
   6.19%  orc-scan  orc-scan             [.] orc::BooleanRleDecoderImpl::next
   5.54%  orc-scan  orc-scan             [.] orc::RleDecoderV2::nextDirect
   5.37%  orc-scan  orc-scan             [.] orc::RleDecoderV2::nextShortRepeats
   4.58%  orc-scan  orc-scan             [.] snappy::(anonymous 
namespace)::IncrementalCopy
   3.66%  orc-scan  orc-scan             [.] snappy::(anonymous 
namespace)::UnalignedCopy64
   3.43%  orc-scan  orc-scan             [.] orc::RleDecoderV2::nextDelta
   2.81%  orc-scan  orc-scan             [.] 
snappy::SnappyArrayWriter::AppendFromSelf
   2.41%  orc-scan  orc-scan             [.] orc::RleDecoderV2::next
   2.22%  orc-scan  orc-scan             [.] orc::RleDecoderV2::readByte
   1.78%  orc-scan  orc-scan             [.] snappy::(anonymous 
namespace)::UnalignedCopy128
   1.61%  orc-scan  libc-2.23.so         [.] __memcpy_avx_unaligned
   1.55%  orc-scan  orc-scan             [.] 
snappy::SnappyArrayWriter::TryFastAppend
   1.44%  orc-scan  [kernel.kallsyms]    [k] copy_user_enhanced_fast_string
   1.30%  orc-scan  orc-scan             [.] orc::ByteRleDecoderImpl::next
   1.17%  orc-scan  orc-scan             [.] orc::RleDecoderV2::unrolledUnpack24
   1.09%  orc-scan  libc-2.23.so         [.] __memset_avx2
   1.02%  orc-scan  orc-scan             [.] snappy::SnappyArrayWriter::Produced
   0.93%  orc-scan  orc-scan             [.] orc::RleDecoderV2::readLongs
   0.81%  orc-scan  orc-scan             [.] orc::RleDecoderV2::unrolledUnpack16
   0.76%  orc-scan  orc-scan             [.] orc::RleDecoderV2::unrolledUnpack32
   0.64%  orc-scan  orc-scan             [.] snappy::LittleEndian::Load32
   0.37%  orc-scan  orc-scan             [.] snappy::LittleEndian::ToHost32
   0.27%  orc-scan  orc-scan             [.] orc::ColumnReader::next
{code}
The time on BooleanRleDecoderImpl::next() is dominanted by this loop:
https://github.com/apache/orc/blob/78849c99359cfb9f5c9da88b8e16d118154ed9da/c%2B%2B/src/ByteRLE.cc#L627-L631
{code:cpp}
for(int64_t i=static_cast<int64_t>(numValues) - 1;
    i >= static_cast<int64_t>(position); --i, --bitsLeft) {
  uint64_t shiftPosn = (-bitsLeft) % 8;
  data[i] = (data[position + (bitsLeft - 1) / 8] >> shiftPosn) & 0x1;
}
{code}
Disassembly of it:
{code:java}
orc::BooleanRleDecoderImpl::next  
/home/quanlong/workspace/orc/build/tools/src/orc-scan
Percent│       pop    %r14
...
  8.67 │240:┌─→mov    %r13,%rcx
  8.47 │    │  sub    $0x1,%r13
  8.06 │    │  mov    %r13,%rax
  7.26 │    │  neg    %rcx
  8.06 │    │  shr    $0x3,%rax
  6.85 │    │  and    $0x7,%ecx
  5.85 │    │  movsbl (%rsi,%rax,1),%eax
 19.96 │    │  sar    %cl,%eax
  8.06 │    │  and    $0x1,%eax
  7.66 │    │  mov    %al,(%r15,%r13,1)
       │    ├──cmp    %rbx,%r13
  9.68 │    └──jne    7d3240 <orc::BooleanRleDecoderImpl::next(char*, unsigned 
long, 240 
       │     ↑ jmpq   7d316f <orc::BooleanRleDecoderImpl::next(char*, unsigned 
long, 16f 
       │26b:   mov    %r14,%r13
       │       xor    %ebx,%ebx
       │     ↑ jmpq   7d31c2 <orc::BooleanRleDecoderImpl::next(char*, unsigned 
long, 1c2
{code}
We can unroll the loop to save some instructions. I have a WIP patch but I 
haven't optimized it further.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

[jira] [Created] (ORC-1142) [C++] Unroll loops in BooleanRleDecoderImpl::next()

Reply via email to