Quanlong Huang created ORC-1142:
-----------------------------------
Summary: [C++] Unroll loops in BooleanRleDecoderImpl::next()
Key: ORC-1142
URL: https://issues.apache.org/jira/browse/ORC-1142
Project: ORC
Issue Type: Improvement
Components: C++
Reporter: Quanlong Huang
Attachments: bool-rle-opt.patch
BooleanRleDecoderImpl::next() takes some portion of the time when reading
nullable columns. Here is the histogram of running orc-scan on a tpcds
store_sales data file:
{code:java}
Samples: 7K of event 'cycles:ppp', Event count (approx.): 8731129738
Overhead Command Shared Object Symbol
31.71% orc-scan orc-scan [.] orc::Decimal64ColumnReader::next
7.44% orc-scan orc-scan [.]
orc::RleDecoderV2::copyDataFromBuffer
7.29% orc-scan orc-scan [.]
snappy::SnappyDecompressor::DecompressAllTags<snappy::SnappyArrayWriter>
6.19% orc-scan orc-scan [.] orc::BooleanRleDecoderImpl::next
5.54% orc-scan orc-scan [.] orc::RleDecoderV2::nextDirect
5.37% orc-scan orc-scan [.] orc::RleDecoderV2::nextShortRepeats
4.58% orc-scan orc-scan [.] snappy::(anonymous
namespace)::IncrementalCopy
3.66% orc-scan orc-scan [.] snappy::(anonymous
namespace)::UnalignedCopy64
3.43% orc-scan orc-scan [.] orc::RleDecoderV2::nextDelta
2.81% orc-scan orc-scan [.]
snappy::SnappyArrayWriter::AppendFromSelf
2.41% orc-scan orc-scan [.] orc::RleDecoderV2::next
2.22% orc-scan orc-scan [.] orc::RleDecoderV2::readByte
1.78% orc-scan orc-scan [.] snappy::(anonymous
namespace)::UnalignedCopy128
1.61% orc-scan libc-2.23.so [.] __memcpy_avx_unaligned
1.55% orc-scan orc-scan [.]
snappy::SnappyArrayWriter::TryFastAppend
1.44% orc-scan [kernel.kallsyms] [k] copy_user_enhanced_fast_string
1.30% orc-scan orc-scan [.] orc::ByteRleDecoderImpl::next
1.17% orc-scan orc-scan [.] orc::RleDecoderV2::unrolledUnpack24
1.09% orc-scan libc-2.23.so [.] __memset_avx2
1.02% orc-scan orc-scan [.] snappy::SnappyArrayWriter::Produced
0.93% orc-scan orc-scan [.] orc::RleDecoderV2::readLongs
0.81% orc-scan orc-scan [.] orc::RleDecoderV2::unrolledUnpack16
0.76% orc-scan orc-scan [.] orc::RleDecoderV2::unrolledUnpack32
0.64% orc-scan orc-scan [.] snappy::LittleEndian::Load32
0.37% orc-scan orc-scan [.] snappy::LittleEndian::ToHost32
0.27% orc-scan orc-scan [.] orc::ColumnReader::next
{code}
The time on BooleanRleDecoderImpl::next() is dominanted by this loop:
https://github.com/apache/orc/blob/78849c99359cfb9f5c9da88b8e16d118154ed9da/c%2B%2B/src/ByteRLE.cc#L627-L631
{code:cpp}
for(int64_t i=static_cast<int64_t>(numValues) - 1;
i >= static_cast<int64_t>(position); --i, --bitsLeft) {
uint64_t shiftPosn = (-bitsLeft) % 8;
data[i] = (data[position + (bitsLeft - 1) / 8] >> shiftPosn) & 0x1;
}
{code}
Disassembly of it:
{code:java}
orc::BooleanRleDecoderImpl::next
/home/quanlong/workspace/orc/build/tools/src/orc-scan
Percent│ pop %r14
...
8.67 │240:┌─→mov %r13,%rcx
8.47 │ │ sub $0x1,%r13
8.06 │ │ mov %r13,%rax
7.26 │ │ neg %rcx
8.06 │ │ shr $0x3,%rax
6.85 │ │ and $0x7,%ecx
5.85 │ │ movsbl (%rsi,%rax,1),%eax
19.96 │ │ sar %cl,%eax
8.06 │ │ and $0x1,%eax
7.66 │ │ mov %al,(%r15,%r13,1)
│ ├──cmp %rbx,%r13
9.68 │ └──jne 7d3240 <orc::BooleanRleDecoderImpl::next(char*, unsigned
long, 240
│ ↑ jmpq 7d316f <orc::BooleanRleDecoderImpl::next(char*, unsigned
long, 16f
│26b: mov %r14,%r13
│ xor %ebx,%ebx
│ ↑ jmpq 7d31c2 <orc::BooleanRleDecoderImpl::next(char*, unsigned
long, 1c2
{code}
We can unroll the loop to save some instructions. I have a WIP patch but I
haven't optimized it further.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)