Quanlong Huang created IMPALA-11204:
---------------------------------------
Summary: OrcStringColumnReader should be a template class
Key: IMPALA-11204
URL: https://issues.apache.org/jira/browse/IMPALA-11204
Project: IMPALA
Issue Type: Improvement
Components: Backend
Reporter: Quanlong Huang
Assignee: Quanlong Huang
There are some duplicated checks in OrcStringColumnReader::ReadValue() which is
executed for each row. They are static and should only be performed once per
ORC stripe.
{code:cpp}
Status OrcStringColumnReader::ReadValue(int row_idx, Tuple* tuple, MemPool*
pool) {
if (IsNull(DCHECK_NOTNULL(batch_), row_idx)) {
SetNullSlot(tuple);
return Status::OK();
}
char* src_ptr;
int src_len;
if (batch_->isEncoded) {
orc::EncodedStringVectorBatch* currentBatch =
static_cast<orc::EncodedStringVectorBatch*>(batch_);
orc::DataBuffer<int64_t>& offsets =
currentBatch->dictionary->dictionaryOffset;
int64_t index = currentBatch->index[row_idx];
if (UNLIKELY(index < 0 || static_cast<uint64_t>(index) + 1 >=
offsets.size())) {
return Status(Substitute("Corrupt ORC file: $0. Index ($1) out of range
[0, $2) in "
"StringDictionary.", scanner_->filename(), index, offsets.size()));;
}
src_ptr = blob_ + offsets[index];
src_len = offsets[index + 1] - offsets[index];
} else {
// The pointed data is now in blob_, a buffer handled by Impala.
src_ptr = blob_ + (batch_->data[row_idx] - batch_->blob.data());
src_len = batch_->length[row_idx];
}
int dst_len = slot_desc_->type().len;
if (slot_desc_->type().type == TYPE_CHAR) {
int unpadded_len = min(dst_len, static_cast<int>(src_len));
char* dst_char = reinterpret_cast<char*>(GetSlot(tuple));
memcpy(dst_char, src_ptr, unpadded_len);
StringValue::PadWithSpaces(dst_char, dst_len, unpadded_len);
return Status::OK();
}
StringValue* dst = reinterpret_cast<StringValue*>(GetSlot(tuple));
if (slot_desc_->type().type == TYPE_VARCHAR && src_len > dst_len) {
dst->len = dst_len;
} else {
dst->len = src_len;
}
dst->ptr = src_ptr;
return Status::OK();
}
{code}
This method is executed for each row. The complexity of it causes the compiler
unable to inline it. I can see this method takes some portion of the time in
TPC-H queries (e.g. 23% in Q12).
Actually, the slot type is determined when we are creating the reader. Whether
the orc vector batch is encoded is determined when we start reading a new
stripe. If the string column is in dictionary encodings in the stripe and we
set EnableLazyDecoding to true in RowReaderOptions, the orc vector batch will
be encoded.
[https://github.com/apache/orc/blob/21259815ae665f6a4beac29edb4fab52c2403e13/c%2B%2B/src/Reader.cc#L1205]
[https://github.com/apache/orc/blob/21259815ae665f6a4beac29edb4fab52c2403e13/c%2B%2B/src/ColumnReader.cc#L1843-L1847]
We can switch to a template implementation so the compile can remove these
checks and inline the method.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)