[
https://issues.apache.org/jira/browse/ORC-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Quanlong Huang updated ORC-1132:
--------------------------------
Description:
The constructor of EncodedStringVectorBatch invokes the constructor of
StringVectorBatch with batch capacity:
{code:cpp}
EncodedStringVectorBatch::EncodedStringVectorBatch(uint64_t _capacity,
                                                   MemoryPool& pool)
    : StringVectorBatch(_capacity, pool),
      dictionary(),
      index(pool, _capacity) {
  // PASS
}
{code}
This allocates the `data` and `length` buffers of StringVectorBatch, which may
go unused:
{code:cpp}
StringVectorBatch::StringVectorBatch(uint64_t _capacity, MemoryPool& pool)
    : ColumnVectorBatch(_capacity, pool),
      data(pool, _capacity),
      length(pool, _capacity),
      blob(pool) {
  // PASS
}
{code}
A batch uses either the `index` buffer and `dictionary` of
EncodedStringVectorBatch (when the column is dictionary-encoded) or the `data`
and `length` buffers of the base class (when the column is direct-encoded).
Allocating all of these buffers up front wastes memory.
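One possible way to avoid the up-front allocation is to construct every buffer
with zero capacity and grow only the ones the encoding needs. The sketch below
illustrates that idea with simplified stand-in types. `CountingPool`, `Buffer`,
`StringBatch`, `EncodedStringBatch`, `prepareDictionary` and `prepareDirect` are
all hypothetical names for illustration, not the ORC classes or the actual fix:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a memory pool; it only counts requested bytes.
struct CountingPool {
  std::size_t bytesAllocated = 0;
};

// Hypothetical stand-in for a growable buffer backed by the pool.
template <typename T>
class Buffer {
 public:
  Buffer(CountingPool& pool, uint64_t capacity) : pool_(pool) {
    resize(capacity);
  }
  // Grows the buffer and records the extra bytes; never shrinks.
  void resize(uint64_t newSize) {
    if (newSize > storage_.size()) {
      pool_.bytesAllocated += (newSize - storage_.size()) * sizeof(T);
      storage_.resize(newSize);
    }
  }
  uint64_t size() const { return storage_.size(); }
  T* data() { return storage_.data(); }

 private:
  CountingPool& pool_;
  std::vector<T> storage_;
};

// Simplified base batch: allocates `data` and `length` for `capacity` rows.
struct StringBatch {
  StringBatch(uint64_t capacity, CountingPool& pool)
      : data(pool, capacity), length(pool, capacity) {}
  virtual ~StringBatch() = default;
  Buffer<char*> data;
  Buffer<int64_t> length;
};

// Simplified encoded batch: starts empty and grows buffers on demand.
struct EncodedStringBatch : StringBatch {
  EncodedStringBatch(uint64_t capacity, CountingPool& pool)
      : StringBatch(0, pool), index(pool, 0), capacity_(capacity) {}
  // Dictionary-encoded columns need only the `index` buffer.
  void prepareDictionary() { index.resize(capacity_); }
  // Direct-encoded columns need only `data` and `length`.
  void prepareDirect() {
    data.resize(capacity_);
    length.resize(capacity_);
  }
  Buffer<int64_t> index;

 private:
  uint64_t capacity_;
};

// Usage: bytes requested from the pool for a dictionary-encoded read.
inline std::size_t bytesForDictionaryRead(uint64_t capacity) {
  CountingPool pool;
  EncodedStringBatch batch(capacity, pool);
  assert(pool.bytesAllocated == 0);  // nothing is allocated up front
  batch.prepareDictionary();
  return pool.bytesAllocated;
}
```

With this layout a dictionary-encoded read pays only for `index`, and a
direct-encoded read pays only for `data` and `length`. The trade-off is that
each reader has to size the buffers it needs before writing into them.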
was:
The constructor of EncodedStringVectorBatch invokes the constructor of
StringVectorBatch with batch capacity:
{code:cpp}
EncodedStringVectorBatch::EncodedStringVectorBatch(uint64_t _capacity,
                                                   MemoryPool& pool)
    : StringVectorBatch(_capacity, pool),
      dictionary(),
      index(pool, _capacity) {
  // PASS
}
{code}
This allocates unused `data` and `length` buffers in StringVectorBatch:
{code:cpp}
StringVectorBatch::StringVectorBatch(uint64_t _capacity, MemoryPool& pool)
    : ColumnVectorBatch(_capacity, pool),
      data(pool, _capacity),
      length(pool, _capacity),
      blob(pool) {
  // PASS
}
{code}
We only use the `index` buffer and `dictionary` of EncodedStringVectorBatch:
{code:cpp}
void StringDictionaryColumnReader::nextEncoded(ColumnVectorBatch& rowBatch,
                                               uint64_t numValues,
                                               char* notNull) {
  ColumnReader::next(rowBatch, numValues, notNull);
  notNull = rowBatch.hasNulls ? rowBatch.notNull.data() : nullptr;
  rowBatch.isEncoded = true;
  EncodedStringVectorBatch& batch =
      dynamic_cast<EncodedStringVectorBatch&>(rowBatch);
  batch.dictionary = this->dictionary;
  // Length buffer is reused to save dictionary entry ids
  rle->next(batch.index.data(), numValues, notNull);
}
{code}
Thus we should avoid allocating buffers in the base class.
> [C++] EncodedStringVectorBatch allocates unused buffers
> -------------------------------------------------------
>
> Key: ORC-1132
> URL: https://issues.apache.org/jira/browse/ORC-1132
> Project: ORC
> Issue Type: Improvement
> Affects Versions: 1.6.0
> Reporter: Quanlong Huang
> Assignee: Quanlong Huang
> Priority: Major