Quanlong Huang created ORC-1132:
-----------------------------------

             Summary: [C++] EncodedStringVectorBatch allocates unused buffers
                 Key: ORC-1132
                 URL: https://issues.apache.org/jira/browse/ORC-1132
             Project: ORC
          Issue Type: Improvement
    Affects Versions: 1.6.0
            Reporter: Quanlong Huang
            Assignee: Quanlong Huang

The constructor of EncodedStringVectorBatch invokes the constructor of StringVectorBatch with the batch capacity:
{code:cpp}
EncodedStringVectorBatch::EncodedStringVectorBatch(uint64_t _capacity, MemoryPool& pool)
    : StringVectorBatch(_capacity, pool),
      dictionary(),
      index(pool, _capacity) {
  // PASS
}
{code}
This allocates the `data` and `length` buffers in StringVectorBatch, which are never used for an encoded batch:
{code:cpp}
StringVectorBatch::StringVectorBatch(uint64_t _capacity, MemoryPool& pool)
    : ColumnVectorBatch(_capacity, pool),
      data(pool, _capacity),
      length(pool, _capacity),
      blob(pool) {
  // PASS
}
{code}
Only the `index` buffer and the `dictionary` of EncodedStringVectorBatch are actually used:
{code:cpp}
void StringDictionaryColumnReader::nextEncoded(ColumnVectorBatch& rowBatch,
                                               uint64_t numValues,
                                               char* notNull) {
  ColumnReader::next(rowBatch, numValues, notNull);
  notNull = rowBatch.hasNulls ? rowBatch.notNull.data() : nullptr;
  rowBatch.isEncoded = true;

  EncodedStringVectorBatch& batch = dynamic_cast<EncodedStringVectorBatch&>(rowBatch);
  batch.dictionary = this->dictionary;

  // Length buffer is reused to save dictionary entry ids
  rle->next(batch.index.data(), numValues, notNull);
}
{code}
Thus we should avoid allocating these buffers in the base class.
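
One possible shape for the change is sketched below. It is a self-contained illustration, not ORC code: `Buffer`, `ColumnBatch`, `StringBatch` and `EncodedStringBatch` are hypothetical stand-ins for `DataBuffer`, `ColumnVectorBatch`, `StringVectorBatch` and `EncodedStringVectorBatch`, and the memory pool is left out. The assumption is that the base class could expose a protected constructor that sizes `data` and `length` separately from the batch capacity, so the encoded subclass can request zero elements for them.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for orc::DataBuffer<T>: allocates `size` elements up front.
template <typename T>
class Buffer {
 public:
  explicit Buffer(uint64_t size = 0) : storage_(size) {}
  uint64_t size() const { return storage_.size(); }
  T* data() { return storage_.data(); }

 private:
  std::vector<T> storage_;
};

// Hypothetical stand-in for ColumnVectorBatch.
struct ColumnBatch {
  explicit ColumnBatch(uint64_t cap) : capacity(cap), notNull(cap) {}
  virtual ~ColumnBatch() = default;

  uint64_t capacity;
  Buffer<char> notNull;
};

// Hypothetical stand-in for StringVectorBatch.
struct StringBatch : ColumnBatch {
  // Public behaviour is unchanged: data and length match the batch capacity.
  explicit StringBatch(uint64_t cap) : StringBatch(cap, cap) {}

  Buffer<char*> data;
  Buffer<int64_t> length;

 protected:
  // Assumed extension point: subclasses choose how many elements the
  // data/length buffers get, independently of the batch capacity.
  StringBatch(uint64_t cap, uint64_t valueCapacity)
      : ColumnBatch(cap), data(valueCapacity), length(valueCapacity) {}
};

// Hypothetical stand-in for EncodedStringVectorBatch.
struct EncodedStringBatch : StringBatch {
  // Only the index buffer is sized to the capacity; data/length stay empty.
  explicit EncodedStringBatch(uint64_t cap) : StringBatch(cap, 0), index(cap) {}

  Buffer<int64_t> index;
};
```

With this layout a plain `StringBatch(1024)` still gets 1024-element `data` and `length` buffers, while `EncodedStringBatch(1024)` only allocates `notNull` and `index`. In the real classes any resize path that forwards to the base class would presumably need the same treatment, and code that reads `data` or `length` from an encoded batch would have to be checked first.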