[ 
https://issues.apache.org/jira/browse/ORC-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated ORC-1132:
--------------------------------
    Description: 
The constructor of EncodedStringVectorBatch invokes the constructor of
StringVectorBatch with the batch capacity:
{code:cpp}
  EncodedStringVectorBatch::EncodedStringVectorBatch(uint64_t _capacity,
                                                     MemoryPool& pool)
                      : StringVectorBatch(_capacity, pool),
                        dictionary(),
                        index(pool, _capacity) {
    // PASS
  }
 {code}
This allocates unused `data` and `length` buffers in StringVectorBatch:
{code:cpp}
  StringVectorBatch::StringVectorBatch(uint64_t _capacity, MemoryPool& pool
               ): ColumnVectorBatch(_capacity, pool),
                  data(pool, _capacity),
                  length(pool, _capacity),
                  blob(pool) {
    // PASS
  }
{code}
A batch uses either the `index` buffer and `dictionary` of
EncodedStringVectorBatch (when the column is dictionary-encoded) or the `data`
and `length` buffers of the base class (when the column is direct-encoded),
never both. Allocating all of them up front wastes memory.
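
To put a rough number on the waste, here is a minimal sketch of the per-batch footprint. It assumes `data` holds one `char*` per row and that `length` and `index` each hold one `int64_t` per row. Those element types, and the names `BatchFootprint` and `footprintFor`, are assumptions made for illustration and are not stated in this issue.

```cpp
#include <cstdint>

// Hypothetical helper: models the three per-row buffers named in this issue.
// Element types are ASSUMED (char* for data, int64_t for length and index).
struct BatchFootprint {
  uint64_t dataBytes;    // base-class `data` buffer
  uint64_t lengthBytes;  // base-class `length` buffer
  uint64_t indexBytes;   // derived-class `index` buffer

  // Everything the two constructors allocate up front.
  uint64_t allocated() const { return dataBytes + lengthBytes + indexBytes; }

  // Bytes actually needed for the encoding in use.
  uint64_t used(bool dictionaryEncoded) const {
    return dictionaryEncoded ? indexBytes : dataBytes + lengthBytes;
  }

  // Bytes allocated but never touched for the encoding in use.
  uint64_t wasted(bool dictionaryEncoded) const {
    return allocated() - used(dictionaryEncoded);
  }
};

// Computes the footprint for a batch with the given row capacity.
inline BatchFootprint footprintFor(uint64_t capacity) {
  return BatchFootprint{capacity * sizeof(char*),
                        capacity * sizeof(int64_t),
                        capacity * sizeof(int64_t)};
}
```

Under these assumptions, on a platform where pointers are 8 bytes, `footprintFor(1024)` reports 24 KiB allocated, of which a dictionary-encoded column uses only the 8 KiB `index` buffer.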

  was:
The constructor of EncodedStringVectorBatch invokes the constructor of
StringVectorBatch with the batch capacity:
{code:cpp}
  EncodedStringVectorBatch::EncodedStringVectorBatch(uint64_t _capacity,
                                                     MemoryPool& pool)
                      : StringVectorBatch(_capacity, pool),
                        dictionary(),
                        index(pool, _capacity) {
    // PASS
  }
 {code}
This allocates unused `data` and `length` buffers in StringVectorBatch:
{code:cpp}
  StringVectorBatch::StringVectorBatch(uint64_t _capacity, MemoryPool& pool
               ): ColumnVectorBatch(_capacity, pool),
                  data(pool, _capacity),
                  length(pool, _capacity),
                  blob(pool) {
    // PASS
  }
{code}
We only use the `index` buffer and `dictionary` of EncodedStringVectorBatch:
{code:cpp}
  void StringDictionaryColumnReader::nextEncoded(ColumnVectorBatch& rowBatch,
                                                  uint64_t numValues,
                                                  char* notNull) {
    ColumnReader::next(rowBatch, numValues, notNull);
    notNull = rowBatch.hasNulls ? rowBatch.notNull.data() : nullptr;
    rowBatch.isEncoded = true;

    EncodedStringVectorBatch& batch =
        dynamic_cast<EncodedStringVectorBatch&>(rowBatch);
    batch.dictionary = this->dictionary;

    // Length buffer is reused to save dictionary entry ids
    rle->next(batch.index.data(), numValues, notNull);
  }
{code}
Thus we should avoid allocating buffers in the base class.
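
One possible way to avoid the up-front allocation is to defer it until a buffer is first used. The following is a minimal sketch of that lazy-allocation idea, not necessarily the change adopted in ORC. `LazyBuffer`, `SketchEncodedStringBatch`, and `fillDictionaryIds` are hypothetical names, and `std::vector` stands in for the pool-backed buffer type.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a pooled buffer. This is NOT the ORC buffer API.
template <typename T>
class LazyBuffer {
 public:
  // Construction allocates no elements.
  LazyBuffer() = default;

  // Grows the buffer to at least `capacity` elements on first use.
  T* ensure(std::size_t capacity) {
    if (storage_.size() < capacity) {
      storage_.resize(capacity);
    }
    return storage_.data();
  }

  std::size_t allocatedElements() const { return storage_.size(); }

 private:
  std::vector<T> storage_;
};

// Hypothetical batch that mirrors the field names used in this issue.
struct SketchEncodedStringBatch {
  explicit SketchEncodedStringBatch(std::size_t cap) : capacity(cap) {}

  std::size_t capacity;
  LazyBuffer<char*> data;      // only needed for direct encoding
  LazyBuffer<int64_t> length;  // only needed for direct encoding
  LazyBuffer<int64_t> index;   // only needed for dictionary encoding
};

// Dictionary path: touches `index` only, so `data` and `length` stay empty.
// Placeholder ids are written here; a real reader would decode them.
inline void fillDictionaryIds(SketchEncodedStringBatch& batch,
                              std::size_t numValues) {
  int64_t* ids = batch.index.ensure(batch.capacity);
  for (std::size_t i = 0; i < numValues && i < batch.capacity; ++i) {
    ids[i] = static_cast<int64_t>(i);
  }
}
```

With this layout, a batch that only ever goes through the dictionary path never allocates `data` or `length`, at the cost of a size check on each fill.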


> [C++] EncodedStringVectorBatch allocates unused buffers
> -------------------------------------------------------
>
>                 Key: ORC-1132
>                 URL: https://issues.apache.org/jira/browse/ORC-1132
>             Project: ORC
>          Issue Type: Improvement
>    Affects Versions: 1.6.0
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Major
>
> The constructor of EncodedStringVectorBatch invokes the constructor of
> StringVectorBatch with the batch capacity:
> {code:cpp}
>   EncodedStringVectorBatch::EncodedStringVectorBatch(uint64_t _capacity,
>                                                      MemoryPool& pool)
>                       : StringVectorBatch(_capacity, pool),
>                         dictionary(),
>                         index(pool, _capacity) {
>     // PASS
>   }
>  {code}
> This allocates unused `data` and `length` buffers in StringVectorBatch:
> {code:cpp}
>   StringVectorBatch::StringVectorBatch(uint64_t _capacity, MemoryPool& pool
>                ): ColumnVectorBatch(_capacity, pool),
>                   data(pool, _capacity),
>                   length(pool, _capacity),
>                   blob(pool) {
>     // PASS
>   }
> {code}
> A batch uses either the `index` buffer and `dictionary` of
> EncodedStringVectorBatch (when the column is dictionary-encoded) or the `data`
> and `length` buffers of the base class (when the column is direct-encoded),
> never both. Allocating all of them up front wastes memory.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
