[
https://issues.apache.org/jira/browse/ORC-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510353#comment-17510353
]
Quanlong Huang commented on ORC-1131:
-------------------------------------
Direct encoded string columns are read by StringDirectColumnReader, while
dictionary encoded string columns are read by StringDictionaryColumnReader.
RowReader::createRowBatch() creates an EncodedStringVectorBatch for a string
column if the reader options has set EnableLazyDecoding to true.
There are actually 4 cases depending on the column encoding and the batch type
(i.e. encoded or not).
* direct encoding read with StringVectorBatch
** blob should be counted.
* direct encoding read with EncodedStringVectorBatch
** EncodedStringVectorBatch is used as a StringVectorBatch (isEncoded=false).
Same as above.
* dictionary encoding read with StringVectorBatch
** Strings reference the StringDictionary of the reader. blob is empty. I
think we don't need to count the StringDictionary.
* dictionary encoding read with EncodedStringVectorBatch
** StringDictionary shared with the reader. I'm not sure if we can skip it as
well. The ideal solution would be counting it if and only if the dictionary is
only held by the batch (e.g. when the reader is destroyed and the client is
still using the batch). But this complicates the implementation. [~wgtmac],
[~Yurui Zhou] What do you think?
> [C++] getMemoryUsage() is incorrect on string vector batches
> -------------------------------------------------------------
>
> Key: ORC-1131
> URL: https://issues.apache.org/jira/browse/ORC-1131
> Project: ORC
> Issue Type: Bug
> Affects Versions: 1.6.0
> Reporter: Quanlong Huang
> Assignee: Quanlong Huang
> Priority: Major
>
> The C++ client produces two kinds of string vector batches, i.e.
> StringVectorBatch and EncodedStringVectorBatch. They both have incorrect
> results in getMemoryUsage() currently.
> After ORC-501, we move the blob from StringColumnReader to StringVectorBatch.
> However, StringVectorBatch::getMemoryUsage() was not updated to count for it.
> {code:cpp}
> uint64_t StringVectorBatch::getMemoryUsage() {
> return ColumnVectorBatch::getMemoryUsage()
> + static_cast<uint64_t>(data.capacity() * sizeof(char*)
> + length.capacity() * sizeof(int64_t));
> } {code}
> For EncodedStringVectorBatch, it inherits StringVectorBatch but doesn't
> override the getMemoryUsage() method. Thus counting for wrong results.
> {code:cpp}
> struct EncodedStringVectorBatch : public StringVectorBatch {
> EncodedStringVectorBatch(uint64_t capacity, MemoryPool& pool);
> virtual ~EncodedStringVectorBatch();
> std::string toString() const;
> void resize(uint64_t capacity);
> std::shared_ptr<StringDictionary> dictionary;
> // index for dictionary entry
> DataBuffer<int64_t> index;
> };{code}
--
This message was sent by Atlassian Jira
(v8.20.1#820001)