[
https://issues.apache.org/jira/browse/ORC-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510550#comment-17510550
]
Gang Wu commented on ORC-1131:
------------------------------
[~stigahuang] Thanks for reporting this!
It depends on the purpose.
* If getMemoryUsage() is meant to indicate the total raw size of the data held
by the vector, then it should reflect the memory of all rows regardless of
dictionary encoding.
* If getMemoryUsage() is the actual memory usage of the vector, then your
description above makes sense.
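To make the two interpretations concrete, here is a minimal sketch. It uses a
toy struct, ToyEncodedBatch, and two helper functions (rawDataSize and
actualMemoryUsage) that are made up for illustration; none of these names
exist in the ORC C++ library, and the real batch classes use DataBuffer rather
than std::vector.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a dictionary-encoded string batch.
// This is NOT orc::EncodedStringVectorBatch, only an illustration.
struct ToyEncodedBatch {
  std::vector<std::string> dictionary;  // distinct values, stored once
  std::vector<int64_t> index;           // one dictionary index per row
};

// Interpretation 1: raw size of the data the batch represents, counted as if
// every row carried its own copy of its string.
uint64_t rawDataSize(const ToyEncodedBatch& batch) {
  uint64_t total = 0;
  for (int64_t i : batch.index) {
    total += batch.dictionary[static_cast<std::size_t>(i)].size();
  }
  return total;
}

// Interpretation 2: bytes actually held by the batch, i.e. each dictionary
// entry counted once plus the per-row index array.
uint64_t actualMemoryUsage(const ToyEncodedBatch& batch) {
  uint64_t total = batch.index.size() * sizeof(int64_t);
  for (const std::string& entry : batch.dictionary) {
    total += entry.size();
  }
  return total;
}
```

The two numbers diverge as soon as a dictionary entry is referenced by more
than one row, which is why the intended meaning has to be decided first.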
I prefer the first approach: it is simpler, and it makes it easier for the
caller to decide the next batch size to read from the ORC reader. If a user
really needs the second use case above, I think extending orc::MemoryPool to
report actual memory usage is a better idea.
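As a rough sketch of what a pool that reports actual usage could look like,
assuming a pool interface with a malloc/free pair: ToyPool, TrackingPool and
bytesInUse() below are hypothetical names for illustration only, not the
existing orc::MemoryPool API, and the sketch is not thread-safe.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <unordered_map>

// Hypothetical minimal pool interface with a malloc/free pair.
// This is NOT the real orc::MemoryPool declaration.
class ToyPool {
 public:
  virtual ~ToyPool() = default;
  virtual char* malloc(uint64_t size) = 0;
  virtual void free(char* p) = 0;
};

// Pool that remembers the size of every live allocation so that it can
// report how many bytes are currently in use.
class TrackingPool : public ToyPool {
 public:
  char* malloc(uint64_t size) override {
    char* p = static_cast<char*>(std::malloc(static_cast<std::size_t>(size)));
    if (p != nullptr) {
      sizes_[p] = size;
      inUse_ += size;
    }
    return p;
  }

  void free(char* p) override {
    auto it = sizes_.find(p);
    if (it != sizes_.end()) {
      inUse_ -= it->second;
      sizes_.erase(it);
    }
    std::free(p);
  }

  // Bytes handed out by this pool and not yet freed.
  uint64_t bytesInUse() const { return inUse_; }

 private:
  std::unordered_map<char*, uint64_t> sizes_;
  uint64_t inUse_ = 0;
};
```

With something along these lines, a caller interested in real memory
consumption could query the pool, while getMemoryUsage() keeps the simpler
raw-size meaning.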
What do you think?
> [C++] getMemoryUsage() is incorrect on string vector batches
> -------------------------------------------------------------
>
> Key: ORC-1131
> URL: https://issues.apache.org/jira/browse/ORC-1131
> Project: ORC
> Issue Type: Bug
> Affects Versions: 1.6.0
> Reporter: Quanlong Huang
> Assignee: Quanlong Huang
> Priority: Major
>
> The C++ client produces two kinds of string vector batches, i.e.
> StringVectorBatch and EncodedStringVectorBatch. They both have incorrect
> results in getMemoryUsage() currently.
> After ORC-501, we moved the blob from StringColumnReader to StringVectorBatch.
> However, StringVectorBatch::getMemoryUsage() was not updated to account for it.
> {code:cpp}
> uint64_t StringVectorBatch::getMemoryUsage() {
>   return ColumnVectorBatch::getMemoryUsage()
>       + static_cast<uint64_t>(data.capacity() * sizeof(char*)
>       + length.capacity() * sizeof(int64_t));
> } {code}
> EncodedStringVectorBatch inherits from StringVectorBatch but doesn't
> override the getMemoryUsage() method, so it reports wrong results as well.
> {code:cpp}
> struct EncodedStringVectorBatch : public StringVectorBatch {
>   EncodedStringVectorBatch(uint64_t capacity, MemoryPool& pool);
>   virtual ~EncodedStringVectorBatch();
>   std::string toString() const;
>   void resize(uint64_t capacity);
>   std::shared_ptr<StringDictionary> dictionary;
>   // index for dictionary entry
>   DataBuffer<int64_t> index;
> };{code}