[ 
https://issues.apache.org/jira/browse/ORC-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512377#comment-17512377
 ] 

Quanlong Huang commented on ORC-1131:
-------------------------------------

I was debugging a performance issue in transforming the ORC vector batch into 
Impala's (row-oriented) RowBatch. I suspected cache misses were the cause, so I 
used getMemoryUsage() to print how much of the cache a batch could occupy.

I'm not sure how other clients use getMemoryUsage(), but Impala doesn't use it 
yet, so I'm OK with not counting the dictionary. I think we should at least 
count 'blob' in StringVectorBatch, since it belongs to the batch.
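
To make the suggestion concrete, here is a minimal sketch of a getMemoryUsage() that also counts the blob. It does not use the ORC headers: BufferSketch, ColumnBatchSketch and StringBatchSketch are hypothetical stand-ins for DataBuffer, ColumnVectorBatch and StringVectorBatch, and the base class is assumed to hold only a not-null buffer. The real patch may look different.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for orc::DataBuffer<T>; it only tracks a capacity.
template <typename T>
struct BufferSketch {
  explicit BufferSketch(uint64_t cap = 0) : cap_(cap) {}
  uint64_t capacity() const { return cap_; }
  // Grow the recorded capacity; never shrinks.
  void reserve(uint64_t cap) {
    if (cap > cap_) cap_ = cap;
  }

 private:
  uint64_t cap_;
};

// Simplified base batch (assumption: only the not-null buffer is modelled).
struct ColumnBatchSketch {
  explicit ColumnBatchSketch(uint64_t cap) : notNull(cap) {}
  virtual ~ColumnBatchSketch() = default;
  virtual uint64_t getMemoryUsage() const {
    return notNull.capacity() * sizeof(char);
  }
  BufferSketch<char> notNull;
};

// Stand-in for StringVectorBatch, which owns the blob after ORC-501.
struct StringBatchSketch : ColumnBatchSketch {
  explicit StringBatchSketch(uint64_t cap)
      : ColumnBatchSketch(cap), data(cap), length(cap), blob(0) {}

  uint64_t getMemoryUsage() const override {
    return ColumnBatchSketch::getMemoryUsage()
        + data.capacity() * sizeof(char*)
        + length.capacity() * sizeof(int64_t)
        + blob.capacity() * sizeof(char);  // the term suggested above
  }

  BufferSketch<char*> data;      // pointers into the blob
  BufferSketch<int64_t> length;  // length of each string
  BufferSketch<char> blob;       // raw string bytes owned by the batch
};
```

With this shape, growing the blob (for example batch.blob.reserve(4096)) raises the reported usage by the same number of bytes, which is what a cache-footprint estimate needs.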

> [C++] getMemoryUsage() is incorrect on string vector batches 
> -------------------------------------------------------------
>
>                 Key: ORC-1131
>                 URL: https://issues.apache.org/jira/browse/ORC-1131
>             Project: ORC
>          Issue Type: Bug
>    Affects Versions: 1.6.0
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Major
>
> The C++ client produces two kinds of string vector batches, i.e. 
> StringVectorBatch and EncodedStringVectorBatch. Both currently return 
> incorrect results from getMemoryUsage().
> After ORC-501, we moved the blob from StringColumnReader to StringVectorBatch. 
> However, StringVectorBatch::getMemoryUsage() was not updated to account for it.
> {code:cpp}
> uint64_t StringVectorBatch::getMemoryUsage() {
>   return ColumnVectorBatch::getMemoryUsage()
>         + static_cast<uint64_t>(data.capacity() * sizeof(char*)
>         + length.capacity() * sizeof(int64_t));
> } {code}
> EncodedStringVectorBatch inherits from StringVectorBatch but doesn't 
> override getMemoryUsage(), so its index buffer is not counted and the 
> result is wrong as well.
> {code:cpp}
> struct EncodedStringVectorBatch : public StringVectorBatch { 
>   EncodedStringVectorBatch(uint64_t capacity, MemoryPool& pool);
>   virtual ~EncodedStringVectorBatch();
>   std::string toString() const;
>   void resize(uint64_t capacity);
>   std::shared_ptr<StringDictionary> dictionary;
>   // index for dictionary entry
>   DataBuffer<int64_t> index;
> };{code}
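
For the encoded case described in the quoted issue, an override that adds the index buffer on top of the parent's result could look roughly like the sketch below. PlainStringBatch, EncodedStringBatch and DictionarySketch are hypothetical stand-ins built on std::vector, not the ORC classes, and the shared dictionary is left out of the count on purpose, in line with the comment above.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for orc::StringDictionary (shared between batches).
struct DictionarySketch {
  std::vector<char> dictionaryBlob;
  std::vector<int64_t> dictionaryOffset;
};

// Stand-in for StringVectorBatch, with the blob already counted.
struct PlainStringBatch {
  explicit PlainStringBatch(uint64_t cap) {
    data.reserve(cap);
    length.reserve(cap);
  }
  virtual ~PlainStringBatch() = default;

  virtual uint64_t getMemoryUsage() const {
    return static_cast<uint64_t>(data.capacity() * sizeof(char*)
        + length.capacity() * sizeof(int64_t)
        + blob.capacity() * sizeof(char));
  }

  std::vector<char*> data;
  std::vector<int64_t> length;
  std::vector<char> blob;
};

// Stand-in for EncodedStringVectorBatch with the missing override added.
struct EncodedStringBatch : PlainStringBatch {
  explicit EncodedStringBatch(uint64_t cap) : PlainStringBatch(cap) {
    index.reserve(cap);
  }

  uint64_t getMemoryUsage() const override {
    // Parent buffers plus the index buffer; the dictionary is shared
    // and deliberately not counted in this sketch.
    return PlainStringBatch::getMemoryUsage()
        + static_cast<uint64_t>(index.capacity() * sizeof(int64_t));
  }

  std::shared_ptr<DictionarySketch> dictionary;
  std::vector<int64_t> index;  // index of the dictionary entry per row
};
```

Because getMemoryUsage() is virtual, a caller holding the batch through a base reference picks up the index term without any change on its side.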



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
