[ https://issues.apache.org/jira/browse/HIVE-6518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915238#comment-13915238 ]

Gopal V commented on HIVE-6518:
-------------------------------

Yes, also the ORC scenario is more complex for strings in dictionaries.

Taking a substring does not release the rest of the string's memory: in vectorized mode only the start:len pair is modified and no new allocation is made.

So a GROUP BY on SUBSTR() keeps the entire string in memory, but the VGBY does not know that it does.

> Add a GC canary to the VectorGroupByOperator to flush whenever a GC is triggered
> --------------------------------------------------------------------------------
>
>                 Key: HIVE-6518
>                 URL: https://issues.apache.org/jira/browse/HIVE-6518
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.13.0
>            Reporter: Gopal V
>            Assignee: Gopal V
>            Priority: Minor
>         Attachments: HIVE-6518.1-tez.patch
>
>
> The current VectorGroupByOperator implementation flushes the in-memory hashes
> when the maximum entries or fraction of memory is hit.
> This works for most cases, but there are some corner cases where we hit GC
> overhead limits or heap size limits before either of those conditions is
> reached, due to the rest of the pipeline.
> This patch adds a SoftReference as a GC canary. If the soft reference is
> dead, then a full GC pass happened sometime in the near past and the
> aggregation hashtables should be flushed immediately, before another full GC
> is triggered.
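
The canary idea described in the issue can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not the code from HIVE-6518.1-tez.patch: the class names GcCanarySketch, GcCanary and CountingAggregator, the simulateCollection() test hook, and the maxEntries value of 1000 are all hypothetical. Only java.lang.ref.SoftReference and its get() and clear() methods are real JDK API.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration; not the code from the attached patch. */
final class GcCanarySketch {

    /** Holds a softly reachable object that the JVM may clear under memory pressure. */
    static final class GcCanary {
        private SoftReference<Object> ref = new SoftReference<>(new Object());

        /** True once the soft reference has been cleared. */
        boolean isDead() {
            return ref.get() == null;
        }

        /** Re-arm the canary, for example after a flush. */
        void reset() {
            ref = new SoftReference<>(new Object());
        }

        /** Test hook only (assumption): clears the reference as the collector would. */
        void simulateCollection() {
            ref.clear();
        }
    }

    /** Toy stand-in for an in-memory aggregation hash table. */
    static final class CountingAggregator {
        private final Map<String, Long> counts = new HashMap<>();
        private final GcCanary canary = new GcCanary();
        private final int maxEntries;
        int flushes = 0;

        CountingAggregator(int maxEntries) {
            this.maxEntries = maxEntries;
        }

        void add(String key) {
            counts.merge(key, 1L, Long::sum);
            // Flush on the usual size limit, or as soon as the canary has died.
            if (counts.size() >= maxEntries || canary.isDead()) {
                flush();
            }
        }

        void flush() {
            // A real operator would forward partial aggregates downstream here.
            counts.clear();
            canary.reset();
            flushes++;
        }

        int size() {
            return counts.size();
        }

        GcCanary canary() {
            return canary;
        }
    }

    public static void main(String[] args) {
        CountingAggregator agg = new CountingAggregator(1000);
        agg.add("a");
        agg.add("b");
        agg.canary().simulateCollection(); // pretend the collector cleared the canary
        agg.add("c");                      // flushes well before maxEntries is reached
        System.out.println("flushes=" + agg.flushes + " size=" + agg.size());
    }
}
```

Because the JVM decides when soft references are cleared, the canary is a heuristic: a cleared reference suggests recent memory pressure, but a live one does not guarantee that there was none.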