[jira] [Work logged] (HIVE-23843) Improve key evictions in VectorGroupByOperator

ASF GitHub Bot (Jira) Tue, 14 Jul 2020 06:26:10 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-23843?focusedWorklogId=458606&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-458606
 ]


ASF GitHub Bot logged work on HIVE-23843:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 14/Jul/20 13:25
            Start Date: 14/Jul/20 13:25
    Worklog Time Spent: 10m 
      Work Description: kgyrtkirk commented on a change in pull request #1250:
URL: https://github.com/apache/hive/pull/1250#discussion_r454349711



##########
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorGroupByOperator.java
##########
@@ -358,6 +360,34 @@ public void close(boolean aborted) throws HiveException {
      */
     private long numRowsCompareHashAggr;
 
+    /**
+     * To track current memory usage.
+     */
+    private long currMemUsed;
+
+    /**
+     * Whether to make use of LRUCache for map aggr buffers or not.
+     */
+    private boolean lruCache;
+
+    class LRUCache extends LinkedHashMap<KeyWrapper, VectorAggregationBufferRow> {
+
+      @Override
+      protected boolean removeEldestEntry(Map.Entry<KeyWrapper, VectorAggregationBufferRow> eldest) {
+        if (currMemUsed > maxHashTblMemory || size() > maxHtEntries || gcCanary.get() == null) {

Review comment:
       This method seems to have been polluted by the "isFull" logic, which is
unexpected given the method name. The "isFull" check should be moved outside,
and remove should only be called when the condition is met.
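
One way to read this suggestion is sketched below. It is a minimal, self-contained illustration and not the code from the pull request: `BoundedAggrTable`, `maxEntries`, and `evictEldest` are hypothetical names, and the real operator would presumably flush an evicted row downstream rather than discard it. The capacity check lives in `isFull()`, and eviction is only invoked once that check has passed.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the aggregation hash table; not Hive code. */
class BoundedAggrTable<K, V> {
  private final int maxEntries;
  // accessOrder = true: iteration runs from least to most recently accessed.
  private final LinkedHashMap<K, V> buffers = new LinkedHashMap<>(16, 0.75f, true);

  BoundedAggrTable(int maxEntries) {
    this.maxEntries = maxEntries;
  }

  /** The capacity check lives in one place, outside the map itself. */
  boolean isFull() {
    return buffers.size() > maxEntries;
  }

  /** A lookup counts as an access and moves the key to the "recent" end. */
  V get(K key) {
    return buffers.get(key);
  }

  void put(K key, V value) {
    buffers.put(key, value);
    // Eviction is requested only after the condition has been met.
    while (isFull() && !buffers.isEmpty()) {
      evictEldest();
    }
  }

  /** Removes the least recently used entry (assumed: real code would flush it). */
  private void evictEldest() {
    Iterator<Map.Entry<K, V>> it = buffers.entrySet().iterator();
    it.next();
    it.remove();
  }

  int size() {
    return buffers.size();
  }

  /** containsKey does not count as an access in LinkedHashMap. */
  boolean containsKey(K key) {
    return buffers.containsKey(key);
  }
}
```

With this shape, `removeEldestEntry` does not need to be overridden at all, so the method name no longer hides a memory or GC check.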

##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -4065,6 +4065,9 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
     HIVE_VECTORIZATION_GROUPBY_MAXENTRIES("hive.vectorized.groupby.maxentries", 1000000,
         "Max number of entries in the vector group by aggregation hashtables. \n" +
         "Exceeding this will trigger a flush irrelevant of memory pressure condition."),
+    HIVE_VECTORIZATION_GROUPBY_ENABLE_LRU_FOR_AGGR(

Review comment:
       Instead of introducing a boolean toggle, add a mode switch
(default/lru/etc.).
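
A mode switch of this kind could look roughly like the sketch below. The enum name `AggrEvictionMode`, its constants, and the `fromConf` helper are assumptions made for illustration; the review comment names only "default" and "lru" as example values, and nothing here reflects an actual HiveConf setting.

```java
import java.util.Locale;

/** Hypothetical eviction modes; only "default" and "lru" appear in the review. */
enum AggrEvictionMode {
  DEFAULT,
  LRU;

  /**
   * Parses a configuration string case-insensitively.
   * A missing or blank value falls back to DEFAULT; an unknown value
   * raises IllegalArgumentException via Enum.valueOf.
   */
  static AggrEvictionMode fromConf(String value) {
    if (value == null || value.trim().isEmpty()) {
      return DEFAULT;
    }
    return valueOf(value.trim().toUpperCase(Locale.ROOT));
  }
}
```

Compared with a boolean, a further strategy can later be added as one more constant without introducing another configuration key.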

##########
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorGroupByOperator.java
##########
@@ -420,35 +460,56 @@ public void doProcessBatch(VectorizedRowBatch batch, boolean isFirstGroupingSet,
       //Flush if memory limits were reached
       // We keep flushing until the memory is under threshold
       int preFlushEntriesCount = numEntriesHashTable;
-      while (shouldFlush(batch)) {
-        flush(false);
 
-        if(gcCanary.get() == null) {
-          gcCanaryFlushes++;
-          gcCanary = new SoftReference<Object>(new Object());
-        }
+      if (!lruCache) {
+        while (shouldFlush(batch)) {
+          flush(false);
+
+          if(gcCanary.get() == null) {
+            gcCanaryFlushes++;
+            gcCanary = new SoftReference<Object>(new Object());
+          }
 
-        //Validate that some progress is being made
-        if (!(numEntriesHashTable < preFlushEntriesCount)) {
-          if (LOG.isDebugEnabled()) {
-            LOG.debug(String.format("Flush did not progress: %d entries before, %d entries after",
-                preFlushEntriesCount,
-                numEntriesHashTable));
+          //Validate that some progress is being made
+          if (!(numEntriesHashTable < preFlushEntriesCount)) {
+            if (LOG.isDebugEnabled()) {
+              LOG.debug(String.format("Flush did not progress: %d entries before, %d entries after",
+                  preFlushEntriesCount,
+                  numEntriesHashTable));
+            }
+            break;
           }
-          break;
+          preFlushEntriesCount = numEntriesHashTable;
         }
-        preFlushEntriesCount = numEntriesHashTable;
+      } else {
+        checkAndFlushLRU(batch);
       }
 
       if (sumBatchSize == 0 && 0 != batch.size) {
         // Sample the first batch processed for variable sizes.
         updateAvgVariableSize(batch);
+        currMemUsed = numEntriesHashTable * (fixedHashEntrySize + avgVariableSize);

Review comment:
       This is strange: there is a `currMemUsed` field and there is also a
`currMemUsed` local variable in `shouldFlush`. The local shadows the field,
which might make things more interesting :)
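
The hazard being pointed out is ordinary Java name shadowing. The sketch below is a toy illustration with made-up values, not the operator's code: `ShadowingDemo` and both method names are hypothetical, and only the identifier `currMemUsed` is taken from the diff.

```java
/** Toy illustration of a local variable shadowing a field of the same name. */
class ShadowingDemo {
  // Field, as a stand-in for the one added in the diff (value is made up).
  private long currMemUsed = 100;

  /** The local declaration hides the field; the field is neither read nor updated. */
  boolean shouldFlushShadowed(long limit) {
    long currMemUsed = 500; // local estimate, unrelated to the field
    return currMemUsed > limit;
  }

  /** Qualifying with this. makes the method read the field instead. */
  boolean shouldFlushUsingField(long limit) {
    return this.currMemUsed > limit;
  }

  long fieldValue() {
    return currMemUsed;
  }
}
```

With a limit of 200, the two methods disagree, which is the kind of divergence that becomes likely once a field and a local share one name. Renaming one of them removes the ambiguity.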






Issue Time Tracking
-------------------

    Worklog Id:     (was: 458606)
    Time Spent: 50m  (was: 40m)

> Improve key evictions in VectorGroupByOperator
> ----------------------------------------------
>
>                 Key: HIVE-23843
>                 URL: https://issues.apache.org/jira/browse/HIVE-23843
>             Project: Hive
>          Issue Type: Improvement
>          Components: Hive
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> Keys in {{mapKeysAggregationBuffers}} are evicted in random order. Tasks also 
> run into GC issues when multiple keys are involved in group-bys. It would be 
> good to provide an option for LRU-based eviction in 
> {{mapKeysAggregationBuffers}}.
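
For readers unfamiliar with the LRU ordering the issue asks for, the sketch below shows how `java.util.LinkedHashMap` orders keys when constructed in access-order mode. It is only an illustration of the standard library behavior; `AccessOrderDemo` and the keys `k1` to `k3` are made up, and the element types differ from the operator's.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AccessOrderDemo {
  /** Returns the keys from least to most recently used after a fixed access pattern. */
  static List<String> keysInEvictionOrder() {
    // Third constructor argument true selects access order instead of insertion order.
    Map<String, Long> buffers = new LinkedHashMap<>(16, 0.75f, true);
    buffers.put("k1", 1L);
    buffers.put("k2", 2L);
    buffers.put("k3", 3L);
    buffers.get("k1"); // touching k1 moves it to the most recently used end
    return new ArrayList<>(buffers.keySet());
  }
}
```

Here `keysInEvictionOrder()` yields `[k2, k3, k1]`, so an LRU policy would evict `k2` first, whereas a plain `HashMap` gives no such ordering guarantee.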



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
