[ 
https://issues.apache.org/jira/browse/HIVE-26947?focusedWorklogId=840552&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-840552
 ]

ASF GitHub Bot logged work on HIVE-26947:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 20/Jan/23 09:17
            Start Date: 20/Jan/23 09:17
    Worklog Time Spent: 10m 
      Work Description: akshat0395 commented on code in PR #3955:
URL: https://github.com/apache/hive/pull/3955#discussion_r1082268550


##########
ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Worker.java:
##########
@@ -118,19 +119,23 @@ public void run() {
           singleRun.cancel(true);
           executor.shutdownNow();
           executor = getTimeoutHandlingExecutor();
+          err = true;
         } catch (ExecutionException e) {
           LOG.info("Exception during executing compaction", e);
+          err = true;
         } catch (InterruptedException ie) {
           // do not ignore interruption requests
           return;
+        } catch (Throwable t) {
+          err = true;
         }
 
         doPostLoopActions(System.currentTimeMillis() - startedAt);
 
         // If we didn't try to launch a job it either means there was no work to do or we got
-        // here as the result of a communication failure with the DB.  Either way we want to wait
+        // here as the result of an error like communication failure with the DB, schema failures etc.  Either way we want to wait
         // a bit before, otherwise we can start over the loop immediately.
-        if (!launchedJob && !stop.get()) {
+        if ((!launchedJob || err) && !stop.get()) {

Review Comment:
   Before the change in
https://github.com/apache/hive/pull/3916/files#diff-c12003e4d6f63a4f8e0aa57aaa808fbf6066789bf4426f6d31123e9399570fd3
   the launchedJob flag was used to decide whether the thread should go to sleep or 
not, but that was not sufficient to avoid an instant respawn and an immediate 
attempt to reconnect to HMS. The err flag ensures that in case of an error/exception 
we wait for a dedicated sleep time before trying to connect again.
   cc @veghlaci05 





Issue Time Tracking
-------------------

    Worklog Id:     (was: 840552)
    Time Spent: 2h  (was: 1h 50m)

> Hive compactor.Worker can respawn connections to HMS at extremely high 
> frequency
> --------------------------------------------------------------------------------
>
>                 Key: HIVE-26947
>                 URL: https://issues.apache.org/jira/browse/HIVE-26947
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Akshat Mathur
>            Assignee: Akshat Mathur
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> After catching the exception generated by the findNextCompactionAndExecute() 
> task, HS2 appears to immediately rerun the task with no delay or backoff.  As 
> a result there are ~3500 connection attempts from HS2 to HMS over just a 
> 5-second period in the HS2 log.
> The compactor.Worker should wait between failed attempts and maybe do an 
> exponential backoff.
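
The exponential backoff suggested in the description could look roughly like the sketch below. It is not the code from PR #3955, which uses a fixed sleep guarded by the err flag. The class name BackoffSketch, the method nextDelayMillis, and the base/cap values are all hypothetical and chosen only for illustration.

```java
public class BackoffSketch {

    /**
     * Hypothetical helper: delay to wait after the n-th consecutive failure.
     * Doubles the base delay for each additional failure and caps it at maxMillis.
     * Returns 0 when there has been no failure.
     */
    static long nextDelayMillis(int consecutiveFailures, long baseMillis, long maxMillis) {
        if (consecutiveFailures <= 0) {
            return 0L;
        }
        long delay = baseMillis;
        // Stop doubling once the cap is reached, which also avoids overflow
        // for large failure counts.
        for (int i = 1; i < consecutiveFailures && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    public static void main(String[] args) {
        // Assumed values: 1 second base delay, 60 second cap.
        long baseMillis = 1_000L;
        long maxMillis = 60_000L;
        for (int failures = 1; failures <= 8; failures++) {
            System.out.println("failures=" + failures
                + " delayMillis=" + nextDelayMillis(failures, baseMillis, maxMillis));
        }
    }
}
```

In a worker loop, the failure counter would be reset to zero after a successful run, so a single transient error costs only the base delay while a persistent outage quickly settles at the cap.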



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
