-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58109/
-----------------------------------------------------------

(Updated March 31, 2017, 1:22 p.m.)


Review request for Ambari, Alejandro Fernandez, Nate Cole, and Robert Levas.


Bugs: BUG-20646
    https://issues.apache.org/jira/browse/BUG-20646


Repository: ambari


Description
-------

When creating a massive request (such as a rolling upgrade on a cluster with 
1000 nodes), the size of the request seems to slow down the {{ActionScheduler}}. 
Each command was taking between 1 and 2 minutes to run (even server-side tasks). 

The cause of this can be seen in the following two stack traces:

{code:title=ActionSchedulerImpl}
        at org.apache.ambari.server.actionmanager.ActionDBAccessorImpl.getTasks(ActionDBAccessorImpl.java:84)
        at org.apache.ambari.server.actionmanager.Stage.<init>(Stage.java:157)
        at org.apache.ambari.server.actionmanager.StageFactoryImpl.createExisting(StageFactoryImpl.java:72)
        at org.apache.ambari.server.actionmanager.ActionDBAccessorImpl.getStagesInProgress(ActionDBAccessorImpl.java:303)
        at org.apache.ambari.server.actionmanager.ActionScheduler.doWork(ActionScheduler.java:341)
        at org.apache.ambari.server.actionmanager.ActionScheduler.run(ActionScheduler.java:302)
        at java.lang.Thread.run(Thread.java:745)
{code}

{code:title=Server Action Executor}
        at org.apache.ambari.server.actionmanager.ActionDBAccessorImpl.getTasks(ActionDBAccessorImpl.java:700)
        at org.apache.ambari.server.actionmanager.ActionDBAccessorImpl.getTasks(ActionDBAccessorImpl.java:84)
        at org.apache.ambari.server.actionmanager.Stage.<init>(Stage.java:157)
        at org.apache.ambari.server.actionmanager.StageFactoryImpl.createExisting(StageFactoryImpl.java:72)
        at org.apache.ambari.server.actionmanager.Request.<init>(Request.java:199)
        at org.apache.ambari.server.actionmanager.Request$$FastClassByGuice$$9071e03.newInstance(<generated>)
        at com.google.inject.internal.cglib.reflect.$FastConstructor.newInstance(FastConstructor.java:40)
        at com.google.inject.internal.DefaultConstructionProxyFactory$1.newInstance(DefaultConstructionProxyFactory.java:60)
        at com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:85)
        at com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:254)
        at com.google.inject.internal.InjectorImpl$4$1.call(InjectorImpl.java:978)
        at com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1024)
        at com.google.inject.internal.InjectorImpl$4.get(InjectorImpl.java:974)
        at com.google.inject.assistedinject.FactoryProvider2.invoke(FactoryProvider2.java:632)
        at com.sun.proxy.$Proxy26.createExisting(Unknown Source)
        at org.apache.ambari.server.actionmanager.ActionDBAccessorImpl.getRequests(ActionDBAccessorImpl.java:784)
        at org.apache.ambari.server.serveraction.ServerActionExecutor.cleanRequestShareDataContexts(ServerActionExecutor.java:259)
        - locked <0x00007ff0a14083c8> (a java.util.HashMap)
        at org.apache.ambari.server.serveraction.ServerActionExecutor.doWork(ServerActionExecutor.java:454)
        at org.apache.ambari.server.serveraction.ServerActionExecutor$1.run(ServerActionExecutor.java:160)
        at java.lang.Thread.run(Thread.java:745)
{code}

It's clear from these stacks that every {{PENDING}} stage (roughly 15,000) was 
being loaded into memory every second (along with its accompanying tasks). 
This is unnecessary: these methods don't need all stages, just the _next_ 
stage of each request, because stages within a single request run synchronously.

The proposed solution is to fix the {{StageEntity.findByCommandStatuses}} call 
so it doesn't return every stage:
{code}
SELECT stage.requestid, 
       MIN(stage.stageid) 
FROM   stageentity stage, 
       hostrolecommandentity hrc 
WHERE  hrc.status IN :statuses 
       AND hrc.stageid = stage.stageid 
       AND hrc.requestid = stage.requestid 
GROUP  BY stage.requestid 
{code}
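
To illustrate what the grouped query is meant to return, here is a minimal in-memory sketch of the same selection: for each request, the lowest stage ID that still has a command in one of the given statuses. The class {{NextStagePerRequest}}, the {{Command}} row type, and the {{Status}} enum are hypothetical stand-ins written for this example; they are not the Ambari entities or the code in this patch.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class NextStagePerRequest {

  // Hypothetical stand-in for the command statuses; not the Ambari enum.
  enum Status { PENDING, QUEUED, IN_PROGRESS, COMPLETED }

  // Hypothetical flattened row: one host role command and the stage it belongs to.
  static final class Command {
    final long requestId;
    final long stageId;
    final Status status;

    Command(long requestId, long stageId, Status status) {
      this.requestId = requestId;
      this.stageId = stageId;
      this.status = status;
    }
  }

  // In-memory equivalent of: SELECT requestid, MIN(stageid) ... WHERE status IN :statuses GROUP BY requestid
  static Map<Long, Long> nextStageByRequest(List<Command> commands, Set<Status> statuses) {
    Map<Long, Long> next = new TreeMap<>();
    for (Command c : commands) {
      if (statuses.contains(c.status)) {
        // Keep the smallest matching stage ID seen so far for this request.
        next.merge(c.requestId, c.stageId, Long::min);
      }
    }
    return next;
  }

  public static void main(String[] args) {
    List<Command> commands = Arrays.asList(
        new Command(1, 1, Status.COMPLETED),
        new Command(1, 2, Status.PENDING),
        new Command(1, 3, Status.PENDING),
        new Command(2, 10, Status.IN_PROGRESS),
        new Command(2, 11, Status.PENDING));

    Set<Status> active = EnumSet.of(Status.PENDING, Status.QUEUED, Status.IN_PROGRESS);

    // Prints {1=2, 2=10}: one stage per request instead of every pending stage.
    System.out.println(nextStageByRequest(commands, active));
  }
}
```

With one row per request, the number of stages loaded on each pass depends on the number of requests with outstanding work rather than on the total number of {{PENDING}} stages.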


Diffs
-----

  ambari-server/src/main/java/org/apache/ambari/server/actionmanager/ActionDBAccessor.java 9325d03 
  ambari-server/src/main/java/org/apache/ambari/server/actionmanager/ActionDBAccessorImpl.java ab4feaa 
  ambari-server/src/main/java/org/apache/ambari/server/actionmanager/ActionScheduler.java 0984c5c 
  ambari-server/src/main/java/org/apache/ambari/server/orm/dao/StageDAO.java 5151fb3 
  ambari-server/src/main/java/org/apache/ambari/server/orm/entities/StageEntity.java f68338f 
  ambari-server/src/main/java/org/apache/ambari/server/serveraction/ServerActionExecutor.java b0be6b3 
  ambari-server/src/test/java/org/apache/ambari/server/actionmanager/TestActionDBAccessorImpl.java 81eef3b 
  ambari-server/src/test/java/org/apache/ambari/server/actionmanager/TestActionScheduler.java 2b5d2f3 
  ambari-server/src/test/java/org/apache/ambari/server/orm/dao/RequestDAOTest.java 9b62671 
  ambari-server/src/test/java/org/apache/ambari/server/serveraction/ServerActionExecutorTest.java 44d5b63 
  ambari-server/src/test/java/org/apache/ambari/server/state/services/RetryUpgradeActionServiceTest.java e2ce6e7 


Diff: https://reviews.apache.org/r/58109/diff/1/


Testing (updated)
-------

Tests run: 4976, Failures: 0, Errors: 0, Skipped: 39

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 17:49 min
[INFO] Finished at: 2017-03-31T12:58:22-04:00
[INFO] Final Memory: 59M/664M
[INFO] ------------------------------------------------------------------------


Thanks,

Jonathan Hurley
