Re: Review Request 53686: Stage and Request status should be persisted in the database

Jaimin Jetly Mon, 13 Feb 2017 11:20:25 -0800


> On Dec. 15, 2016, 4:49 p.m., Jonathan Hurley wrote:
> > ambari-server/src/main/java/org/apache/ambari/server/actionmanager/ActionDBAccessorImpl.java,
> >  lines 445-447
> > <https://reviews.apache.org/r/53686/diff/5/?file=1584761#file1584761line445>
> >
> >     This could be problematic if the merge fails when the transaction 
> > commits. You've already fired the event.
> 
> Jaimin Jetly wrote:
>     persistAction method has been annotated with @Transactional and 
> @TransactionalLock.
>     I believe if the publisher method is marked to be transactional then the 
> event listener will also fall in that transactional boundary considering that 
> the eventbus being used over here is default which I believe is synchronous. 
>     
>     I will verify this myself by checking that publsiher method, eventbus and 
> subscriber method are running on same thread.
> 
> Jaimin Jetly wrote:
>     I verified that the publisher and subscriber code runs on same thread and 
> so transactional boundary will extend for the entire logic making sure that 
> tasks and respective stage/request are consistent in their status
> 
> Jonathan Hurley wrote:
>     Although it may be true that the event bus is synchronous here, the point 
> is that the merge() which comes before it doesn't necessarily count until the 
> entire transaction is committed. By firing the event in this code, you're 
> assuming that the merge() will succeed once the batches operations are 
> flushed and the transaction is commited. If the listener modifies any sort of 
> in-memory state, then this causes a data integrity problem.


Thats true. 
The in-memory state can become inconsistent in that case. Meaning in-memory 
state of the listener can have most updated status data but DB that failed to 
merge can have stale status data. Although all data in the DB even though stale 
is consistent even at this point of time.

Now listener will again try to update DB on next heratbeat and will get 
consistent with DB when merge happens next time.


> On Dec. 15, 2016, 4:49 p.m., Jonathan Hurley wrote:
> > ambari-server/src/main/java/org/apache/ambari/server/controller/internal/CalculatedStatus.java,
> >  lines 282-287
> > <https://reviews.apache.org/r/53686/diff/5/?file=1584766#file1584766line282>
> >
> >     Would this count states like FAILED / TIMEDOUT twice?
> 
> Jaimin Jetly wrote:
>     yes. It will increment the FAILED counter and also increment COMPLETED 
> counter which is desired.
> 
> Jonathan Hurley wrote:
>     Why is this desired? Double counting them doesn't seem right.

The COMPLETED key is represents any completed state 
hrc.getStatus().isCompletedState() and thats how it's used in subsequent 
function.
So on getting FAILED state we are incrementing both completed and Failed

This code is a copy of existing code logic: 
https://github.com/apache/ambari/blob/release-2.4.2/ambari-server/src/main/java/org/apache/ambari/server/controller/internal/CalculatedStatus.java#L235-L240


On Dec. 15, 2016, 4:49 p.m., Jaimin Jetly wrote:
> > One thing I don't quite see here (and it could be due to the size of the 
> > patch) is what happens in these two cases:
> > - Something goes wrong when trying to store a task's status. How does the 
> > system recover and mark it completed?
> > - What about waiting until a request is HOLDING and then restarting Ambari 
> > - will the relevent maps get re-populated?
> 
> Jaimin Jetly wrote:
>     >> - Something goes wrong when trying to store a task's status. How does 
> the system recover and mark it completed?
>     
>     This work only adds logic to add/update stage and request status. The way 
> task status is being updated or the logic for system to recover from anything 
> that goes wrong when storing task status is not changed.
>     This work ensures that task status update, respective stage status update 
> and respective request status update happens inside same transactional 
> boundary. Thus all three entities remains consistent in the status they show. 
>     This work does not add any recovery logic and piggybacks on existing 
> failure recovery mechanism for updating task status. Thus if something goes 
> wrong storing task status then stage/request status will also be not store 
> and vice-versa. Next time when ambari-agent sends command reports again then 
> task update and respective stage/request status update should also also get 
> updated successfully. 
>     
>     >> -  What about waiting until a request is HOLDING and then restarting 
> Ambari - will the relevent maps get re-populated?
>     
>     Yes, everytime ambari-server starts, we check for all stages in 
> HostRoleStatus.IN_PROGRESS_STATUSES and publish an event with the tasks. 
> these will repopulate the maps. The patch adds that logic with 
> "publishInProgressTasks(stages)" method in ActionScheduler.java
>     
>     I have tested that scenario of restarting ambari-server when a request is 
> ongoing with in progress tasks and validating that stages and requests status 
> is correctly updated
> 
> Sid Wagle wrote:
>     The agent re-sending command reports are we generally fault tolerant in 
> that area? Sicne you are probably mostly up-to-date on that part of the code 
> can you shed some light on loss of task status scenario. Do we recover the 
> correct status?
> 
> Jaimin Jetly wrote:
>     Yes. Largely. 
>     Although There are following specific cases:
>     1) If persisting of "request, it's stages and it's host role commands" 
> for the first time does not happen then the request will be completely missed 
> and ambari-server will never log in the database or track it in anyways.
>     2) If "request, it's stages and it's host role commands" are persisted 
> but subsequent updates and DB merge keeps failing then request will be 
> eventually declared timeout and so will all hrcs
>     
>     As far as DB merge from subsequent agent command reports is successfully 
> committed to DB, earlier failures will be recovered.
> 
> Jonathan Hurley wrote:
>     But agent status / heartbeat processing is asynchronous - so once we 
> receive the status and try to process it, the agent has already delivered its 
> payload. So, if there's a failure processing the completed status, it will 
> never be marked as completed. How does it recover in this scenario - how does 
> the correct status get updated in the DB since the agent doesn't re-send a 
> delivered command status?

ambari-server keeps scheduling commands for all stages that has in-progress 
tasks in the DB. So if there is a failure processing the completed status for 
any HRC then ambari-server will again ask for the status of that HRC next time 
and keep asking for it until the status is reported and successfuly merge as 
COMPLETED/FAILED or successfully merged as ABORTED/TIMEDOUT by ambari-server.
Existing code reference: 
https://github.com/apache/ambari/blob/release-2.4.2/ambari-server/src/main/java/org/apache/ambari/server/actionmanager/ActionScheduler.java#L341


- Jaimin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53686/#review159315
-----------------------------------------------------------


On Feb. 6, 2017, 11:09 p.m., Jaimin Jetly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53686/
> -----------------------------------------------------------
> 
> (Updated Feb. 6, 2017, 11:09 p.m.)
> 
> 
> Review request for Ambari, Jonathan Hurley, Nate Cole, Sumit Mohanty, and Sid 
> Wagle.
> 
> 
> Bugs: AMBARI-18868
>     https://issues.apache.org/jira/browse/AMBARI-18868
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> Stage and Request status should be persisted in the database.
> 
> upgrading to ambari-3.0.0 should add status for all present stages and 
> request for the cluster.
> 
> 
> Diffs
> -----
> 
>   
> ambari-server/src/main/java/org/apache/ambari/server/actionmanager/ActionDBAccessorImpl.java
>  7837a7b 
>   
> ambari-server/src/main/java/org/apache/ambari/server/actionmanager/ActionScheduler.java
>  dabcb98 
>   
> ambari-server/src/main/java/org/apache/ambari/server/actionmanager/Request.java
>  31e11c1 
>   
> ambari-server/src/main/java/org/apache/ambari/server/actionmanager/Stage.java 
> 4a05b32 
>   
> ambari-server/src/main/java/org/apache/ambari/server/controller/internal/CalculatedStatus.java
>  3c415df 
>   
> ambari-server/src/main/java/org/apache/ambari/server/events/TaskCreateEvent.java
>  PRE-CREATION 
>   ambari-server/src/main/java/org/apache/ambari/server/events/TaskEvent.java 
> PRE-CREATION 
>   
> ambari-server/src/main/java/org/apache/ambari/server/events/TaskUpdateEvent.java
>  PRE-CREATION 
>   
> ambari-server/src/main/java/org/apache/ambari/server/events/listeners/tasks/TaskStatusListener.java
>  PRE-CREATION 
>   
> ambari-server/src/main/java/org/apache/ambari/server/events/publishers/TaskEventPublisher.java
>  PRE-CREATION 
>   
> ambari-server/src/main/java/org/apache/ambari/server/orm/dao/HostRoleCommandDAO.java
>  02c4091 
>   
> ambari-server/src/main/java/org/apache/ambari/server/orm/dao/RequestDAO.java 
> 1c4d0a3 
>   ambari-server/src/main/java/org/apache/ambari/server/orm/dao/StageDAO.java 
> d2f899f 
>   
> ambari-server/src/main/java/org/apache/ambari/server/orm/entities/HostRoleCommandEntity.java
>  74271b9 
>   
> ambari-server/src/main/java/org/apache/ambari/server/orm/entities/RequestEntity.java
>  7944d21 
>   
> ambari-server/src/main/java/org/apache/ambari/server/orm/entities/StageEntity.java
>  f9c8810 
>   
> ambari-server/src/main/java/org/apache/ambari/server/orm/entities/StageEntityPK.java
>  9ca0470 
>   
> ambari-server/src/main/java/org/apache/ambari/server/upgrade/UpgradeCatalog300.java
>  4f90ef3 
>   ambari-server/src/main/resources/Ambari-DDL-Derby-CREATE.sql b79c945 
>   ambari-server/src/main/resources/Ambari-DDL-MySQL-CREATE.sql 1c502bc 
>   ambari-server/src/main/resources/Ambari-DDL-Oracle-CREATE.sql c6d4ad0 
>   ambari-server/src/main/resources/Ambari-DDL-Postgres-CREATE.sql 1be87bb 
>   ambari-server/src/main/resources/Ambari-DDL-SQLAnywhere-CREATE.sql abe48e8 
>   ambari-server/src/main/resources/Ambari-DDL-SQLServer-CREATE.sql 169a464 
>   
> ambari-server/src/test/java/org/apache/ambari/server/actionmanager/TestActionDBAccessorImpl.java
>  177ac70 
>   
> ambari-server/src/test/java/org/apache/ambari/server/actionmanager/TestActionScheduler.java
>  ade625a 
>   
> ambari-server/src/test/java/org/apache/ambari/server/alerts/AmbariPerformanceRunnableTest.java
>  7b1a5a2 
>   
> ambari-server/src/test/java/org/apache/ambari/server/controller/internal/UpgradeResourceProviderTest.java
>  a0701b6 
>   
> ambari-server/src/test/java/org/apache/ambari/server/controller/internal/UpgradeSummaryResourceProviderTest.java
>  619e367 
>   
> ambari-server/src/test/java/org/apache/ambari/server/events/listeners/tasks/TaskStatusListenerTest.java
>  PRE-CREATION 
>   
> ambari-server/src/test/java/org/apache/ambari/server/state/ConfigHelperTest.java
>  b1c10f5 
>   
> ambari-server/src/test/java/org/apache/ambari/server/state/cluster/ClusterEffectiveVersionTest.java
>  9d339e2 
>   
> ambari-server/src/test/java/org/apache/ambari/server/state/services/RetryUpgradeActionServiceTest.java
>  ed95b0b 
>   
> ambari-server/src/test/java/org/apache/ambari/server/upgrade/UpgradeCatalog300Test.java
>  d7979e8 
> 
> Diff: https://reviews.apache.org/r/53686/diff/
> 
> 
> Testing
> -------
> 
> Verified manually on a cluster by making api requests and upgrading ambari.
> Add unit tests.
> Verified that the patch does not break any existing unit tests on dev box. 
> Jenkins job overall unit test result pending..
> Verified on 1000 node cluster that the patch does not regress big operations. 
> Executed Stop Services and Start Services API call which gernerated around 
> 9000 tasks and compared  request completion time for these operations. There 
> was a minor performance gain with the patch. As part of 
> https://issues.apache.org/jira/browse/AMBARI-18889, I will look if we can use 
> the request status and stage status to further enhance performance.
> 
> 
> Thanks,
> 
> Jaimin Jetly
> 
>

Re: Review Request 53686: Stage and Request status should be persisted in the database

Reply via email to