> On Dec. 15, 2016, 11:49 a.m., Jonathan Hurley wrote: > > ambari-server/src/main/java/org/apache/ambari/server/controller/internal/CalculatedStatus.java, > > lines 282-287 > > <https://reviews.apache.org/r/53686/diff/5/?file=1584766#file1584766line282> > > > > Would this count states like FAILED / TIMEDOUT twice? > > Jaimin Jetly wrote: > yes. It will increment the FAILED counter and also increment COMPLETED > counter which is desired.
Why is this desired? Double counting them doesn't seem right. > On Dec. 15, 2016, 11:49 a.m., Jonathan Hurley wrote: > > ambari-server/src/main/java/org/apache/ambari/server/actionmanager/ActionDBAccessorImpl.java, > > lines 445-447 > > <https://reviews.apache.org/r/53686/diff/5/?file=1584761#file1584761line445> > > > > This could be problematic if the merge fails when the transaction > > commits. You've already fired the event. > > Jaimin Jetly wrote: > persistAction method has been annotated with @Transactional and > @TransactionalLock. > I believe if the publisher method is marked to be transactional then the > event listener will also fall in that transactional boundary considering that > the eventbus being used over here is default which I believe is synchronous. > > I will verify this myself by checking that publsiher method, eventbus and > subscriber method are running on same thread. > > Jaimin Jetly wrote: > I verified that the publisher and subscriber code runs on same thread and > so transactional boundary will extend for the entire logic making sure that > tasks and respective stage/request are consistent in their status Although it may be true that the event bus is synchronous here, the point is that the merge() which comes before it doesn't necessarily count until the entire transaction is committed. By firing the event in this code, you're assuming that the merge() will succeed once the batches operations are flushed and the transaction is commited. If the listener modifies any sort of in-memory state, then this causes a data integrity problem. On Dec. 15, 2016, 11:49 a.m., Jaimin Jetly wrote: > > One thing I don't quite see here (and it could be due to the size of the > > patch) is what happens in these two cases: > > - Something goes wrong when trying to store a task's status. How does the > > system recover and mark it completed? > > - What about waiting until a request is HOLDING and then restarting Ambari > > - will the relevent maps get re-populated? > > Jaimin Jetly wrote: > >> - Something goes wrong when trying to store a task's status. How does > the system recover and mark it completed? > > This work only adds logic to add/update stage and request status. The way > task status is being updated or the logic for system to recover from anything > that goes wrong when storing task status is not changed. > This work ensures that task status update, respective stage status update > and respective request status update happens inside same transactional > boundary. Thus all three entities remains consistent in the status they show. > This work does not add any recovery logic and piggybacks on existing > failure recovery mechanism for updating task status. Thus if something goes > wrong storing task status then stage/request status will also be not store > and vice-versa. Next time when ambari-agent sends command reports again then > task update and respective stage/request status update should also also get > updated successfully. > > >> - What about waiting until a request is HOLDING and then restarting > Ambari - will the relevent maps get re-populated? > > Yes, everytime ambari-server starts, we check for all stages in > HostRoleStatus.IN_PROGRESS_STATUSES and publish an event with the tasks. > these will repopulate the maps. The patch adds that logic with > "publishInProgressTasks(stages)" method in ActionScheduler.java > > I have tested that scenario of restarting ambari-server when a request is > ongoing with in progress tasks and validating that stages and requests status > is correctly updated > > Sid Wagle wrote: > The agent re-sending command reports are we generally fault tolerant in > that area? Sicne you are probably mostly up-to-date on that part of the code > can you shed some light on loss of task status scenario. Do we recover the > correct status? > > Jaimin Jetly wrote: > Yes. Largely. > Although There are following specific cases: > 1) If persisting of "request, it's stages and it's host role commands" > for the first time does not happen then the request will be completely missed > and ambari-server will never log in the database or track it in anyways. > 2) If "request, it's stages and it's host role commands" are persisted > but subsequent updates and DB merge keeps failing then request will be > eventually declared timeout and so will all hrcs > > As far as DB merge from subsequent agent command reports is successfully > committed to DB, earlier failures will be recovered. But agent status / heartbeat processing is asynchronous - so once we receive the status and try to process it, the agent has already delivered its payload. So, if there's a failure processing the completed status, it will never be marked as completed. How does it recover in this scenario - how does the correct status get updated in the DB since the agent doesn't re-send a delivered command status? - Jonathan ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/53686/#review159315 ----------------------------------------------------------- On Jan. 18, 2017, 5:33 p.m., Jaimin Jetly wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/53686/ > ----------------------------------------------------------- > > (Updated Jan. 18, 2017, 5:33 p.m.) > > > Review request for Ambari, Jonathan Hurley, Nate Cole, Sumit Mohanty, and Sid > Wagle. > > > Bugs: AMBARI-18868 > https://issues.apache.org/jira/browse/AMBARI-18868 > > > Repository: ambari > > > Description > ------- > > Stage and Request status should be persisted in the database. > > upgrading to ambari-3.0.0 should add status for all present stages and > request for the cluster. > > > Diffs > ----- > > > ambari-server/src/main/java/org/apache/ambari/server/actionmanager/ActionDBAccessorImpl.java > 7837a7b > > ambari-server/src/main/java/org/apache/ambari/server/actionmanager/ActionScheduler.java > dabcb98 > > ambari-server/src/main/java/org/apache/ambari/server/actionmanager/HostRoleStatus.java > 3656bfe > > ambari-server/src/main/java/org/apache/ambari/server/actionmanager/Request.java > 31e11c1 > > ambari-server/src/main/java/org/apache/ambari/server/actionmanager/Stage.java > 4a05b32 > > ambari-server/src/main/java/org/apache/ambari/server/controller/internal/CalculatedStatus.java > 3c415df > > ambari-server/src/main/java/org/apache/ambari/server/events/TaskCreateEvent.java > PRE-CREATION > ambari-server/src/main/java/org/apache/ambari/server/events/TaskEvent.java > PRE-CREATION > > ambari-server/src/main/java/org/apache/ambari/server/events/TaskUpdateEvent.java > PRE-CREATION > > ambari-server/src/main/java/org/apache/ambari/server/events/listeners/tasks/TaskStatusListener.java > PRE-CREATION > > ambari-server/src/main/java/org/apache/ambari/server/events/publishers/TaskEventPublisher.java > PRE-CREATION > > ambari-server/src/main/java/org/apache/ambari/server/orm/dao/HostRoleCommandDAO.java > 02c4091 > > ambari-server/src/main/java/org/apache/ambari/server/orm/dao/RequestDAO.java > 1c4d0a3 > ambari-server/src/main/java/org/apache/ambari/server/orm/dao/StageDAO.java > d2f899f > > ambari-server/src/main/java/org/apache/ambari/server/orm/entities/HostRoleCommandEntity.java > 74271b9 > > ambari-server/src/main/java/org/apache/ambari/server/orm/entities/RequestEntity.java > 7944d21 > > ambari-server/src/main/java/org/apache/ambari/server/orm/entities/StageEntity.java > f9c8810 > > ambari-server/src/main/java/org/apache/ambari/server/orm/entities/StageEntityPK.java > 9ca0470 > > ambari-server/src/main/java/org/apache/ambari/server/upgrade/UpgradeCatalog300.java > 4f90ef3 > ambari-server/src/main/resources/Ambari-DDL-Derby-CREATE.sql b79c945 > ambari-server/src/main/resources/Ambari-DDL-MySQL-CREATE.sql 1c502bc > ambari-server/src/main/resources/Ambari-DDL-Oracle-CREATE.sql c6d4ad0 > ambari-server/src/main/resources/Ambari-DDL-Postgres-CREATE.sql 1be87bb > ambari-server/src/main/resources/Ambari-DDL-SQLAnywhere-CREATE.sql abe48e8 > ambari-server/src/main/resources/Ambari-DDL-SQLServer-CREATE.sql 169a464 > > ambari-server/src/test/java/org/apache/ambari/server/actionmanager/TestActionDBAccessorImpl.java > 1ca777d > > ambari-server/src/test/java/org/apache/ambari/server/actionmanager/TestActionScheduler.java > 6cc511e > > ambari-server/src/test/java/org/apache/ambari/server/controller/internal/UpgradeResourceProviderTest.java > a702e6f > > ambari-server/src/test/java/org/apache/ambari/server/controller/internal/UpgradeSummaryResourceProviderTest.java > e398a54 > > ambari-server/src/test/java/org/apache/ambari/server/events/listeners/tasks/TaskStatusListenerTest.java > PRE-CREATION > > ambari-server/src/test/java/org/apache/ambari/server/orm/dao/UpgradeDAOTest.java > ae85241 > > ambari-server/src/test/java/org/apache/ambari/server/state/services/RetryUpgradeActionServiceTest.java > 7dd9932 > > ambari-server/src/test/java/org/apache/ambari/server/upgrade/UpgradeCatalog300Test.java > d7979e8 > > Diff: https://reviews.apache.org/r/53686/diff/ > > > Testing > ------- > > Verified manually on a cluster by making api requests and upgrading ambari. > Add unit tests. > Verified that the patch does not break any existing unit tests on dev box. > Jenkins job overall unit test result pending.. > Verified on 1000 node cluster that the patch does not regress big operations. > Executed Stop Services and Start Services API call which gernerated around > 9000 tasks and compared request completion time for these operations. There > was a minor performance gain with the patch. As part of > https://issues.apache.org/jira/browse/AMBARI-18889, I will look if we can use > the request status and stage status to further enhance performance. > > > Thanks, > > Jaimin Jetly > >
