-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/46434/
-----------------------------------------------------------
(Updated Apr. 22, 2016, 12:42 p.m.)
Review request for Ambari, Alejandro Fernandez, Miklos Gergely, Oliver Szabo,
Sandor Magyari, and Sebastian Toader.
Changes
-------
Review fixes
Bugs: AMBARI-15991
https://issues.apache.org/jira/browse/AMBARI-15991
Repository: ambari
Description
-------
If the upgrade process takes longer than expected, the DataNode and RegionServer
are reported as failed, because they need more time to finish the upgrade.
For RegionServer, the fix checks whether the process is running; if it is, the
failed check is not treated as a failure.
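A minimal sketch of the RegionServer rule follows. It assumes the process is identified through a PID file. `is_process_running` and `regionserver_upgrade_ok` are hypothetical names used for illustration; they are not the functions in upgrade.py. Sending signal 0 only probes whether the PID is alive and does not deliver a signal.

```python
import errno
import os


def is_process_running(pid_file):
    """Return True if the PID stored in pid_file refers to a live process.

    Hypothetical helper: reads the PID and probes it with signal 0.
    """
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (IOError, OSError, ValueError):
        # Missing, unreadable or malformed PID file: treat as not running.
        return False
    try:
        os.kill(pid, 0)
    except OSError as e:
        # EPERM means the process exists but belongs to another user.
        return e.errno == errno.EPERM
    return True


def regionserver_upgrade_ok(post_upgrade_check, pid_file):
    """Hypothetical wrapper: a failing post-upgrade check is not treated
    as a failure while the RegionServer process is still running."""
    try:
        post_upgrade_check()
        return True
    except Exception:
        return is_process_running(pid_file)
```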
For DataNode, the process is also checked. If it is running, the check is
repeated 2 more times with a 5-minute wait before each retry. There is a
limitation here: Python scripts are allowed to run for 20 minutes by default,
and this check can take up to 16 minutes (2-minute initial check, 5-minute sleep
if it fails, 2-minute regular check, 5-minute sleep, 2-minute final check).
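The DataNode retry loop could look roughly like the following sketch. `wait_for_datanode`, `check` and `is_running` are hypothetical names. The 2-minute duration of each check is assumed to come from the check itself and is not modelled here. Only the 5-minute wait and the 2 retries come from the description above.

```python
import time


def wait_for_datanode(check, is_running, retries=2, wait_seconds=5 * 60,
                      sleep=time.sleep):
    """Hypothetical sketch of the DataNode retry loop.

    Runs check(). If it raises and the DataNode process is still running,
    sleeps wait_seconds and tries again, up to `retries` extra times.
    Re-raises the last error if the process is down or retries run out.
    """
    attempt = 0
    while True:
        try:
            return check()
        except Exception:
            if attempt >= retries or not is_running():
                raise
            attempt += 1
            sleep(wait_seconds)
```

Because `sleep` is injectable, the loop can be exercised in a unit test without waiting ten real minutes.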
If more time is needed, the default value of *server.task.timeout* and the
number of repetitions of the 5-minute check should be increased.
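The time budget is (retries + 1) checks plus `retries` waits. The small helper below assumes *server.task.timeout* is given in seconds (1200 s = 20 minutes). It shows why 2 retries is the limit: 3 × 2 + 2 × 5 = 16 minutes fits within the timeout, while a third retry would need 4 × 2 + 3 × 5 = 23 minutes. `worst_case_seconds` and `max_retries` are hypothetical names.

```python
def worst_case_seconds(retries, check_seconds=2 * 60, wait_seconds=5 * 60):
    """Worst-case duration: (retries + 1) checks plus `retries` waits."""
    return (retries + 1) * check_seconds + retries * wait_seconds


def max_retries(timeout_seconds, check_seconds=2 * 60, wait_seconds=5 * 60):
    """Largest retry count whose worst case fits in timeout_seconds.

    Solves (r + 1) * check + r * wait <= timeout for r. A negative result
    means even the initial check does not fit.
    """
    return (timeout_seconds - check_seconds) // (check_seconds + wait_seconds)
```

For example, `max_retries(1200)` returns 2. Raising the retry count to 3 would need a timeout of at least `worst_case_seconds(3)`, which is 1380 seconds.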
Diffs (updated)
-----
ambari-server/src/main/resources/common-services/HBASE/0.96.0.2.0/package/scripts/upgrade.py
01a8156
ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/datanode_upgrade.py
8f36001
ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/params_linux.py
7ad9f39
ambari-server/src/test/python/stacks/2.0.6/HBASE/test_hbase_regionserver.py
8d187ec
ambari-server/src/test/python/stacks/2.0.6/HDFS/test_datanode.py 78b8171
Diff: https://reviews.apache.org/r/46434/diff/
Testing
-------
I tested this manually:
For RegionServer, I tested the process check.
For DataNodes, I raised an intentional exception to see whether it keeps waiting
(this is how I ran into the 20-minute server task timeout).
----------------------------------------------------------------------
Total run:970
Total errors:0
Total failures:0
OK
Thanks,
Daniel Gergely