> On Feb. 24, 2016, 6:56 p.m., Alejandro Fernandez wrote:
> > ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/resourcemanager.py, line 123
> > <https://reviews.apache.org/r/43948/diff/1/?file=1267791#file1267791line123>
> >
> >     If this happens during cluster install, why don't we put a dependency in role_command_order.json that RM must start after ATS?
> >
> >     If ATS is on host1 and RM on host2, and during a fresh cluster install we fail to install ATS, then RM will keep waiting.
>
> Sebastian Toader wrote:
>     role_command_order.json won't work with Blueprints, as with Blueprints there is no cluster-wide ordering.
>
>     RM will keep waiting only until it exhausts the retries (8 * 20 secs).
>
> Alejandro Fernandez wrote:
>     Can we make Blueprints respect role_command_order?
>     Please include Robert Nettleton in the code review.
>
> Andrew Onischuk wrote:
>     Alejandro, we made BP not respect RCO to speed up deployments for one of the users, for whom timing is really critical. And if we revert that change we are going to run into that problem for him again.
>
> Alejandro Fernandez wrote:
>     I think this is a fix for one item in a larger picture. If BP doesn't respect RCO, then there are bound to be far more errors like this related to ordering.
>     In which case, we may spend a lot of effort adding hacks to components to keep retrying certain operations because other components on other hosts are not fully up;
>     think of Hive, Spark, History Server, and Tez trying to upload tarballs to HDFS, which may not be ready yet.
>
>     A more flexible way of fixing this is to enable auto-start for this environment. If RM fails because ATS hasn't yet created directories in HDFS, then keep retrying RM. That's a simpler and more general solution.
>
>     @Sumit Mohanty, what do you think?
I think it's ok for 2.2.2, but we should create a Jira for 2.4 to handle the case of Blueprints ignoring RCO more generally. Would you mind adding some Python comments so we know why this was added? I don't know if there's a way to run that check only if the cluster was installed via Blueprints.

- Alejandro


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43948/#review120537
-----------------------------------------------------------


On Feb. 25, 2016, 1:14 p.m., Sebastian Toader wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/43948/
> -----------------------------------------------------------
>
> (Updated Feb. 25, 2016, 1:14 p.m.)
>
>
> Review request for Ambari, Alejandro Fernandez, Andrew Onischuk, Sumit Mohanty, and Sid Wagle.
>
>
> Bugs: AMBARI-15158
>     https://issues.apache.org/jira/browse/AMBARI-15158
>
>
> Repository: ambari
>
>
> Description
> -------
>
> If ATS is installed, then ResourceManager will, after starting, check whether the directories where ATS stores timeline data for active and completed applications exist in DFS. There might be cases when RM comes up much earlier than ATS creates these directories. In those situations RM stops with an "IOException: /ats/active does not exist" error message.
>
> To avoid this, the Python script responsible for starting the RM component has been modified to check for the existence of these directories upfront, before the RM process is started. This check is performed only if ATS is installed and has either yarn.timeline-service.entity-group-fs-store.active-dir or yarn.timeline-service.entity-group-fs-store.done-dir set.
>
>
> Diffs
> -----
>
>   ambari-common/src/main/python/resource_management/libraries/providers/hdfs_resource.py b73ae56
>   ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/params_linux.py 2ef404d
>   ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/resourcemanager.py ec7799e
>
> Diff: https://reviews.apache.org/r/43948/diff/
>
>
> Testing
> -------
>
> Manual testing:
> 1. Created secure and non-secure clusters with a Blueprint where NN, RM and ATS were deployed to different nodes. Tested with webhdfs both enabled and disabled in HDFS.
> 2. Created a cluster using the UI where NN, RM and ATS were deployed to different nodes. After the cluster was kerberized, it was tested with webhdfs both enabled and disabled in HDFS.
>
> Python test results:
> ----------------------------------------------------------------------
> Total run: 902
> Total errors: 0
> Total failures: 0
> OK
>
>
> Thanks,
>
> Sebastian Toader
>
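The upfront check described above boils down to polling HDFS for the ATS active-dir/done-dir paths before launching the RM process. Below is a minimal sketch of that idea, assuming plain calls to the "hdfs dfs -test -d" CLI rather than Ambari's resource_management wrappers; the helper name and example paths are hypothetical, and the retry budget (8 * 20 secs) is the one mentioned in the thread above.

    # Sketch only, not the actual patch: poll HDFS until the ATS
    # entity-group-fs-store directories exist, then let RM start.
    import subprocess
    import time

    def wait_for_ats_dirs(dirs, tries=8, sleep_secs=20):
        """Return True once every path in `dirs` exists as a directory in HDFS,
        or False after exhausting the retries (8 * 20 secs by default)."""
        for _ in range(tries):
            # "hdfs dfs -test -d <path>" exits 0 when <path> is an existing directory.
            missing = [d for d in dirs
                       if subprocess.call(["hdfs", "dfs", "-test", "-d", d]) != 0]
            if not missing:
                return True
            time.sleep(sleep_secs)
        return False

    # Example values for yarn.timeline-service.entity-group-fs-store.active-dir/done-dir
    if not wait_for_ats_dirs(["/ats/active", "/ats/done"]):
        raise Exception("ATS active/done directories were not found in HDFS")

The actual patch presumably routes this through the hdfs_resource.py provider and params_linux.py listed in the Diffs rather than shelling out directly, which would also explain why both the webhdfs-enabled and webhdfs-disabled code paths were exercised in testing.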
