> On Feb. 24, 2016, 6:56 p.m., Alejandro Fernandez wrote:
> > ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/resourcemanager.py, line 123
> > <https://reviews.apache.org/r/43948/diff/1/?file=1267791#file1267791line123>
> >
> >     If this happens during cluster install, why don't we put a dependency in role_command_order.json that RM must start after ATS?
> >
> >     If ATS is on host1 and RM on host2, and during a fresh cluster install we fail to install ATS, then RM will keep waiting.
>
> Sebastian Toader wrote:
>     role_command_order.json won't work with Blueprints, as with Blueprints there is no cluster-wide ordering.
>
>     RM will keep waiting only until it exhausts the retries (8 * 20 secs).
>
> Alejandro Fernandez wrote:
>     Can we make Blueprints respect role_command_order?
>     Please include Robert Nettleton in the code review.
>
> Andrew Onischuk wrote:
>     Alejandro, we made BP not respect RCO to speed up deployments for one of the users, for whom timing is really critical. And if we revert that change we are going to run into that problem for him again.
>
> Alejandro Fernandez wrote:
>     I think this is a fix for one item in a larger picture. If BP doesn't respect RCO, then there are bound to be far more errors like this related to ordering.
>     In which case, we may spend a lot of effort adding hacks to components to keep retrying certain operations because other components on other hosts are not fully up;
>     think of Hive, Spark, History Server, and Tez trying to upload tarballs to HDFS, which may not be ready yet.
>
>     A more flexible way of fixing this is to enable auto-start for this environment. If RM fails because ATS hasn't yet created directories in HDFS, then keep retrying RM. That's a simpler and more general solution.
>
>     @Sumit Mohanty, what do you think?
I think it's ok for 2.2.2, but we should create a Jira for 2.4 to handle the case of Blueprints ignoring RCO more generally. Would you mind adding some Python comments so we know why this was added? I don't know if there's a way to run that check only if the cluster was installed via Blueprints.

- Alejandro


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43948/#review120537
-----------------------------------------------------------


On Feb. 25, 2016, 1:14 p.m., Sebastian Toader wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/43948/
> -----------------------------------------------------------
>
> (Updated Feb. 25, 2016, 1:14 p.m.)
>
>
> Review request for Ambari, Alejandro Fernandez, Andrew Onischuk, Sumit Mohanty, and Sid Wagle.
>
>
> Bugs: AMBARI-15158
>     https://issues.apache.org/jira/browse/AMBARI-15158
>
>
> Repository: ambari
>
>
> Description
> -------
>
> If ATS is installed, then ResourceManager will, after starting, check whether the directories where ATS stores timeline data for active and completed applications exist in DFS. There might be cases when RM comes up much earlier than ATS creates these directories. In those situations RM stops with an "IOException: /ats/active does not exist" error message.
>
> To avoid this, the Python script responsible for starting the RM component has been modified to check for the existence of these directories upfront, before the RM process is started. This check is performed only if ATS is installed and has either yarn.timeline-service.entity-group-fs-store.active-dir or yarn.timeline-service.entity-group-fs-store.done-dir set.
>
>
> Diffs
> -----
>
>   ambari-common/src/main/python/resource_management/libraries/providers/hdfs_resource.py b73ae56
>   ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/params_linux.py 2ef404d
>   ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/resourcemanager.py ec7799e
>
> Diff: https://reviews.apache.org/r/43948/diff/
>
>
> Testing
> -------
>
> Manual testing:
> 1. Created secure and non-secure clusters with a Blueprint where NN, RM and ATS were deployed to different nodes. Tested with webhdfs both enabled and disabled in HDFS.
> 2. Created a cluster using the UI where NN, RM and ATS were deployed to different nodes. After the cluster was kerberized, it was tested with webhdfs both enabled and disabled in HDFS.
>
> Python test results:
> ----------------------------------------------------------------------
> Total run: 902
> Total errors: 0
> Total failures: 0
> OK
>
>
> Thanks,
>
> Sebastian Toader
>
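The upfront check described above boils down to polling HDFS for the ATS active-dir/done-dir paths before launching the RM process. Below is a minimal sketch of that idea, assuming plain calls to the "hdfs dfs -test -d" CLI rather than Ambari's resource_management wrappers; the helper name and example paths are hypothetical, and the retry budget (8 * 20 secs) is the one mentioned in the thread above.

    # Sketch only, not the actual patch: poll HDFS until the ATS
    # entity-group-fs-store directories exist, then let RM start.
    import subprocess
    import time

    def wait_for_ats_dirs(dirs, tries=8, sleep_secs=20):
        """Return True once every path in `dirs` exists as a directory in HDFS,
        or False after exhausting the retries (8 * 20 secs by default)."""
        for _ in range(tries):
            # "hdfs dfs -test -d <path>" exits 0 when <path> is an existing directory.
            missing = [d for d in dirs
                       if subprocess.call(["hdfs", "dfs", "-test", "-d", d]) != 0]
            if not missing:
                return True
            time.sleep(sleep_secs)
        return False

    # Example values for yarn.timeline-service.entity-group-fs-store.active-dir/done-dir
    if not wait_for_ats_dirs(["/ats/active", "/ats/done"]):
        raise Exception("ATS active/done directories were not found in HDFS")

The actual patch presumably routes this through the hdfs_resource.py provider and params_linux.py listed in the Diffs rather than shelling out directly, which would also explain why both the webhdfs-enabled and webhdfs-disabled code paths were exercised in testing.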
