[ https://issues.apache.org/jira/browse/SPARK-6662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392963#comment-14392963 ]
Cheolsoo Park commented on SPARK-6662:
--------------------------------------

[~srowen], thank you for your comment.

{quote}
Wouldn't you be able to query for the YARN RM address somewhere and include it in the config?
{quote}

In a typical cloud deployment, there is usually a shared gateway from which users connect to various clusters, and only a few Spark configs are shared by all the clusters. Furthermore, clusters are usually transient in the cloud, so I'd like to avoid adding any cluster-specific information to the Spark configs. My current workaround is grep'ing {{yarn.resourcemanager.hostname}} out of yarn-site.xml in my custom job launch script on the gateway and passing it via the {{--conf}} option on every job launch. The intention was to get rid of this hacky bit in my launch script.

{quote}
I am somewhat concerned about adding a narrow bit of support for one particular substitution, which in turn is to support a specific assumption in one type of deployment.
{quote}

Yes, I understand your concern. Even though I have a specific problem to solve at hand, I filed this jira hoping that general variable substitution would be added to Spark config. In fact, I made an attempt in that direction but quickly ran into the following problems:
# Adding general variable substitution to Spark conf doesn't solve my problem. Since the Spark config and the YARN config are separate entities in Spark, I cannot cross-reference properties from one to the other.
# Alternatively, I could introduce special logic for {{spark.yarn.historyServer.address}}, assuming the RM and HS run on the same node. Since the Spark AM already knows the RM address, this is trivial to implement. But it makes an even more specific assumption about the deployment.

It looks to me like implementing general variable substitution with cross-referencing would involve quite a bit of refactoring. So I compromised: I introduced variable substitution only for the {{spark.yarn.}} properties. In fact, variable substitution already works for the {{spark.hadoop.}} properties.
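For anyone unfamiliar with the Hadoop behavior being referenced: Hadoop's {{Configuration}} expands {{$\{name\}}} references in a property value against other properties at read time, with a cap on nesting depth. A minimal Python sketch of that behavior (an illustration only, not the actual {{org.apache.hadoop.conf.Configuration}} code; property values here are invented):

```python
import re

# Sketch of Hadoop-Configuration-style variable substitution.
VAR_PATTERN = re.compile(r"\$\{([^}]+)\}")
MAX_DEPTH = 20  # Hadoop similarly caps substitution depth to catch cycles


def substitute(props, value, depth=0):
    """Expand ${name} references in `value` using `props`, recursively."""
    if depth > MAX_DEPTH:
        raise ValueError("too many levels of variable substitution")
    match = VAR_PATTERN.search(value)
    if match is None:
        return value
    replacement = props.get(match.group(1))
    if replacement is None:
        return value  # unresolved references are left as-is
    expanded = value[:match.start()] + replacement + value[match.end():]
    return substitute(props, expanded, depth + 1)


props = {
    "yarn.resourcemanager.hostname": "ip-10-0-0-1.ec2.internal",
    "spark.yarn.historyServer.address": "${yarn.resourcemanager.hostname}:18080",
}
print(substitute(props, props["spark.yarn.historyServer.address"]))
# ip-10-0-0-1.ec2.internal:18080
```

Under this model, reading {{spark.yarn.historyServer.address}} through the YARN config would resolve the RM hostname automatically, which is exactly the side effect the {{spark.hadoop.}} properties already enjoy.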
If you look at the code, all the {{spark.hadoop.}} properties are already copied over to the YARN config and read via the YARN config. As a side effect, they support variable substitution. I am just expanding the scope of this *secret* feature to the {{spark.yarn.}} properties.

For now, I can live with my current workaround. But I wanted to point out that it is not user-friendly to ask users to pass an explicit hostname and port number to make use of the HS. In fact, I'm not aware of any other property that causes the same pain in YARN mode. For example, the RM address for {{spark.master}} is dynamically picked up from yarn-site.xml. The HS address should be handled in a similar manner IMO.

Hope this explains my thought process well enough.

> Allow variable substitution in spark.yarn.historyServer.address
> ---------------------------------------------------------------
>
>                 Key: SPARK-6662
>                 URL: https://issues.apache.org/jira/browse/SPARK-6662
>             Project: Spark
>          Issue Type: Wish
>          Components: YARN
>    Affects Versions: 1.3.0
>            Reporter: Cheolsoo Park
>            Priority: Minor
>              Labels: yarn
>
> In Spark on YARN, an explicit hostname and port number need to be set for
> "spark.yarn.historyServer.address" in SparkConf to make the HISTORY link work. If
> the history server address is known and static, this is usually not a problem.
> But in the cloud, that is usually not true. Particularly in EMR, the history
> server always runs on the same node as the RM. So I could simply set it to
> {{$\{yarn.resourcemanager.hostname\}:18080}} if variable substitution were
> allowed.
> In fact, Hadoop configuration already implements variable substitution, so if
> this property were read via YarnConf, this could be achieved easily.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
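For concreteness, the gateway launch-script workaround described above can be sketched roughly as follows (the default path and function names are hypothetical, not the actual script):

```python
import xml.etree.ElementTree as ET


def rm_hostname(yarn_site_path):
    """Pull yarn.resourcemanager.hostname out of a yarn-site.xml file."""
    root = ET.parse(yarn_site_path).getroot()
    for prop in root.findall("property"):
        if prop.findtext("name") == "yarn.resourcemanager.hostname":
            return prop.findtext("value")
    return None


def history_server_args(yarn_site_path="/etc/hadoop/conf/yarn-site.xml"):
    """Extra spark-submit flags the launch script injects (path is an assumption)."""
    host = rm_hostname(yarn_site_path)
    if host is None:
        return []
    return ["--conf", f"spark.yarn.historyServer.address={host}:18080"]
```

This is exactly the "hacky bit" the requested substitution would make unnecessary: with {{$\{yarn.resourcemanager.hostname\}:18080}} resolvable in the config itself, no per-launch flag injection is needed.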