[ 
https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15999415#comment-15999415
 ] 

Yuechen Chen commented on SPARK-20608:
--------------------------------------

[~vanzin] To my knowledge, Spark obtains tokens in two scenarios:
1) When a job is submitted to YARN, the Spark client obtains a token for every 
filesystem listed in spark.yarn.access.namenodes.
2) While the Spark application is running on YARN, Spark renews those tokens at 
regular intervals.
So the concern that Spark would be left holding the wrong token after a standby 
namenode becomes active may not be a real problem. (I don't know the renewal 
frequency.)
I have to say that even with both the active and standby addresses configured, 
a namenode failover that happens while a Spark job is already writing data to 
the remote HDFS will still make that job fail eventually.
But with my configuration, users do not need to care which namenode is active 
when they submit a Spark job. Without it, they have to hard-code 
spark.yarn.access.namenodes=hdfs://activeNamenode, which I think is not 
graceful.

Regarding "Doesn't it work if you add the namespace (not the NN addresses) in 
the config instead?":
Could you give a concrete example?
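If the namespace suggestion refers to HDFS HA's logical nameservice, then the idea would be: the client-side hdfs-site.xml maps one logical name to both namenode addresses, and the failover proxy provider resolves the active one automatically. A hedged sketch, assuming a remote nameservice called remotecluster on hosts namenode01/namenode02 (all names and ports here are illustrative, not from this issue):

```shell
# Client-side hdfs-site.xml entries for the remote HA cluster
# (property names are the standard HDFS HA ones; values are made up):
#   dfs.nameservices                                  remotecluster
#   dfs.ha.namenodes.remotecluster                    nn1,nn2
#   dfs.namenode.rpc-address.remotecluster.nn1        namenode01:8020
#   dfs.namenode.rpc-address.remotecluster.nn2        namenode02:8020
#   dfs.client.failover.proxy.provider.remotecluster  org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider

# Then only the logical nameservice, not a concrete namenode address,
# goes into the Spark config:
spark-submit \
  --conf spark.yarn.access.namenodes=hdfs://remotecluster \
  ...
```

If that works, application code could also write to hdfs://remotecluster/... directly instead of resolving the active namenode itself.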


> Standby namenodes should be allowed to be included in 
> spark.yarn.access.namenodes to support HDFS HA
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-20608
>                 URL: https://issues.apache.org/jira/browse/SPARK-20608
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Submit, YARN
>    Affects Versions: 2.0.1, 2.1.0
>            Reporter: Yuechen Chen
>            Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, 
> spark.yarn.access.namenodes should only need to be configured in the 
> spark-submit script, and the Spark client (on YARN) would fetch HDFS 
> credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode 
> and at least one standby namenode. 
> However, if spark.yarn.access.namenodes includes both the active and standby 
> namenodes, the Spark application will fail, because the standby namenode 
> rejects access by Spark with org.apache.hadoop.ipc.StandbyException.
> I think configuring standby namenodes in spark.yarn.access.namenodes causes 
> no harm, and it would let my Spark application survive a namenode failover.
> HA Examples:
> Spark-submit script: 
> yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark Application Codes:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)
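The selection a helper like getActiveNameNode(...) above has to perform can be sketched as a small probe loop. On a real cluster the probe would be something like `hdfs haadmin -getServiceState <serviceId>`; here the probe is passed in as a function so the logic runs standalone (pick_active and fake_state are hypothetical names for illustration, not Spark or Hadoop APIs):

```shell
# Print the first namenode whose probe reports "active".
# $1 is a command that prints a namenode's HA state ("active"/"standby");
# the remaining arguments are the candidate namenode URIs.
pick_active() {
  get_state=$1
  shift
  for nn in "$@"; do
    if [ "$("$get_state" "$nn")" = "active" ]; then
      echo "$nn"
      return 0
    fi
  done
  return 1
}

# Stand-in probe for illustration: pretends namenode02 is the active one.
fake_state() {
  if [ "$1" = "hdfs://namenode02" ]; then echo active; else echo standby; fi
}

pick_active fake_state hdfs://namenode01 hdfs://namenode02
```

Note that this resolves the active namenode only once, at call time; a failover during a long write would still hit the problem described above, which is what the HA client-side failover configuration is meant to handle.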



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
