If the remote filesystem is visible from the other cluster, then a different HDFS 
URI, e.g. hdfs://analytics:8000/historical/, can be used for reads & writes, 
even if your defaultFS (the one where you get max performance) is, say, 
hdfs://processing:8000/
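
As a rough sketch of what that looks like (the cluster addresses, paths and 
join key below are illustrative, not from your setup):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("cross-cluster-join").getOrCreate()

  // read from the remote analytics cluster via a fully qualified URI
  val historical = spark.read.parquet("hdfs://analytics:8000/historical/events")

  // read from the local cluster; relative paths resolve against defaultFS
  // (hdfs://processing:8000/ in this example)
  val recent = spark.read.parquet("/data/recent/events")

  // join and write the result back to the analytics cluster
  recent.join(historical, "id")
    .write.parquet("hdfs://analytics:8000/historical/joined")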

- performance will be slower, in both directions
- if you have a fast pipe between the two clusters, then a job with many 
executors may unintentionally saturate the network, leading to unhappy people 
elsewhere.
- you'd better have mutual trust at the Kerberos layer. There's a configuration 
option (I forget its name) to give spark-submit a list of HDFS namenodes it 
will need to get tokens from (see the sketch after this list). Unless your 
Spark cluster is being launched with keytabs, you will need to list up front 
all the HDFS clusters your job intends to work with
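
For that last point, I believe the option is spark.yarn.access.namenodes on 
YARN (it may have a different name in other/newer versions), so the submission 
would look something like this; the namenode URIs, class and jar names are 
purely illustrative:

  spark-submit \
    --master yarn \
    --conf spark.yarn.access.namenodes=hdfs://processing:8000,hdfs://analytics:8000 \
    --class com.example.CrossClusterJob \
    myjob.jar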

On 4 Dec 2016, at 21:45, ayan guha <guha.a...@gmail.com> wrote:


Hi

Is it possible to access Hive tables sitting on multiple clusters in a single 
Spark application?

We have a data processing cluster and an analytics cluster. I want to join a table 
from the analytics cluster with another table in the processing cluster and finally 
write back to the analytics cluster.

Best
Ayan
