[ 
https://issues.apache.org/jira/browse/HDFS-13894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16603732#comment-16603732
 ] 

Íñigo Goiri commented on HDFS-13894:
------------------------------------

Our internal setup is an HDFS cluster on Azure VMs where the Routers are 
exposed through a load balancer.
To access metadata, we just point to the load balancer.
However, to access the data itself, we need to use HttpFS, which uses WebHDFS 
to proxy the requests to the DNs.

In core-default.xml, we set:
{code}
  <property>
    <name>fs.hdfs.impl</name>
    <value>org.apache.hadoop.hdfs.HdfsWithProxyFileSystem</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.hdfs.impl</name>
    <value>org.apache.hadoop.fs.AbstractHdfsWithProxyFileSystem</value>
  </property>
  <property>
    <name>fs.hdfs.proxy.azure-cluster-fed</name>
    <value>webhdfs://loadbalancer.azure.com:<PROXY-PORT>/</value>
  </property>
{code}

In hdfs-site.xml, we set:
{code}
  <property>
    <name>dfs.nameservices</name>
    <value>azure-cluster-fed</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.azure-cluster-fed</name>
    <value>routerinternaladdress:<RPC-PORT></value>
  </property>
{code}

Then, the user sets the environment variable {{HDFS_USE_PROXY}} to {{true}} on 
the client machine.
{{HdfsWithProxyFileSystem}} then uses the proxy address on the client machine 
and the native HDFS address when running inside the firewall.
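
As a rough illustration of the selection logic described above, here is a minimal, self-contained sketch. It is not the code from the attached patch: the class {{ProxyEndpointResolver}}, its {{resolve}} method, the fallback to the native address when no proxy is configured, and the port {{14000}} (standing in for {{<PROXY-PORT>}}) are all assumptions made for the example. Only the {{fs.hdfs.proxy.<nameservice>}} key and the {{HDFS_USE_PROXY}} variable come from the configuration above.

```java
import java.net.URI;
import java.util.Map;

/**
 * Hypothetical sketch of choosing between the proxy and the native endpoint.
 * This is NOT the implementation in HDFS-13894.000.patch.
 */
final class ProxyEndpointResolver {
    // Key prefix taken from the core-default.xml snippet in this comment.
    static final String PROXY_KEY_PREFIX = "fs.hdfs.proxy.";

    private ProxyEndpointResolver() {}

    /**
     * @param nameservice logical nameservice, e.g. "azure-cluster-fed"
     * @param conf        flattened configuration (stand-in for a Hadoop Configuration)
     * @param useProxyEnv raw value of HDFS_USE_PROXY; may be null when unset
     * @return the proxy URI if requested and configured, else the native hdfs:// URI
     */
    static URI resolve(String nameservice, Map<String, String> conf, String useProxyEnv) {
        // Boolean.parseBoolean(null) is false, so an unset variable means "native".
        boolean useProxy = Boolean.parseBoolean(useProxyEnv);
        String proxy = conf.get(PROXY_KEY_PREFIX + nameservice);
        if (useProxy && proxy != null) {
            return URI.create(proxy);
        }
        // Assumption: fall back to the native address when no proxy is configured.
        return URI.create("hdfs://" + nameservice);
    }

    public static void main(String[] args) {
        // 14000 is only a placeholder for <PROXY-PORT>.
        Map<String, String> conf = Map.of(
            "fs.hdfs.proxy.azure-cluster-fed", "webhdfs://loadbalancer.azure.com:14000/");
        System.out.println(
            resolve("azure-cluster-fed", conf, System.getenv("HDFS_USE_PROXY")));
    }
}
```

With this shape, the same client configuration can be shipped both outside and inside the firewall; only the environment variable differs between the two.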

> Access HDFS through a proxy and natively
> ----------------------------------------
>
>                 Key: HDFS-13894
>                 URL: https://issues.apache.org/jira/browse/HDFS-13894
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Íñigo Goiri
>            Assignee: Íñigo Goiri
>            Priority: Major
>         Attachments: HDFS-13894.000.patch
>
>
> HDFS deployments are usually behind a firewall where one can access the 
> Namenode but not the Datanodes. To mitigate this situation, there are proxies 
> that catch the DN requests (e.g., HttpFS). However, if a user submits a job 
> using the HttpFS endpoint, all the workers will use that endpoint, which will 
> usually become a bottleneck.
> We should create a new filesystem that supports accessing both:
> * HttpFS for submission from outside the firewall
> * HDFS from within the cluster



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
