Jim Crist commented on ARROW-448:

> Does this rely on any environment variables to work (maybe {{HADOOP_HOME}}?)

The documentation on this is a bit unclear, but I believe it just relies on the 
classpath being properly set to include the conf directory (see the "Common 
Problems" section here). 
For example, the first path in my `hadoop classpath --glob` output is 
`/etc/hadoop/conf`, which is where the configuration files live. Since pyarrow 
sets the classpath automatically by calling `hadoop`, everything works fine if 
you're using the libhdfs driver.
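
As a rough sketch of what that automatic setup amounts to (the function name here is mine, not pyarrow's; it assumes a `hadoop` executable on `PATH`):

```python
import os
import shutil
import subprocess


def ensure_hadoop_classpath():
    """Populate CLASSPATH from `hadoop classpath --glob`, as pyarrow does
    for the libhdfs driver, so the conf directory (e.g. /etc/hadoop/conf)
    is visible to the JVM. Respects an already-set CLASSPATH."""
    if "CLASSPATH" in os.environ:
        return os.environ["CLASSPATH"]
    if shutil.which("hadoop") is None:
        raise RuntimeError("`hadoop` executable not found on PATH")
    cp = subprocess.check_output(
        ["hadoop", "classpath", "--glob"]
    ).decode().strip()
    os.environ["CLASSPATH"] = cp
    return cp
```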


However, libhdfs3 doesn't work that way. From reading the source, it appears to 
look for "dfs.default.uri" in the config file specified by the `LIBHDFS3_CONF` 
environment variable, and falls back to `localhost:8020` otherwise. 
This is why hdfs3 needed to implement the auto-configuration stuff itself: 
the underlying libhdfs3 library doesn't do auto-configuration the same way.
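
In Python terms, my reading of that lookup is roughly the following (an approximation based on reading the source, not libhdfs3's exact logic; the property name and fallback are as described above):

```python
import os
import xml.etree.ElementTree as ET


def resolve_default_uri():
    """Approximate libhdfs3's lookup: read the XML config file named by
    LIBHDFS3_CONF, return the dfs.default.uri property if present, and
    fall back to localhost:8020 otherwise."""
    conf = os.environ.get("LIBHDFS3_CONF")
    if conf and os.path.exists(conf):
        for prop in ET.parse(conf).getroot().iter("property"):
            if prop.findtext("name") == "dfs.default.uri":
                return prop.findtext("value")
    return "localhost:8020"
```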


In my experience the auto-configuration in hdfs3 has been unreliable, but it 
could be emulated in simple cases. The exact ordering of setting overrides 
isn't clear to me, but the order of preference seems to be `dfs.nameservices`, 
then `dfs.namenode.rpc-address`, then `fs.defaultFS`, which are found across 
`core-site.xml` and `hdfs-site.xml`.
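
A minimal sketch of that emulation, assuming the preference order above is right (the `dfs.ha.namenodes.*` / HA resolution details are omitted, and `localhost:8020` is the fallback discussed earlier):

```python
import os
import xml.etree.ElementTree as ET


def read_hadoop_props(conf_dir):
    """Collect <name>/<value> pairs from core-site.xml and hdfs-site.xml
    in conf_dir (e.g. /etc/hadoop/conf), later files overriding earlier."""
    props = {}
    for fname in ("core-site.xml", "hdfs-site.xml"):
        path = os.path.join(conf_dir, fname)
        if not os.path.exists(path):
            continue
        for prop in ET.parse(path).getroot().iter("property"):
            name = prop.findtext("name")
            if name is not None:
                props[name] = prop.findtext("value")
    return props


def guess_namenode(props):
    """Apply the apparent preference order: nameservice, then explicit
    RPC address, then fs.defaultFS, then libhdfs3's fallback."""
    if "dfs.nameservices" in props:
        return props["dfs.nameservices"].split(",")[0]
    if "dfs.namenode.rpc-address" in props:
        return props["dfs.namenode.rpc-address"]
    if "fs.defaultFS" in props:
        return props["fs.defaultFS"].replace("hdfs://", "")
    return "localhost:8020"
```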

> [Python] Load HdfsClient default options from core-site.xml or hdfs-site.xml, 
> if available
> ------------------------------------------------------------------------------------------
>                 Key: ARROW-448
>                 URL: https://issues.apache.org/jira/browse/ARROW-448
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: Wes McKinney
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.9.0
> This will yield a nicer user experience for some users
