[ 
https://issues.apache.org/jira/browse/ARROW-13141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-13141:
------------------------------------------
    Description: 
In the "legacy" python-specific HadoopFileSystem implementation, we have a 
{{_maybe_set_hadoop_classpath}} function which has some logic to set the 
{{CLASSPATH}} environment variable based on {{HADOOP_HOME}} or the hadoop 
executable: 
https://github.com/apache/arrow/blob/c43fab3d621bedef15470a1be43570be2026af20/python/pyarrow/hdfs.py#L134-L149

This is also mentioned in the documentation of the new HadoopFileSystem 
(https://arrow.apache.org/docs/python/filesystems.html#hadoop-file-system-hdfs):

> If CLASSPATH is not set, then it will be set automatically if the hadoop 
> executable is in your system path, or if HADOOP_HOME is set.

However, this sentence was probably copied over from the docs about the 
legacy filesystem: the new HadoopFileSystem implementation has no logic to 
set up {{CLASSPATH}} automatically. 

Do we want to add this logic to the new implementation as well (in Cython, or 
directly in C++)? If not, we should update the docs to clarify that 
{{CLASSPATH}} must be set explicitly.
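
To make the question concrete, here is a minimal Python sketch of the kind of 
fallback described above. It is not the code at the linked lines: the names 
{{hadoop_classpath}} and {{maybe_set_classpath}} and the injectable {{run}} 
parameter are hypothetical, and the "already set" check is simplified to 
"CLASSPATH is non-empty", following the sentence quoted from the docs. It 
assumes the {{hadoop classpath --glob}} command is available.

```python
import os
import subprocess


def hadoop_classpath(hadoop_bin="hadoop", run=subprocess.check_output):
    """Ask the hadoop launcher for its classpath with wildcards expanded."""
    out = run([hadoop_bin, "classpath", "--glob"])
    return out.decode("utf-8").strip()


def maybe_set_classpath(env=None, run=subprocess.check_output):
    """Set CLASSPATH in `env` if it is missing; return the value in effect.

    Hypothetical helper: `env` defaults to os.environ, and `run` is
    injectable so the logic can be exercised without a Hadoop install.
    """
    env = os.environ if env is None else env
    if env.get("CLASSPATH"):
        # Respect a value the user has already provided.
        return env["CLASSPATH"]
    if "HADOOP_HOME" in env:
        # Prefer the launcher that ships with the configured installation.
        hadoop_bin = os.path.join(env["HADOOP_HOME"], "bin", "hadoop")
    else:
        # Otherwise rely on `hadoop` being found on the system path.
        hadoop_bin = "hadoop"
    env["CLASSPATH"] = hadoop_classpath(hadoop_bin, run)
    return env["CLASSPATH"]
```

A user-side workaround along these lines would call {{maybe_set_classpath()}} 
before constructing the new HadoopFileSystem. Whether such logic belongs in 
Cython or in C++ is exactly the open question here.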

cc [~apitrou]

  was:
I the "legacy" HadoopFileSystem implementation, we have a 
{{_maybe_set_hadoop_classpath}} function which has some logic to set the 
{{CLASSPATH}} environment variable based on {{HADOOP_HOME}} or the hadoop 
executable: 
https://github.com/apache/arrow/blob/c43fab3d621bedef15470a1be43570be2026af20/python/pyarrow/hdfs.py#L134-L149

This is also mentioned in the documentation of the new HadoopFileSystem 
(https://arrow.apache.org/docs/python/filesystems.html#hadoop-file-system-hdfs):

> If CLASSPATH is not set, then it will be set automatically if the hadoop 
> executable is in your system path, or if HADOOP_HOME is set.

However, this sentence was probably simply copied over from the docs about the 
legacy filesystem. And for the new HadoopFileSystem implementation, we don't 
have this logic to automatically set up {{CLASSPATH}}. 

Do we want to add this logic to the new implementation as well? (in cython, or 
actually in C++?) Or if not, we should update the docs to clarify that 
{{CLASSPATH}} is actually required.

cc [~apitrou]


> [C++][Python] HadoopFileSystem: automatically set CLASSPATH based on 
> HADOOP_HOME env variable?
> ----------------------------------------------------------------------------------------------
>
>                 Key: ARROW-13141
>                 URL: https://issues.apache.org/jira/browse/ARROW-13141
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: filesystem, hdfs


