[ https://issues.apache.org/jira/browse/ARROW-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-7486:
--------------------------------
    Summary: [Python] Allow HDFS FileSystem to be created without Hadoop 
present  (was: Allow HDFS FileSystem to be created without Hadoop present)

> [Python] Allow HDFS FileSystem to be created without Hadoop present
> -------------------------------------------------------------------
>
>                 Key: ARROW-7486
>                 URL: https://issues.apache.org/jira/browse/ARROW-7486
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Matthew Rocklin
>            Priority: Minor
>
> I would like to be able to construct an HDFS FileSystem object on a machine 
> without Hadoop installed.  I don't need it to actually do anything; I just 
> need creating it to not fail.
> This would enable Dask users to run computations on an HDFS-enabled cluster 
> from outside of that cluster.  This almost works today.  We send a small 
> computation to a worker (which has HDFS access) to generate the task graph 
> for loading data, then we bring that task graph back to the local machine, 
> continue building on it, and finally submit everything to the workers for 
> execution.
> The flaw is when we bring the task graph from the worker back to the 
> client.  It contains a reference to a PyArrow HDFSFileSystem object, which 
> upon de-serialization calls _maybe_set_hadoop_classpath().  I suspect that 
> if this call were allowed to fail gracefully, things would work out OK for 
> us.
> Downstream issue originally reported here: 
> https://github.com/dask/dask/issues/5758
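For illustration, a minimal Python sketch of the behaviour being requested, not
pyarrow's actual implementation: the helper body below is a hypothetical
stand-in for _maybe_set_hadoop_classpath(), and the exact
"hadoop classpath --glob" invocation is an assumption.  The idea is simply to
treat a missing Hadoop CLI as non-fatal so that constructing (or
de-serializing) the filesystem object on a Hadoop-less client does not raise.

    import os
    import subprocess

    def _maybe_set_hadoop_classpath():
        # Hypothetical stand-in for pyarrow's helper of the same name.
        if "CLASSPATH" in os.environ:
            return
        try:
            # Derive the classpath by shelling out to the Hadoop CLI; the
            # exact command here is assumed for illustration.
            classpath = subprocess.check_output(
                ["hadoop", "classpath", "--glob"]
            )
        except (FileNotFoundError, subprocess.CalledProcessError):
            # Proposed change: no usable Hadoop on this machine -- return
            # quietly instead of raising, so de-serializing an
            # HDFSFileSystem object on the client still succeeds.
            return
        os.environ["CLASSPATH"] = classpath.decode("utf-8").strip()

With something along these lines, the filesystem object would deserialize
cleanly on the client and only fail later if an actual HDFS operation were
attempted there.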



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
