Wes McKinney updated ARROW-7486:
--------------------------------
Labels: hadoop (was: )
> [Python] Allow HDFS FileSystem to be created without Hadoop present
> -------------------------------------------------------------------
>
> Key: ARROW-7486
> URL: https://issues.apache.org/jira/browse/ARROW-7486
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Matthew Rocklin
> Priority: Minor
> Labels: hadoop
>
> I would like to be able to construct an HDFS FileSystem object on a machine
> without Hadoop installed. I don't need it to actually do anything; I just
> need its construction not to fail.
> This would enable Dask users to run computations on an HDFS-enabled cluster
> from outside of that cluster. This almost works today: we send a small
> computation to a worker (which has HDFS access) to generate the task graph
> for loading data, bring that task graph back to the local machine, continue
> building on it, and finally submit everything to the workers for execution.
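> A minimal sketch of that round trip, assuming a dask.distributed cluster
> whose workers have Hadoop installed; the scheduler address and HDFS path
> below are placeholders:
> {code:python}
> import dask.dataframe as dd
> from dask.distributed import Client
>
> client = Client("tcp://scheduler:8786")  # hypothetical scheduler address
>
>
> def build_graph():
>     # Runs on a worker that has Hadoop/libhdfs available.
>     return dd.read_parquet("hdfs:///data/table.parquet")  # placeholder path
>
>
> # Bringing the dask collection (and its task graph) back deserializes a
> # PyArrow HDFS filesystem object on the client machine, which has no
> # Hadoop; this is where the failure described below occurs.
> df = client.submit(build_graph).result()
>
> # The actual I/O still happens on the workers at compute time.
> total = df.sum().compute()
> {code}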
> The flaw is in bringing the task graph from the worker back to the client:
> it contains a reference to a PyArrow HDFSFileSystem object, which upon
> de-serialization calls _maybe_set_hadoop_classpath(). I suspect that if
> this call were allowed to fail, things would work out OK for us.
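> A minimal sketch of the kind of guard being requested; the __setstate__
> hook and the stand-in helper below are illustrative assumptions, only the
> name _maybe_set_hadoop_classpath() is taken from this issue:
> {code:python}
> import pickle
>
>
> def _maybe_set_hadoop_classpath():
>     # Stand-in for the pyarrow helper named above, so the sketch is
>     # self-contained; here it fails the way it would without Hadoop.
>     raise FileNotFoundError("hadoop executable not found")
>
>
> class HDFSFileSystem:
>     """Illustrative stand-in for the PyArrow filesystem object."""
>
>     def __setstate__(self, state):
>         self.__dict__.update(state)
>         try:
>             # An unconditional call like this is what fails today on
>             # machines without Hadoop installed.
>             _maybe_set_hadoop_classpath()
>         except (FileNotFoundError, OSError):
>             # Hadoop is absent; defer the error until the filesystem
>             # is actually used for I/O.
>             pass
>
>
> fs = HDFSFileSystem()
> fs.host = "namenode"  # placeholder attribute so there is state to restore
> restored = pickle.loads(pickle.dumps(fs))  # no longer raises at load time
> {code}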
> Downstream issue originally reported here:
> https://github.com/dask/dask/issues/5758