Matthew Rocklin created ARROW-7486:
--------------------------------------
Summary: Allow HDFS FileSystem to be created without Hadoop present
Key: ARROW-7486
URL: https://issues.apache.org/jira/browse/ARROW-7486
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Matthew Rocklin
I would like to be able to construct an HDFS FileSystem object on a machine
without Hadoop installed. I don't need it to be able to actually do anything.
I just need creating it to not fail.
This would enable Dask users to run computations on an HDFS-enabled cluster
from outside of that cluster. This almost works today. We send a small
computation to a worker (which has HDFS access) to generate the task graph for
loading data, and then we bring that task graph back to the local machine,
continue building on it, and then finally submit everything off to the workers
for execution.
The flaw appears when we bring the task graph back from the worker to the
client. It contains a reference to a PyArrow HDFSFileSystem object, which upon
de-serialization calls _maybe_set_hadoop_classpath(). I suspect that if this
call were allowed to fail, things would work out OK for us.
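A minimal sketch of the failure mode and of the requested behavior, using
hypothetical stand-in classes rather than PyArrow itself. StrictFS, LenientFS,
setup_hadoop_env, HADOOP_AVAILABLE and the "usable" attribute are all made up
for illustration. The sketch assumes that de-serialization re-runs the
constructor through __reduce__, and that the constructor performs the
Hadoop-dependent environment setup:

```python
import warnings

HADOOP_AVAILABLE = True  # hypothetical flag: True on a worker, False on the client


def setup_hadoop_env():
    """Hypothetical stand-in for classpath setup that needs a local Hadoop."""
    if not HADOOP_AVAILABLE:
        raise RuntimeError("Hadoop not found on this machine")


class StrictFS:
    """Construction fails whenever the environment setup fails."""

    def __init__(self, host, port):
        self.host, self.port = host, port
        self._init_env()

    def _init_env(self):
        setup_hadoop_env()
        self.usable = True

    def __reduce__(self):
        # Unpickling calls the returned callable with these args,
        # so the constructor runs again on the receiving machine.
        return (type(self), (self.host, self.port))


class LenientFS(StrictFS):
    """Construction succeeds; a failed setup only marks the object unusable."""

    def _init_env(self):
        try:
            setup_hadoop_env()
            self.usable = True
        except RuntimeError as exc:
            warnings.warn(f"HDFS is not available here: {exc}")
            self.usable = False


def rebuild(obj):
    """Replay the constructor call that unpickling would make."""
    ctor, args = obj.__reduce__()
    return ctor(*args)


# "On the worker": Hadoop is present, so both objects can be created.
strict = StrictFS("namenode", 8020)
lenient = LenientFS("namenode", 8020)

# "On the client": Hadoop is missing when the objects are rebuilt.
HADOOP_AVAILABLE = False

try:
    rebuild(strict)
    strict_failed = False
except RuntimeError:
    strict_failed = True

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fs = rebuild(lenient)
```

With the lenient variant the client can hold the object inside the task graph
and ship it back to the workers, where it is rebuilt with Hadoop present.
Whether a warning, a silent pass, or a deferred error on first use is the
right choice is a design question this sketch does not settle.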
Downstream issue originally reported here:
https://github.com/dask/dask/issues/5758