Jim Crist created ARROW-2025:
--------------------------------
Summary: [Python/C++] HDFS Client disconnect closes all open clients
Key: ARROW-2025
URL: https://issues.apache.org/jira/browse/ARROW-2025
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python
Reporter: Jim Crist
In the Python library, if an instance of `HadoopFileSystem` is garbage
collected, all other existing instances become invalid. I haven't checked with
a C++-only example, but from reading the Cython code I can't see how Cython
could be responsible, so I think this is a bug in the C++ library. A sketch of
a C++-only reproduction follows the session below.
{code:java}
>>> import pyarrow as pa
>>> h = pa.hdfs.connect()
18/01/24 16:54:25 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
18/01/24 16:54:26 WARN shortcircuit.DomainSocketFactory: The short-circuit
local reads feature cannot be used because libhadoop cannot be loaded.
>>> h.ls("/")
['/benchmarks', '/hbase', '/tmp', '/user', '/var']
>>> h2 = pa.hdfs.connect()
>>> del h # close one client
>>> h2.ls("/") # all filesystem operations now fail
hdfsListDirectory(/): FileSystem#listStatus error:
IOException: Filesystem closedjava.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:865)
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:2106)
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:2092)
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:743)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:113)
at org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:808)
at org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:804)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:804)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/conda/lib/python3.6/site-packages/pyarrow/hdfs.py", line 88, in ls
return super(HadoopFileSystem, self).ls(path, detail)
File "io-hdfs.pxi", line 248, in pyarrow.lib.HadoopFileSystem.ls
File "error.pxi", line 79, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS: list directory failed
>>> h2.is_open # The python object still thinks it's open
True
{code}
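Here's a rough sketch of what a C++-only reproduction might look like. This is
untested; it assumes the `arrow::io::HadoopFileSystem` API in `arrow/io/hdfs.h`
and a default cluster config (`AbortNotOK` is just a hypothetical helper). If
the bug is in the C++ layer, the second `ListDirectory` call should fail the
same way:

{code:cpp}
// Untested sketch; assumes the arrow::io::HadoopFileSystem API in arrow/io/hdfs.h.
#include <cstdlib>
#include <iostream>
#include <memory>
#include <vector>

#include "arrow/io/hdfs.h"
#include "arrow/status.h"

// Hypothetical helper: print the status and abort on error.
static void AbortNotOK(const arrow::Status& s) {
  if (!s.ok()) {
    std::cerr << s.ToString() << std::endl;
    std::abort();
  }
}

int main() {
  arrow::io::HdfsConnectionConfig config;
  config.host = "default";  // assumption: resolve fs.defaultFS from the Hadoop config
  config.port = 0;

  std::shared_ptr<arrow::io::HadoopFileSystem> fs1, fs2;
  AbortNotOK(arrow::io::HadoopFileSystem::Connect(&config, &fs1));
  AbortNotOK(arrow::io::HadoopFileSystem::Connect(&config, &fs2));

  std::vector<arrow::io::HdfsPathInfo> listing;
  AbortNotOK(fs2->ListDirectory("/", &listing));  // works before the disconnect

  AbortNotOK(fs1->Disconnect());  // close only the first client

  // If both handles wrap the same underlying hdfsFS, this now fails with
  // "Filesystem closed" even though fs2 was never disconnected.
  arrow::Status st = fs2->ListDirectory("/", &listing);
  std::cout << "after disconnect: " << st.ToString() << std::endl;
  return 0;
}
{code}

My guess (unverified) is that libhdfs's `hdfsConnect` goes through Hadoop's
cached `FileSystem.get()`, so both handles end up wrapping the same JVM-side
`FileSystem` object and the first disconnect closes it for everyone;
`hdfsConnectNewInstance` bypasses that cache, which might be relevant to a fix.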