Jim Crist created ARROW-2081:
--------------------------------

             Summary: Hdfs client isn't fork-safe
                 Key: ARROW-2081
                 URL: https://issues.apache.org/jira/browse/ARROW-2081
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Jim Crist
Given the following script:

{code:python}
import multiprocessing as mp

import pyarrow as pa


def ls(h):
    print("calling ls")
    return h.ls("/tmp")


if __name__ == '__main__':
    h = pa.hdfs.connect()

    print("Using 'spawn'")
    pool = mp.get_context('spawn').Pool(2)
    results = pool.map(ls, [h, h])
    sol = h.ls("/tmp")
    for r in results:
        assert r == sol
    print("'spawn' succeeded\n")

    print("Using 'fork'")
    pool = mp.get_context('fork').Pool(2)
    results = pool.map(ls, [h, h])
    sol = h.ls("/tmp")
    for r in results:
        assert r == sol
    print("'fork' succeeded")
{code}

Results in the following output:

{code}
$ python test.py
Using 'spawn'
calling ls
calling ls
'spawn' succeeded

Using 'fork
{code}

The process then hangs, and I have to `kill -9` the forked worker processes.

I'm unable to get the libhdfs3 driver to work, so I'm unsure whether this is a problem with libhdfs itself or just Arrow's use of it (a quick Google search didn't turn up anything useful).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)