[ https://issues.apache.org/jira/browse/ARROW-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoine Pitrou closed ARROW-8240.
---------------------------------
    Assignee:     (was: Krisztian Szucs)
    Resolution: Works for Me

I can't reproduce with PyArrow 4.0.0. [~yalwan-iqvia] If you still encounter this problem on the latest PyArrow version, feel free to ping.

> [Python] New FS interface (pyarrow.fs) does not seem to work correctly for HDFS (Python 3.6, pyarrow 0.16.0)
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-8240
>                 URL: https://issues.apache.org/jira/browse/ARROW-8240
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Yaqub Alwan
>            Priority: Major
>              Labels: HDFS, filesystem, hdfs
>
> I'll preface this with the limited setup I had to do:
> {{export CLASSPATH=$(hadoop classpath --glob)}}
> {{export ARROW_LIBHDFS_DIR=/opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.4/lib64}}
>
> Then I ran the following:
> {code}
> In [1]: import pyarrow.fs
>
> In [2]: c = pyarrow.fs.HadoopFileSystem()
>
> In [3]: sel = pyarrow.fs.FileSelector('/user/rwiumli')
>
> In [4]: c.get_target_stats(sel)
> ---------------------------------------------------------------------------
> OSError                                   Traceback (most recent call last)
> <ipython-input-4-f92157e01e47> in <module>
> ----> 1 c.get_target_stats(sel)
> ~/tmp/venv/lib/python3.6/site-packages/pyarrow/_fs.pyx in pyarrow._fs.FileSystem.get_target_stats()
> ~/tmp/venv/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
> ~/tmp/venv/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
> OSError: HDFS list directory failed, errno: 2 (No such file or directory)
>
> In [5]: sel = pyarrow.fs.FileSelector('.')
>
> In [6]: c.get_target_stats(sel)
> Out[6]:
> [<FileStats for 'sandeep': type=FileType.Directory>,
>  <FileStats for 'venv': type=FileType.Directory>,
>  <FileStats for 'sample.py': type=FileType.File, size=506>]
>
> In [7]: !ls
> sample.py  sandeep  venv
> {code}
> It looks like the new hadoop fs interface is doing a local lookup?
> Ok fine...
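For anyone who lands here on a recent PyArrow: the new interface takes an explicit connection rather than probing the environment, and the API was renamed along the way ({{get_target_stats()}}/{{FileStats}} became {{get_file_info()}}/{{FileInfo}}). A minimal sketch, assuming the {{CLASSPATH}}/{{ARROW_LIBHDFS_DIR}} setup above and a cluster whose default filesystem is set in its Hadoop configuration:

{code}
import pyarrow.fs

# "default" resolves the namenode from fs.defaultFS in the Hadoop
# configuration found on the CLASSPATH, so a plain path such as
# '/user/rwiumli' is looked up on HDFS, not on the local filesystem.
hdfs = pyarrow.fs.HadoopFileSystem("default")

sel = pyarrow.fs.FileSelector('/user/rwiumli')
for info in hdfs.get_file_info(sel):  # formerly get_target_stats()
    print(info.path, info.type)
{code}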
> {code}
> In [8]: sel = pyarrow.fs.FileSelector('hdfs:///user/rwiumli')  # shouldn't have to do this
>
> In [9]: c.get_target_stats(sel)
> hdfsGetPathInfo(hdfs:///user/rwiumli): getFileInfo error:
> IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, expected: file:///
> java.lang.IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, expected: file:///
>         at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:662)
>         at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
>         at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:593)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:811)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:588)
>         at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:432)
>         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1418)
> hdfsListDirectory(hdfs:///user/rwiumli): FileSystem#listStatus error:
> IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, expected: file:///
> java.lang.IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, expected: file:///
>         at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:662)
>         at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
>         at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:410)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1566)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1609)
>         at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:667)
> ---------------------------------------------------------------------------
> OSError                                   Traceback (most recent call last)
> <ipython-input-9-f92157e01e47> in <module>
> ----> 1 c.get_target_stats(sel)
> ~/tmp/venv/lib/python3.6/site-packages/pyarrow/_fs.pyx in pyarrow._fs.FileSystem.get_target_stats()
> ~/tmp/venv/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
> ~/tmp/venv/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
> OSError: HDFS list directory failed, errno: 22 (Invalid argument)
> {code}
> And here's the rub:
> {code}
> In [10]: c = pyarrow.hdfs.HadoopFileSystem()
> 20/03/27 09:16:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>
> In [11]: c.ls('/user/rwiumli')
> Out[11]:
> ['hdfs://nameservice/user/rwiumli/.Trash',
>  'hdfs://nameservice/user/rwiumli/.sparkStaging',
>  'hdfs://nameservice/user/rwiumli/.staging',
>  'hdfs://nameservice/user/rwiumli/acceptance',
>  'hdfs://nameservice/user/rwiumli/copy_test',
>  'hdfs://nameservice/user/rwiumli/hive-site.xml',
>  'hdfs://nameservice/user/rwiumli/mli',
>  'hdfs://nameservice/user/rwiumli/model_63702762843888.txt',
>  'hdfs://nameservice/user/rwiumli/oozie-oozi',
>  'hdfs://nameservice/user/rwiumli/sqoop',
>  'hdfs://nameservice/user/rwiumli/test',
>  'hdfs://nameservice/user/rwiumli/test_all.yml',
>  'hdfs://nameservice/user/rwiumli/user']
> {code}
> Finally, system info:
> {code}
> In [12]: !python --version
> Python 3.6.8
>
> In [13]: !pip list
> Package          Version
> ---------------- -------
> backcall         0.1.0
> decorator        4.4.1
> ipython          7.12.0
> ipython-genutils 0.2.0
> jedi             0.16.0
> joblib           0.14.1
> lightgbm         2.3.1
> numpy            1.18.1
> parso            0.6.1
> pexpect          4.8.0
> pickleshare      0.7.5
> pip              20.0.2
> prompt-toolkit   3.0.3
> ptyprocess       0.6.0
> pyarrow          0.16.0
> Pygments         2.5.2
> scikit-learn     0.22.1
> scipy            1.4.1
> setuptools       45.1.0
> six              1.14.0
> traitlets        4.3.3
> wcwidth          0.1.8
> wheel            0.34.2
> {code}
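On a recent PyArrow, the URI form tried above also has a supported equivalent: {{FileSystem.from_uri()}} splits a URI into a filesystem and a path within it, so the {{hdfs://}} scheme selects the Hadoop filesystem instead of being handed to the local one. A minimal sketch, reusing the {{nameservice}} authority visible in the {{pyarrow.hdfs}} listing above:

{code}
import pyarrow.fs

# from_uri() returns a (filesystem, path) pair: the scheme/authority
# choose the filesystem and only the bare path is passed to it, which
# avoids the "Wrong FS ... expected: file:///" failure above.
hdfs, path = pyarrow.fs.FileSystem.from_uri('hdfs://nameservice/user/rwiumli')

print(type(hdfs).__name__)  # HadoopFileSystem
print(path)                 # /user/rwiumli
{code}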