[
https://issues.apache.org/jira/browse/ARROW-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yaqub Alwan updated ARROW-8240:
-------------------------------
Description:
I'll preface this with the limited setup I had to do:
{{export CLASSPATH=$(hadoop classpath --glob)}}
{{export ARROW_LIBHDFS_DIR=/opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.4/lib64}}
Then I ran the following:
{code}
In [1]: import pyarrow.fs
In [2]: c = pyarrow.fs.HadoopFileSystem()
In [3]: sel = pyarrow.fs.FileSelector('/user/rwiumli')
In [4]: c.get_target_stats(sel)
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-4-f92157e01e47> in <module>
----> 1 c.get_target_stats(sel)
~/tmp/venv/lib/python3.6/site-packages/pyarrow/_fs.pyx in pyarrow._fs.FileSystem.get_target_stats()
~/tmp/venv/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
~/tmp/venv/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
OSError: HDFS list directory failed, errno: 2 (No such file or directory)
In [5]: sel = pyarrow.fs.FileSelector('.')
In [6]: c.get_target_stats(sel)
Out[6]:
[<FileStats for 'sandeep': type=FileType.Directory>,
<FileStats for 'venv': type=FileType.Directory>,
<FileStats for 'sample.py': type=FileType.File, size=506>]
In [7]: !ls
sample.py sandeep venv
In [8]:
{code}
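A quick way to see why the two selectors behave so differently: neither {{'/user/rwiumli'}} nor {{'.'}} carries a filesystem scheme, so if the handle has silently fallen back to the local driver, nothing in the path itself can redirect it to HDFS. A minimal stdlib-only illustration (this only inspects the strings, it does not touch pyarrow):

```python
from urllib.parse import urlsplit

# Neither path names a filesystem scheme, so a handle that has fallen
# back to the local driver will resolve both against the local FS.
absolute = urlsplit('/user/rwiumli')
relative = urlsplit('.')

assert absolute.scheme == ''              # no 'hdfs' anywhere in the path
assert relative.scheme == ''
assert not relative.path.startswith('/')  # '.' resolves against the CWD
```

That matches what we see above: the absolute path fails with errno 2 on the local FS, and the relative path lists the local working directory.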
It looks like the new Hadoop FS interface is doing a local lookup: the absolute HDFS path fails with errno 2, while the relative path {{'.'}} lists my local working directory. OK, fine, let's try a fully qualified URI...
{code}
In [8]: sel = pyarrow.fs.FileSelector('hdfs:///user/rwiumli')  # shouldn't have to do this
In [9]: c.get_target_stats(sel)
hdfsGetPathInfo(hdfs:///user/rwiumli): getFileInfo error:
IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, expected: file:///
java.lang.IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, expected: file:///
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:662)
	at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:593)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:811)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:588)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:432)
	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1418)
hdfsListDirectory(hdfs:///user/rwiumli): FileSystem#listStatus error:
IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, expected: file:///
java.lang.IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, expected: file:///
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:662)
	at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
	at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:410)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1566)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1609)
	at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:667)
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-9-f92157e01e47> in <module>
----> 1 c.get_target_stats(sel)
~/tmp/venv/lib/python3.6/site-packages/pyarrow/_fs.pyx in pyarrow._fs.FileSystem.get_target_stats()
~/tmp/venv/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
~/tmp/venv/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
OSError: HDFS list directory failed, errno: 22 (Invalid argument)
In [10]:
{code}
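The "Wrong FS: hdfs:/user/rwiumli, expected: file:///" message is telling in two ways. First, {{hdfs:///user/rwiumli}} has an *empty* authority (no namenode/nameservice between the second and third slash), which is why Hadoop prints it back as {{hdfs:/user/rwiumli}}. Second, and more importantly, the underlying Java client reports it expected {{file:///}}, i.e. it was initialized as the *local* filesystem, consistent with the local-lookup theory above. The URI structure can be checked with the stdlib alone:

```python
from urllib.parse import urlsplit

# 'hdfs:///user/rwiumli' is a URI with an *empty* authority; Hadoop's
# Path renders such URIs as 'hdfs:/user/rwiumli' in the error message.
uri = urlsplit('hdfs:///user/rwiumli')
assert uri.scheme == 'hdfs'
assert uri.netloc == ''            # empty authority: no nameservice given
assert uri.path == '/user/rwiumli'
```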
And here's the rub: the legacy pyarrow.hdfs interface connects to the cluster and lists the same plain path without any trouble.
{code}
In [10]: c = pyarrow.hdfs.HadoopFileSystem()
20/03/27 09:16:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
In [11]: c.ls('/user/rwiumli')
Out[11]:
['hdfs://nameservice/user/rwiumli/.Trash',
'hdfs://nameservice/user/rwiumli/.sparkStaging',
'hdfs://nameservice/user/rwiumli/.staging',
'hdfs://nameservice/user/rwiumli/acceptance',
'hdfs://nameservice/user/rwiumli/copy_test',
'hdfs://nameservice/user/rwiumli/hive-site.xml',
'hdfs://nameservice/user/rwiumli/mli',
'hdfs://nameservice/user/rwiumli/model_63702762843888.txt',
'hdfs://nameservice/user/rwiumli/oozie-oozi',
'hdfs://nameservice/user/rwiumli/sqoop',
'hdfs://nameservice/user/rwiumli/test',
'hdfs://nameservice/user/rwiumli/test_all.yml',
'hdfs://nameservice/user/rwiumli/user']
In [12]:
{code}
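Note that the legacy {{ls()}} returns fully qualified {{hdfs://nameservice/...}} URIs, while the new {{FileSelector}} should take bare paths. If one needs to feed legacy-API results back into either interface, a small shim like the following works (the helper name is my own, not part of pyarrow):

```python
from urllib.parse import urlsplit

def to_bare_path(uri_or_path):
    """Strip an 'hdfs://authority' prefix, if present, leaving the plain
    path. (Hypothetical helper for this report, not a pyarrow API.)"""
    parts = urlsplit(uri_or_path)
    return parts.path if parts.scheme else uri_or_path

# e.g. entries returned by the legacy ls() above:
assert to_bare_path('hdfs://nameservice/user/rwiumli/.Trash') == '/user/rwiumli/.Trash'
assert to_bare_path('/user/rwiumli') == '/user/rwiumli'   # already bare
```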
Finally, system info:
{code}
In [12]: !python --version
Python 3.6.8
In [13]: !pip list
Package Version
---------------- -------
backcall 0.1.0
decorator 4.4.1
ipython 7.12.0
ipython-genutils 0.2.0
jedi 0.16.0
joblib 0.14.1
lightgbm 2.3.1
numpy 1.18.1
parso 0.6.1
pexpect 4.8.0
pickleshare 0.7.5
pip 20.0.2
prompt-toolkit 3.0.3
ptyprocess 0.6.0
pyarrow 0.16.0
Pygments 2.5.2
scikit-learn 0.22.1
scipy 1.4.1
setuptools 45.1.0
six 1.14.0
traitlets 4.3.3
wcwidth 0.1.8
wheel 0.34.2
In [14]:
{code}
> [Python] New FS interface (pyarrow.fs) does not seem to work correctly for
> HDFS (Python 3.6, pyarrow 0.16.0)
> ------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-8240
> URL: https://issues.apache.org/jira/browse/ARROW-8240
> Project: Apache Arrow
> Issue Type: Bug
> Reporter: Yaqub Alwan
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)