[ https://issues.apache.org/jira/browse/ARROW-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yaqub Alwan updated ARROW-8240:
-------------------------------
    Description: 
I'll preface this with the limited setup I had to do:


{{export CLASSPATH=$(hadoop classpath --glob)}}

{{export ARROW_LIBHDFS_DIR=/opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.4/lib64}}

 
Then I ran the following:

{code}
In [1]: import pyarrow.fs

In [2]: c = pyarrow.fs.HadoopFileSystem()

In [3]: sel = pyarrow.fs.FileSelector('/user/rwiumli')

In [4]: c.get_target_stats(sel)
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-4-f92157e01e47> in <module>
----> 1 c.get_target_stats(sel)

~/tmp/venv/lib/python3.6/site-packages/pyarrow/_fs.pyx in pyarrow._fs.FileSystem.get_target_stats()

~/tmp/venv/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/tmp/venv/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

OSError: HDFS list directory failed, errno: 2 (No such file or directory)

In [5]: sel = pyarrow.fs.FileSelector('.')

In [6]: c.get_target_stats(sel)
Out[6]: 
[<FileStats for 'sandeep': type=FileType.Directory>,
 <FileStats for 'venv': type=FileType.Directory>,
 <FileStats for 'sample.py': type=FileType.File, size=506>]

In [7]: !ls
sample.py  sandeep  venv
{code}

It looks like the new {{pyarrow.fs}} Hadoop interface is doing a local filesystem lookup instead of talking to HDFS?
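Consistent with that, errno 2 in the traceback is just the generic POSIX ENOENT a local filesystem miss would produce; a quick standard-library check (nothing Arrow-specific):

```python
import errno
import os

# errno 2 reported by the failed listing is plain POSIX ENOENT --
# the ordinary "No such file or directory" a local lookup returns.
print(errno.ENOENT, os.strerror(errno.ENOENT))
```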

Ok fine...

{code}
In [8]: sel = pyarrow.fs.FileSelector('hdfs:///user/rwiumli')  # shouldn't have to do this

In [9]: c.get_target_stats(sel)
hdfsGetPathInfo(hdfs:///user/rwiumli): getFileInfo error:
IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, expected: file:///
java.lang.IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, expected: file:///
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:662)
        at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:593)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:811)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:588)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:432)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1418)
hdfsListDirectory(hdfs:///user/rwiumli): FileSystem#listStatus error:
IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, expected: file:///
java.lang.IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, expected: file:///
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:662)
        at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
        at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:410)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1566)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1609)
        at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:667)
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-9-f92157e01e47> in <module>
----> 1 c.get_target_stats(sel)

~/tmp/venv/lib/python3.6/site-packages/pyarrow/_fs.pyx in pyarrow._fs.FileSystem.get_target_stats()

~/tmp/venv/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/tmp/venv/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

OSError: HDFS list directory failed, errno: 22 (Invalid argument)
{code}
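A side note on the "Wrong FS" message: {{hdfs:///user/rwiumli}} has an empty authority component, which is presumably why it gets echoed back as {{hdfs:/user/rwiumli}} — there is no host in the URI at all. A quick standard-library check of the parsing (illustrative only, not Arrow's internals):

```python
from urllib.parse import urlparse

# 'hdfs:///user/rwiumli' carries a scheme but an *empty* authority,
# so there is no namenode host for a client to connect to.
u = urlparse('hdfs:///user/rwiumli')
print(u.scheme, repr(u.netloc), u.path)
```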

And here's the rub: the legacy {{pyarrow.hdfs}} interface handles the same path just fine:

{code}
In [10]: c = pyarrow.hdfs.HadoopFileSystem()
20/03/27 09:16:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

In [11]: c.ls('/user/rwiumli')
Out[11]: 
['hdfs://nameservice/user/rwiumli/.Trash',
 'hdfs://nameservice/user/rwiumli/.sparkStaging',
 'hdfs://nameservice/user/rwiumli/.staging',
 'hdfs://nameservice/user/rwiumli/acceptance',
 'hdfs://nameservice/user/rwiumli/copy_test',
 'hdfs://nameservice/user/rwiumli/hive-site.xml',
 'hdfs://nameservice/user/rwiumli/mli',
 'hdfs://nameservice/user/rwiumli/model_63702762843888.txt',
 'hdfs://nameservice/user/rwiumli/oozie-oozi',
 'hdfs://nameservice/user/rwiumli/sqoop',
 'hdfs://nameservice/user/rwiumli/test',
 'hdfs://nameservice/user/rwiumli/test_all.yml',
 'hdfs://nameservice/user/rwiumli/user']
{code}
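Note that the legacy client resolves the namenode ({{hdfs://nameservice}}) on its own, presumably by picking up {{fs.defaultFS}} from the Hadoop configuration on the CLASSPATH. A sketch of that kind of lookup (the helper and the config path are illustrative, not Arrow's actual code):

```python
import xml.etree.ElementTree as ET

def default_fs(core_site_path):
    # Scan core-site.xml for the fs.defaultFS property -- the setting
    # that tells Hadoop clients which filesystem "default" refers to.
    root = ET.parse(core_site_path).getroot()
    for prop in root.iter('property'):
        if prop.findtext('name') == 'fs.defaultFS':
            return prop.findtext('value')
    return None

# e.g. default_fs('/etc/hadoop/conf/core-site.xml') on this cluster
# would presumably return 'hdfs://nameservice'
```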

Finally, system info:

{code}
In [12]: !python --version
Python 3.6.8

In [13]: !pip list
Package          Version
---------------- -------
backcall         0.1.0
decorator        4.4.1
ipython          7.12.0
ipython-genutils 0.2.0
jedi             0.16.0
joblib           0.14.1
lightgbm         2.3.1
numpy            1.18.1
parso            0.6.1
pexpect          4.8.0
pickleshare      0.7.5
pip              20.0.2
prompt-toolkit   3.0.3
ptyprocess       0.6.0
pyarrow          0.16.0
Pygments         2.5.2
scikit-learn     0.22.1
scipy            1.4.1
setuptools       45.1.0
six              1.14.0
traitlets        4.3.3
wcwidth          0.1.8
wheel            0.34.2
{code}



> [Python] New FS interface (pyarrow.fs) does not seem to work correctly for 
> HDFS (Python 3.6, pyarrow 0.16.0)
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-8240
>                 URL: https://issues.apache.org/jira/browse/ARROW-8240
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Yaqub Alwan
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
