[ https://issues.apache.org/jira/browse/ARROW-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357341#comment-16357341 ]
Wes McKinney commented on ARROW-2113:
-------------------------------------
I actually just remembered that we are setting that classpath from the output
of {{hadoop classpath}}, see:
https://github.com/apache/arrow/blob/master/python/pyarrow/hdfs.py#L116
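Roughly, the logic there looks like the following sketch (illustrative, not a
verbatim copy of hdfs.py):

    import os
    import subprocess

    def _maybe_set_hadoop_classpath():
        # If CLASSPATH already mentions "hadoop", assume the user configured it.
        if 'hadoop' in os.environ.get('CLASSPATH', ''):
            return
        # Otherwise derive it from the hadoop CLI; --glob expands wildcard
        # entries into concrete JAR paths.
        classpath = subprocess.check_output(['hadoop', 'classpath', '--glob'])
        os.environ['CLASSPATH'] = classpath.decode('utf-8')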
So the reason this is failing in the first instance is that {{hadoop}} appears
in the CLASSPATH you set, so pyarrow assumes the Hadoop JARs are already on it;
in the second instance it does not, so pyarrow sets the correct classpath
itself. Either way, the CLASSPATH you have set does not appear to contain the
requisite JAR files.
It seems we should be more specific about detecting that the Hadoop JARs are in
the path. I will open a new bug report about this.
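A more specific check could, for example, look for actual Hadoop JAR names
rather than the substring "hadoop" (a hypothetical sketch, not what the code
currently does):

    import glob
    import os

    def _classpath_has_hadoop_jars(classpath):
        # Expand each entry (including wildcards) and look for a real Hadoop
        # JAR such as hadoop-common-*.jar, instead of matching "hadoop"
        # anywhere in the path string.
        for entry in classpath.split(os.pathsep):
            for path in glob.glob(entry):
                if os.path.basename(path).startswith('hadoop-common'):
                    return True
        return False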
> [Python] Connect to hdfs failing with "pyarrow.lib.ArrowIOError: HDFS
> connection failed"
> ----------------------------------------------------------------------------------------
>
> Key: ARROW-2113
> URL: https://issues.apache.org/jira/browse/ARROW-2113
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.8.0
> Environment: Linux Redhat 7.4, Anaconda 4.4.7, Python 2.7.12, CDH
> 5.13.1
> Reporter: Michal Danko
> Priority: Major
>
> Steps to replicate the issue:
> mkdir /tmp/test
> cd /tmp/test
> mkdir jars
> cd jars
> touch test1.jar
> mkdir -p ../lib/zookeeper
> cd ../lib/zookeeper
> ln -s ../../jars/test1.jar ./test1.jar
> ln -s test1.jar test.jar
> mkdir -p ../hadoop/lib
> cd ../hadoop/lib
> ln -s ../../../lib/zookeeper/test.jar ./test.jar
> (This part depends on your configuration; you need these values for
> pyarrow.hdfs to work:)
> (path to libjvm:)
> (export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera)
> (path to libhdfs:)
> (export
> LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib64/)
> export CLASSPATH="/tmp/test/lib/hadoop/lib/test.jar"
> python
> import pyarrow.hdfs as hdfs
> fs = hdfs.connect(user="hdfs")
>
> Ends with error:
> ------------
> loadFileSystems error:
> (unable to get root cause for java.lang.NoClassDefFoundError)
> (unable to get stack trace for java.lang.NoClassDefFoundError)
> hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0,
> kerbTicketCachePath=(NULL), userName=pa) error:
> (unable to get root cause for java.lang.NoClassDefFoundError)
> (unable to get stack trace for java.lang.NoClassDefFoundError)
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line
> 170, in connect
> kerb_ticket=kerb_ticket, driver=driver)
> File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line
> 37, in __init__
> self._connect(host, port, user, kerb_ticket, driver)
> File "pyarrow/io-hdfs.pxi", line 87, in
> pyarrow.lib.HadoopFileSystem._connect
> (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:61673)
> File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status
> (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8345)
> pyarrow.lib.ArrowIOError: HDFS connection failed
> -------------
>
> export CLASSPATH="/tmp/test/lib/zookeeper/test.jar"
> python
> import pyarrow.hdfs as hdfs
> fs = hdfs.connect(user="hdfs")
>
> Works properly.
>
> I can't find a reason why the first CLASSPATH doesn't work and the second one
> does, because both point to the same .jar, just with an extra symlink in the
> chain. To me, it looks like pyarrow.lib.check_status has a problem with
> symlinks defined with many ../../.. segments.
> I would expect pyarrow to work with any definition of the path to the .jar.
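> (For illustration: given the symlink tree created above, both CLASSPATH
> values resolve to the same physical file, which is why the differing
> behaviour is surprising. A quick check with the standard library:)
>
>     import os.path
>     # Both settings point at the same JAR once symlinks are resolved;
>     # each prints /tmp/test/jars/test1.jar.
>     print(os.path.realpath("/tmp/test/lib/hadoop/lib/test.jar"))
>     print(os.path.realpath("/tmp/test/lib/zookeeper/test.jar"))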
> Please note that the paths are not made up at random; they are copied from
> the Cloudera distribution of Hadoop (the original file was zookeeper.jar).
> Because of this issue, our customer currently can't use the pyarrow lib for
> Oozie workflows.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)