[
https://issues.apache.org/jira/browse/ARROW-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wes McKinney updated ARROW-2113:
--------------------------------
Fix Version/s: (was: 0.9.0)
0.10.0
> [Python] Incomplete CLASSPATH with "hadoop" contained in it can fool the
> classpath setting HDFS logic
> -----------------------------------------------------------------------------------------------------
>
> Key: ARROW-2113
> URL: https://issues.apache.org/jira/browse/ARROW-2113
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.8.0
> Environment: Linux Redhat 7.4, Anaconda 4.4.7, Python 2.7.12, CDH
> 5.13.1
> Reporter: Michal Danko
> Priority: Major
> Fix For: 0.10.0
>
>
> Steps to replicate the issue:
> mkdir /tmp/test
> cd /tmp/test
> mkdir jars
> cd jars
> touch test1.jar
> mkdir -p ../lib/zookeeper
> cd ../lib/zookeeper
> ln -s ../../jars/test1.jar ./test1.jar
> ln -s test1.jar test.jar
> mkdir -p ../hadoop/lib
> cd ../hadoop/lib
> ln -s ../../../lib/zookeeper/test.jar ./test.jar
> (This part depends on your configuration; pyarrow.hdfs needs the
> following values set in order to work:)
> (path to libjvm:)
> export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
> (path to libhdfs:)
> export
> LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib64/
> export CLASSPATH="/tmp/test/lib/hadoop/lib/test.jar"
> python
> import pyarrow.hdfs as hdfs
> fs = hdfs.connect(user="hdfs")
>
> Ends with error:
> ------------
> loadFileSystems error:
> (unable to get root cause for java.lang.NoClassDefFoundError)
> (unable to get stack trace for java.lang.NoClassDefFoundError)
> hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0,
> kerbTicketCachePath=(NULL), userName=pa) error:
> (unable to get root cause for java.lang.NoClassDefFoundError)
> (unable to get stack trace for java.lang.NoClassDefFoundError)
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line
> 170, in connect
> kerb_ticket=kerb_ticket, driver=driver)
> File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line
> 37, in __init__
> self._connect(host, port, user, kerb_ticket, driver)
> File "pyarrow/io-hdfs.pxi", line 87, in
> pyarrow.lib.HadoopFileSystem._connect
> (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:61673)
> File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status
> (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8345)
> pyarrow.lib.ArrowIOError: HDFS connection failed
> -------------
>
> export CLASSPATH="/tmp/test/lib/zookeeper/test.jar"
> python
> import pyarrow.hdfs as hdfs
> fs = hdfs.connect(user="hdfs")
>
> Works properly.
>
> I can't find a reason why the first CLASSPATH fails while the second one
> works: both point to the same .jar, only with an extra symlink in the
> path. It looks like pyarrow's classpath-setting HDFS logic has a problem
> with symlinks defined through several ../ levels.
> I would expect pyarrow to work with any valid path to the .jar.
> Please note that these paths are not made up; they are copied from the
> Cloudera distribution of Hadoop (the original file was zookeeper.jar).
> Because of this issue, our customer currently can't use the pyarrow
> library for Oozie workflows.
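The reporter's claim that both CLASSPATH values name the same jar can be checked with a small standalone sketch (not part of the original report): it rebuilds the symlink chain from the reproduction steps under a temporary directory instead of /tmp/test and resolves both paths.

```python
# Hypothetical sketch: recreate the symlink chain from the reproduction
# steps and show that the failing and working CLASSPATH entries resolve
# to the same physical jar, so only the symlink route differs.
import os
import shutil
import tempfile

root = tempfile.mkdtemp()  # stands in for /tmp/test
os.makedirs(os.path.join(root, "jars"))
open(os.path.join(root, "jars", "test1.jar"), "w").close()

os.makedirs(os.path.join(root, "lib", "zookeeper"))
os.symlink("../../jars/test1.jar",
           os.path.join(root, "lib", "zookeeper", "test1.jar"))
os.symlink("test1.jar",
           os.path.join(root, "lib", "zookeeper", "test.jar"))

os.makedirs(os.path.join(root, "lib", "hadoop", "lib"))
os.symlink("../../../lib/zookeeper/test.jar",
           os.path.join(root, "lib", "hadoop", "lib", "test.jar"))

# The CLASSPATH that fails vs. the one that works in the report.
failing = os.path.join(root, "lib", "hadoop", "lib", "test.jar")
working = os.path.join(root, "lib", "zookeeper", "test.jar")

resolved_failing = os.path.realpath(failing)
resolved_working = os.path.realpath(working)
shutil.rmtree(root)

print(resolved_failing == resolved_working)  # → True: same jar either way
```

Since both entries resolve to jars/test1.jar, any difference in behavior must come from how the classpath-setting logic inspects the (unresolved) path string.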
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)