[
https://issues.apache.org/jira/browse/ARROW-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954440#comment-15954440
]
Benjamin Zaitlen commented on ARROW-762:
----------------------------------------
Apologies, I was sidetracked by some other work.
Running on CentOS 7.2 and HDP 2.4.3, the following worked for me:
1. export ARROW_LIBHDFS_DIR=/usr/hdp/2.4.3.0-227/usr/lib/
2. export CLASSPATH=$CLASSPATH:`hdfs classpath --glob`
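Put together as a script, the two steps look like this (the HDP path is specific to my 2.4.3.0-227 install, so adjust it for your distribution; the second step assumes the `hdfs` CLI is on `PATH`):

```shell
# Point Arrow at the native libhdfs shipped with HDP
# (this path is install-specific; yours will differ on CDH/MapR).
export ARROW_LIBHDFS_DIR=/usr/hdp/2.4.3.0-227/usr/lib/

# Add the Hadoop jars to the JVM classpath. The --glob flag expands
# wildcard entries so libhdfs can load the individual jars.
if command -v hdfs >/dev/null 2>&1; then
    export CLASSPATH="$CLASSPATH:$(hdfs classpath --glob)"
fi
```

Both exports need to be in place before Python starts, since pyarrow reads them when it loads libhdfs.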
I think the libhdfs searching in Arrow is fairly exhaustive, but docs pointing
to locations common to CDH/HDP/MapR would probably be helpful. I was only able
to figure out step 2 thanks to your note about checking what TensorFlow does;
I eventually came across this page: https://www.tensorflow.org/deploy/hadoop .
Something similar, or a link in the Arrow docs, would also be helpful.
Also, I'm happy to help add to the docs, but if it's easy for you, please go
ahead. I'll leave it to you to close the issue.
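For anyone landing here later, a minimal sketch of the working flow in Python. The environment variables must be set before pyarrow loads libhdfs; the host, ticket path, and user below are taken from the report and are placeholders for your own cluster, so the connect call is shown commented out (it requires a reachable kerberized HDFS):

```python
import os

# HDP-specific path from my install; adjust for your distribution.
os.environ["ARROW_LIBHDFS_DIR"] = "/usr/hdp/2.4.3.0-227/usr/lib/"

# CLASSPATH must also already contain the output of `hdfs classpath --glob`
# (easiest to export it in the shell before launching Python).

# import pyarrow
# hdfs = pyarrow.HdfsClient(host='ip-172-31-53-87.ec2.internal', port=8020,
#                           user='centos', kerb_ticket='/tmp/krb5cc_1000',
#                           driver='libhdfs')
```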
> Kerberos Problem with PyArrow
> -----------------------------
>
> Key: ARROW-762
> URL: https://issues.apache.org/jira/browse/ARROW-762
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.2.0
> Environment: CentOS 7.2, HDP 2.4.3
> Reporter: Benjamin Zaitlen
>
> I'm having trouble using pyarrow with kerberos. I'm trying to connect to
> HDFS with the following signature:
> ```
> hdfs = HdfsClient(host='ip-172-31-53-87.ec2.internal', port=8020,
>                   kerb_ticket='/tmp/krb5cc_1000', driver='libhdfs3',
>                   user='centos')
>
> ArrowException                            Traceback (most recent call last)
> <ipython-input-2-15087f93c239> in <module>()
> ----> 1 hdfs = HdfsClient(host='ip-172-31-53-87.ec2.internal', port=8020,
>                           kerb_ticket='/tmp/krb5cc_1000', driver='libhdfs3',
>                           user='centos')
>
> /home/centos/miniconda3/envs/hdfs_test/lib/python3.5/site-packages/pyarrow/filesystem.py in __init__(self, host, port, user, kerb_ticket, driver)
>     168     def __init__(self, host="default", port=0, user=None, kerb_ticket=None,
>     169                  driver='libhdfs'):
> --> 170         self._connect(host, port, user, kerb_ticket, driver)
>     171
>     172     @implements(Filesystem.isdir)
>
> /home/centos/miniconda3/envs/hdfs_test/lib/python3.5/site-packages/pyarrow/io.pyx in pyarrow.io._HdfsClient._connect (/feedstock_root/build_artefacts/pyarrow_1488727736041/work/arrow-f6924ad83bc95741f003830892ad4815ca3b70fd/python/build/temp.linux-x86_64-3.5/io.cxx:11090)()
>
> /home/centos/miniconda3/envs/hdfs_test/lib/python3.5/site-packages/pyarrow/error.pyx in pyarrow.error.check_status (/feedstock_root/build_artefacts/pyarrow_1488727736041/work/arrow-f6924ad83bc95741f003830892ad4815ca3b70fd/python/build/temp.linux-x86_64-3.5/error.cxx:1197)()
>
> ArrowException: IOError: HDFS connection failed
> ArrowException: IOError: HDFS connection failed
> ```
> Below shows a valid ticket:
> ```
> [centos@ip-172-31-61-224 usr]$ klist
> Ticket cache: FILE:/tmp/krb5cc_1000
> Default principal: centos@DOMAIN
> Valid starting       Expires              Service principal
> 04/03/2017 14:36:38  04/04/2017 14:36:38  krbtgt/DOMAIN@DOMAIN
> 	renew until 04/10/2017 14:36:38
> ```
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)