[https://issues.apache.org/jira/browse/ARROW-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167943#comment-17167943]
Michael Peleshenko commented on ARROW-5236:
-------------------------------------------
I've been having trouble connecting to HDFS even with the pyarrow 1.0.0 build,
as I run into the error below when running:
{code:python}
pa.hdfs.connect(host="host", port=port, user="user", kerb_ticket="kerb_ticket")
{code}
{noformat}
  File "C:\ProgramData\Continuum\Anaconda\envs\pyarrow-test\lib\site-packages\pyarrow\hdfs.py", line 210, in connect
    extra_conf=extra_conf)
  File "C:\ProgramData\Continuum\Anaconda\envs\pyarrow-test\lib\site-packages\pyarrow\hdfs.py", line 40, in __init__
    self._connect(host, port, user, kerb_ticket, extra_conf)
  File "pyarrow\io-hdfs.pxi", line 75, in pyarrow.lib.HadoopFileSystem._connect
  File "pyarrow\error.pxi", line 99, in pyarrow.lib.check_status
OSError: Unable to load libjvm: The specified module could not be found.
{noformat}
I tried the workaround mentioned
[here|https://issues.apache.org/jira/browse/ARROW-5236?focusedCommentId=17106888&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17106888]
and got it working by copying jvm.dll to %JAVA_HOME%\lib\server\libjvm.so.
It seems the logic that locates libjvm follows a Linux-style path even on Windows.
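For anyone else hitting this, the copy step can be scripted. A minimal sketch, assuming a JDK 8 layout where jvm.dll lives under jre\bin\server (adjust the source path if your JDK places it elsewhere):
{code:python}
# Workaround sketch: copy the Windows jvm.dll to the Linux-style location
# that pyarrow's libjvm lookup currently expects.
import os
import shutil

java_home = os.environ["JAVA_HOME"]  # e.g. C:\Program Files\Java\jdk1.8.0_141
src = os.path.join(java_home, "jre", "bin", "server", "jvm.dll")
dst_dir = os.path.join(java_home, "lib", "server")
os.makedirs(dst_dir, exist_ok=True)
shutil.copy2(src, os.path.join(dst_dir, "libjvm.so"))
{code}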
Looking into the Arrow internals, I came across this:
https://github.com/apache/arrow/blob/b0d623957db820de4f1ff0a5ebd3e888194a48f0/cpp/src/arrow/io/hdfs_internal.cc#L176-L180
This looks like the same issue observed in ARROW-1003, except that one was for
libhdfs. In my case, libhdfs is found as hdfs.dll as expected, so the Windows
logic is definitely followed there:
https://github.com/apache/arrow/blob/b0d623957db820de4f1ff0a5ebd3e888194a48f0/cpp/src/arrow/io/hdfs_internal.cc#L144-L145
I suspect a similar fix is needed here, changing `__WIN32` to `_WIN32`.
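For context, MSVC predefines `_WIN32` but not `__WIN32`, so under MSVC a `__WIN32` guard never matches and the libjvm lookup falls through to the Linux-style logic. Assuming the linked block is guarded by an `#ifdef`, the fix would be a one-character change along these lines:
{noformat}
-#ifdef __WIN32
+#ifdef _WIN32
{noformat}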
> [Python] hdfs.connect() is trying to load libjvm in windows
> -----------------------------------------------------------
>
> Key: ARROW-5236
> URL: https://issues.apache.org/jira/browse/ARROW-5236
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Environment: Windows 7 Enterprise, pyarrow 0.13.0
> Reporter: Kamaraju
> Priority: Major
> Labels: hdfs
>
> This issue was originally reported at
> [https://github.com/apache/arrow/issues/4215] . Raising a Jira as per Wes
> McKinney's request.
> Summary:
> The following script
> {code}
> $ cat expt2.py
> import pyarrow as pa
> fs = pa.hdfs.connect()
> {code}
> tries to load libjvm on Windows 7, which is not expected.
> {noformat}
> $ python ./expt2.py
> Traceback (most recent call last):
>   File "./expt2.py", line 3, in <module>
>     fs = pa.hdfs.connect()
>   File "C:\ProgramData\Continuum\Anaconda\envs\scratch_py36_pyarrow\lib\site-packages\pyarrow\hdfs.py", line 183, in connect
>     extra_conf=extra_conf)
>   File "C:\ProgramData\Continuum\Anaconda\envs\scratch_py36_pyarrow\lib\site-packages\pyarrow\hdfs.py", line 37, in __init__
>     self._connect(host, port, user, kerb_ticket, driver, extra_conf)
>   File "pyarrow\io-hdfs.pxi", line 89, in pyarrow.lib.HadoopFileSystem._connect
>   File "pyarrow\error.pxi", line 83, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Unable to load libjvm
> {noformat}
> There is no libjvm file in a Windows Java installation.
> {noformat}
> $ echo $JAVA_HOME
> C:\Progra~1\Java\jdk1.8.0_141
> $ find $JAVA_HOME -iname '*libjvm*'
> <returns nothing.>
> {noformat}
> I see the libjvm error with both 0.11.1 and 0.13.0 versions of pyarrow.
> Steps to reproduce the issue (with more details):
> Create the environment
> {noformat}
> $ cat scratch_py36_pyarrow.yml
> name: scratch_py36_pyarrow
> channels:
> - defaults
> dependencies:
> - python=3.6.8
> - pyarrow
> {noformat}
> {noformat}
> $ conda env create -f scratch_py36_pyarrow.yml
> {noformat}
> Apply the following patch to lib/site-packages/pyarrow/hdfs.py. I had to do
> this since the Hadoop installation that comes with the MapR ([https://mapr.com/])
> Windows client only has $HADOOP_HOME/bin/hadoop.cmd; there is no file named
> $HADOOP_HOME/bin/hadoop, so the subsequent subprocess.check_output call
> fails with FileNotFoundError if this patch is not applied. (A platform-aware
> variant is sketched after the patch below.)
> {noformat}
> $ cat ~/x/patch.txt
> 131c131
> < hadoop_bin = '{0}/bin/hadoop'.format(os.environ['HADOOP_HOME'])
> ---
> > hadoop_bin = '{0}/bin/hadoop.cmd'.format(os.environ['HADOOP_HOME'])
> $ patch /c/ProgramData/Continuum/Anaconda/envs/scratch_py36_pyarrow/lib/site-packages/pyarrow/hdfs.py ~/x/patch.txt
> patching file /c/ProgramData/Continuum/Anaconda/envs/scratch_py36_pyarrow/lib/site-packages/pyarrow/hdfs.py
> {noformat}
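> A platform-aware variant of line 131 would avoid the need for this patch. A
> minimal sketch (assuming only that the launcher name differs by platform;
> this is not pyarrow's actual code):
> {code}
> import os
> import sys
>
> # Sketch: pick the Windows batch launcher on Windows, the shell script elsewhere.
> ext = '.cmd' if sys.platform == 'win32' else ''
> hadoop_bin = '{0}/bin/hadoop{1}'.format(os.environ['HADOOP_HOME'], ext)
> {code}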
> Activate the environment
> {noformat}
> $ source activate scratch_py36_pyarrow
> {noformat}
> Sample script
> {noformat}
> $ cat expt2.py
> import pyarrow as pa
> fs = pa.hdfs.connect()
> {noformat}
> Execute the script
> {noformat}
> $ python ./expt2.py
> Traceback (most recent call last):
>   File "./expt2.py", line 3, in <module>
>     fs = pa.hdfs.connect()
>   File "C:\ProgramData\Continuum\Anaconda\envs\scratch_py36_pyarrow\lib\site-packages\pyarrow\hdfs.py", line 183, in connect
>     extra_conf=extra_conf)
>   File "C:\ProgramData\Continuum\Anaconda\envs\scratch_py36_pyarrow\lib\site-packages\pyarrow\hdfs.py", line 37, in __init__
>     self._connect(host, port, user, kerb_ticket, driver, extra_conf)
>   File "pyarrow\io-hdfs.pxi", line 89, in pyarrow.lib.HadoopFileSystem._connect
>   File "pyarrow\error.pxi", line 83, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Unable to load libjvm
> {noformat}