casperhart commented on issue #45369:
URL: https://github.com/apache/arrow/issues/45369#issuecomment-2696010277
The Docker setup I used is fairly complex, so I tried to simplify it down to
this:
```
ARG TARGET_PLATFORM=linux/arm64
FROM --platform=${TARGET_PLATFORM} ubuntu:22.04
ENV ARCH=arm64
ENV DEBIAN_FRONTEND=noninteractive
# Set environment variables
ENV HADOOP_VERSION=3.4.1
ENV HADOOP_HOME=/opt/hadoop
ENV HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
ENV PATH=${PATH}:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin
# Install dependencies
RUN apt-get update && apt-get install -y \
    openjdk-11-jdk \
    wget \
    ssh \
    pdsh \
    python3 \
    python3-pip \
    python3-dev \
    build-essential
# Install PyArrow and other Python packages
RUN pip install --no-cache-dir pyarrow
# Download and set up Hadoop
RUN wget https://downloads.apache.org/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz \
    && tar -xzf hadoop-${HADOOP_VERSION}.tar.gz \
    && mv hadoop-${HADOOP_VERSION} ${HADOOP_HOME} \
    && rm hadoop-${HADOOP_VERSION}.tar.gz
# Set up JAVA_HOME in Hadoop config
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-$ARCH
# make sure libhdfs.so is in the right place
RUN if [ ! -f /opt/hadoop/lib/native/libhdfs.so ]; then \
        echo "ERROR: libhdfs.so not found!" && \
        exit 1; \
    fi
RUN python3 -c "\
    import os;\
    path='/opt/hadoop/lib/native/libhdfs.so';\
    print('path ', path, 'exists: ', os.path.exists(path));\
    import pyarrow.fs as fs;\
    hdfs = fs.HadoopFileSystem('0.0.0.0')"
```
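One thing to note: the existence check in the Dockerfile passes even though the native libraries in the default hadoop-3.4.1.tar.gz are built for x86_64, not aarch64. A quick way to confirm the mismatch inside the container would be something like the following (sketch only; the `file` package is not installed by the Dockerfile above):
```
# Hypothetical extra check, not part of my original setup: print the target
# architecture of the bundled library. On an arm64 image this should report
# "ARM aarch64"; the x86_64 tarball's libhdfs.so reports "x86-64" instead.
RUN apt-get install -y file \
    && file ${HADOOP_HOME}/lib/native/libhdfs.so
```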
I know the actual issue is not with Arrow but with installing the
non-aarch64 Hadoop build; it's just that the error message from Arrow is
misleading. I'll try building Hadoop from source on the Mac, but I don't have
high hopes.
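If an aarch64 binary tarball is published for this release (recent Hadoop releases have shipped artifacts named hadoop-<version>-aarch64.tar.gz, but I haven't verified 3.4.1 specifically), swapping the download line would be a simpler alternative to building from source. Sketch only; the artifact name below is an assumption to check against downloads.apache.org:
```
# Hypothetical replacement for the wget step above; verify that this aarch64
# artifact actually exists for the chosen HADOOP_VERSION before relying on it.
RUN wget https://downloads.apache.org/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}-aarch64.tar.gz \
    && tar -xzf hadoop-${HADOOP_VERSION}-aarch64.tar.gz \
    && mv hadoop-${HADOOP_VERSION} ${HADOOP_HOME} \
    && rm hadoop-${HADOOP_VERSION}-aarch64.tar.gz
```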