This is an automated email from the ASF dual-hosted git repository.
karan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git
The following commit(s) were added to refs/heads/master by this push:
new 2396ca88189 Fix hadoop3 ingestion example (#17921)
2396ca88189 is described below
commit 2396ca881899cf4fc1f8311d7090ce8948def0d8
Author: Zoltan Haindrich <[email protected]>
AuthorDate: Tue Apr 15 17:32:15 2025 +0200
Fix hadoop3 ingestion example (#17921)
* updates to docs/dockerfile/etc
* spellfix
---
docs/tutorials/tutorial-batch-hadoop.md | 75 +++++++++-------------
.../quickstart/tutorial/hadoop/docker/Dockerfile | 29 +++++----
.../quickstart/tutorial/hadoop/docker/bootstrap.sh | 14 ++++
.../tutorial/wikipedia-index-hadoop3.json | 3 +-
4 files changed, 63 insertions(+), 58 deletions(-)
diff --git a/docs/tutorials/tutorial-batch-hadoop.md b/docs/tutorials/tutorial-batch-hadoop.md
index a0575036004..a71823544af 100644
--- a/docs/tutorials/tutorial-batch-hadoop.md
+++ b/docs/tutorials/tutorial-batch-hadoop.md
@@ -61,7 +61,6 @@ Let's create some folders under `/tmp`, we will use these later when starting th
```bash
mkdir -p /tmp/shared
-mkdir -p /tmp/shared/hadoop_xml
```
### Configure /etc/hosts
@@ -77,24 +76,27 @@ On the host machine, add the following entry to `/etc/hosts`:
Once the `/tmp/shared` folder has been created and the `etc/hosts` entry has been added, run the following command to start the Hadoop container.
```bash
-docker run -it -h druid-hadoop-demo --name druid-hadoop-demo -p 2049:2049 -p 2122:2122 -p 8020:8020 -p 8021:8021 -p 8030:8030 -p 8031:8031 -p 8032:8032 -p 8033:8033 -p 8040:8040 -p 8042:8042 -p 8088:8088 -p 8443:8443 -p 9000:9000 -p 10020:10020 -p 19888:19888 -p 34455:34455 -p 49707:49707 -p 50010:50010 -p 50020:50020 -p 50030:50030 -p 50060:50060 -p 50070:50070 -p 50075:50075 -p 50090:50090 -p 51111:51111 -v /tmp/shared:/shared druid-hadoop-demo:3.3.6 /etc/bootstrap.sh -bash
+docker run -it -h druid-hadoop-demo --name druid-hadoop-demo -p 2049:2049 -p 2122:2122 -p 8020-8042:8020-8042 -p 8088:8088 -p 8443:8443 -p 9000:9000 -p 9820:9820 -p 9860-9880:9860-9880 -p 10020:10020 -p 19888:19888 -p 34455:34455 -p 49707:49707 -p 50010:50010 -p 50020:50020 -p 50030:50030 -p 50060:50060 -p 50070:50070 -p 50075:50075 -p 50090:50090 -p 51111:51111 -v /tmp/shared:/shared druid-hadoop-demo:3.3.6 /etc/bootstrap.sh -bash
```
Once the container is started, your terminal will attach to a bash shell running inside the container:
```bash
-Starting sshd: [ OK ]
-18/07/26 17:27:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [druid-hadoop-demo]
-druid-hadoop-demo: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-druid-hadoop-demo.out
-localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-druid-hadoop-demo.out
-Starting secondary namenodes [0.0.0.0]
-0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-druid-hadoop-demo.out
-18/07/26 17:27:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-starting yarn daemons
-starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-druid-hadoop-demo.out
-localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-druid-hadoop-demo.out
-starting historyserver, logging to /usr/local/hadoop/logs/mapred--historyserver-druid-hadoop-demo.out
+Starting datanodes
+Starting secondary namenodes [druid-hadoop-demo]
+WARNING: YARN_CONF_DIR has been replaced by HADOOP_CONF_DIR. Using value of YARN_CONF_DIR.
+WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
+Starting resourcemanager
+WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
+Starting nodemanagers
+WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
+localhost: WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
+WARNING: YARN_CONF_DIR has been replaced by HADOOP_CONF_DIR. Using value of YARN_CONF_DIR.
+WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
+WARNING: Use of this script to start the MR JobHistory daemon is deprecated.
+WARNING: Attempting to execute replacement "mapred --daemon start" instead.
+ * initialize hdfs for first run
bash-4.1#
```
@@ -108,39 +110,18 @@ To open another shell to the Hadoop container, run the following command:
docker exec -it druid-hadoop-demo bash
```
-### Copy input data to the Hadoop container
+### Test data
-From the `apache-druid-{{DRUIDVERSION}}` package root on the host, copy the `quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz` sample data to the shared folder:
-
-```bash
-cp quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz /tmp/shared/wikiticker-2015-09-12-sampled.json.gz
-```
-
-### Setup HDFS directories
-
-In the Hadoop container's shell, run the following commands to setup the HDFS directories needed by this tutorial and copy the input data to HDFS.
-
-```bash
-cd /usr/local/hadoop/bin
-./hdfs dfs -mkdir /druid
-./hdfs dfs -mkdir /druid/segments
-./hdfs dfs -mkdir /quickstart
-./hdfs dfs -mkdir /user
-./hdfs dfs -chmod 777 /druid
-./hdfs dfs -chmod 777 /druid/segments
-./hdfs dfs -chmod 777 /quickstart
-./hdfs dfs -chmod -R 777 /tmp
-./hdfs dfs -chmod -R 777 /user
-./hdfs dfs -put /shared/wikiticker-2015-09-12-sampled.json.gz /quickstart/wikiticker-2015-09-12-sampled.json.gz
-```
-
-If you encounter namenode errors when running this command, the Hadoop container may not be finished initializing. When this occurs, wait a couple of minutes and retry the commands.
+The startup script `bootstrap.sh`:
+* creates the necessary directories
+* loads an input file to HDFS
+* places the hadoop configuration into the shared volume as `hadoop-conf.tgz`
## Configure Druid to use Hadoop
Some additional steps are needed to configure the Druid cluster for Hadoop batch indexing.
-### Copy Hadoop configuration to Druid classpath
+### Provide Hadoop configuration for Druid
From the Hadoop container's shell, run the following command to copy the Hadoop .xml configuration files to the shared folder:
@@ -151,13 +132,14 @@ cp /usr/local/hadoop/etc/hadoop/*.xml /shared/hadoop_xml
From the host machine, run the following, where `PATH_TO_DRUID` is replaced by the path to the Druid package.
```bash
-mkdir -p PATH_TO_DRUID/conf/druid/single-server/micro-quickstart/_common/hadoop-xml
-cp /tmp/shared/hadoop_xml/*.xml PATH_TO_DRUID/conf/druid/single-server/micro-quickstart/_common/hadoop-xml/
+cd $PATH_TO_DRUID
+mkdir -p conf/druid/single-server/micro-quickstart/_common/hadoop-xml
+tar xzf /tmp/shared/hadoop-conf.tgz -C conf/druid/single-server/micro-quickstart/_common/hadoop-xml
```
### Update Druid segment and log storage
-In your favorite text editor, open `conf/druid/auto/_common/common.runtime.properties`, and make the following edits:
+In your favorite text editor, open `conf/druid/single-server/micro-quickstart/_common/common.runtime.properties`, and make the following edits:
#### Disable local deep storage and enable HDFS deep storage
@@ -197,7 +179,10 @@ druid.indexer.logs.directory=/druid/indexing-logs
Once the Hadoop .xml files have been copied to the Druid cluster and the segment/log storage configuration has been updated to use HDFS, the Druid cluster needs to be restarted for the new configurations to take effect.
-If the cluster is still running, CTRL-C to terminate the `bin/start-druid` script, and re-run it to bring the Druid services back up.
+If the cluster is still running, use CTRL-C to terminate the `bin/start-druid` script, then start it again with:
+```
+bin/start-druid -c conf/druid/single-server/micro-quickstart
+```
## Load batch data
@@ -222,7 +207,7 @@ This tutorial is only meant to be used together with the [query tutorial](../tut
If you wish to go through any of the other tutorials, you will need to:
* Shut down the cluster and reset the cluster state by removing the contents of the `var` directory under the druid package.
-* Revert the deep storage and task storage config back to local types in `conf/druid/auto/_common/common.runtime.properties`
+* Revert the deep storage and task storage config back to local types in `conf/druid/single-server/micro-quickstart/_common/common.runtime.properties`
* Restart the cluster
This is necessary because the other ingestion tutorials will write to the same "wikipedia" datasource, and later tutorials expect the cluster to use local deep storage.
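
The config hand-off this commit introduces — bootstrap.sh packs `$HADOOP_HOME/etc/hadoop` into `hadoop-conf.tgz` on the shared volume, and the tutorial unpacks it into Druid's `hadoop-xml` directory — is a plain `tar -C … -czf` / `tar xzf … -C` round trip. A minimal sketch, using throwaway temp directories as stand-ins for the real container and Druid paths:

```shell
#!/bin/sh
set -e
# Stand-ins: $src ~ $HADOOP_HOME/etc/hadoop (container side),
# $shared ~ /tmp/shared (shared volume), $dst ~ conf/druid/.../hadoop-xml (host side).
src=$(mktemp -d); shared=$(mktemp -d); dst=$(mktemp -d)
touch "$src/core-site.xml" "$src/hdfs-site.xml"

# Container side (bootstrap.sh): archive the *contents* of the conf directory.
tar -C "$src" -czf "$shared/hadoop-conf.tgz" .

# Host side (tutorial): extract straight into Druid's hadoop-xml directory.
tar xzf "$shared/hadoop-conf.tgz" -C "$dst"
ls "$dst"
```

Archiving with `-C "$src" .` keeps the paths inside the tarball relative, so extraction lands the .xml files directly in the target directory rather than under a nested folder.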
diff --git a/examples/quickstart/tutorial/hadoop/docker/Dockerfile b/examples/quickstart/tutorial/hadoop/docker/Dockerfile
index dd527781469..1a0ecab589d 100644
--- a/examples/quickstart/tutorial/hadoop/docker/Dockerfile
+++ b/examples/quickstart/tutorial/hadoop/docker/Dockerfile
@@ -53,26 +53,29 @@ RUN rpm --import http://repos.azulsystems.com/RPM-GPG-KEY-azulsystems && \
yum -q -y update && \
yum -q -y upgrade && \
yum -q -y install zulu17-jdk && \
+ yum -q -y install nano net-tools telnet less unzip wget && \
yum clean all && \
rm -rf /var/cache/yum zulu-repo_${ZULU_REPO_VER}.noarch.rpm
ENV JAVA_HOME=/usr/lib/jvm/zulu17
-ENV PATH $PATH:$JAVA_HOME/bin
+ENV PATH=$PATH:$JAVA_HOME/bin
# hadoop
ARG APACHE_ARCHIVE_MIRROR_HOST=https://downloads.apache.org
RUN curl -s ${APACHE_ARCHIVE_MIRROR_HOST}/hadoop/core/hadoop-3.3.6/hadoop-3.3.6.tar.gz | tar -xz -C /usr/local/
RUN cd /usr/local && ln -s ./hadoop-3.3.6 hadoop
-ENV HADOOP_HOME /usr/local/hadoop
-ENV HADOOP_COMMON_HOME /usr/local/hadoop
-ENV HADOOP_HDFS_HOME /usr/local/hadoop
-ENV HADOOP_MAPRED_HOME /usr/local/hadoop
-ENV HADOOP_YARN_HOME /usr/local/hadoop
-ENV HADOOP_CONF_DIR /usr/local/hadoop/etc/hadoop
-ENV YARN_CONF_DIR $HADOOP_HOME/etc/hadoop
+ENV HADOOP_HOME=/usr/local/hadoop
+ENV HADOOP_COMMON_HOME=/usr/local/hadoop
+ENV HADOOP_HDFS_HOME=/usr/local/hadoop
+ENV HADOOP_MAPRED_HOME=/usr/local/hadoop
+ENV HADOOP_YARN_HOME=/usr/local/hadoop
+ENV HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
+ENV YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
# in hadoop 3 the example file is nearly empty so we can just append stuff
+RUN cat << EOT >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh
+
RUN sed -i '$ a export JAVA_HOME=/usr/lib/jvm/zulu17' $HADOOP_HOME/etc/hadoop/hadoop-env.sh
RUN sed -i '$ a export HADOOP_HOME=/usr/local/hadoop' $HADOOP_HOME/etc/hadoop/hadoop-env.sh
RUN sed -i '$ a export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop/' $HADOOP_HOME/etc/hadoop/hadoop-env.sh
@@ -81,8 +84,7 @@ RUN sed -i '$ a export HDFS_DATANODE_USER=root' $HADOOP_HOME/etc/hadoop/hadoop-e
RUN sed -i '$ a export HDFS_SECONDARYNAMENODE_USER=root' $HADOOP_HOME/etc/hadoop/hadoop-env.sh
RUN sed -i '$ a export YARN_RESOURCEMANAGER_USER=root' $HADOOP_HOME/etc/hadoop/hadoop-env.sh
RUN sed -i '$ a export YARN_NODEMANAGER_USER=root' $HADOOP_HOME/etc/hadoop/hadoop-env.sh
-
-RUN cat $HADOOP_HOME/etc/hadoop/hadoop-env.sh
+RUN sed -i '$ a export YARN_OPTS+=" --add-opens=java.base/java.lang=ALL-UNNAMED"' $HADOOP_HOME/etc/hadoop/hadoop-env.sh
RUN mkdir $HADOOP_HOME/input
RUN cp $HADOOP_HOME/etc/hadoop/*.xml $HADOOP_HOME/input
@@ -108,11 +110,14 @@ RUN chown root:root /root/.ssh/config
#
# ADD supervisord.conf /etc/supervisord.conf
+RUN wget -nv https://github.com/bitnami/wait-for-port/releases/download/v1.0/wait-for-port.zip && unzip wait-for-port.zip && mv wait-for-port /usr/bin && rm wait-for-port.zip
+RUN wget -nv https://github.com/apache/druid/raw/refs/heads/33.0.0/examples/quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz
+
ADD bootstrap.sh /etc/bootstrap.sh
RUN chown root:root /etc/bootstrap.sh
RUN chmod 700 /etc/bootstrap.sh
-ENV BOOTSTRAP /etc/bootstrap.sh
+ENV BOOTSTRAP=/etc/bootstrap.sh
# workingaround docker.io build error
RUN ls -la /usr/local/hadoop/etc/hadoop/*-env.sh
@@ -145,4 +150,4 @@ EXPOSE 10020 19888
#Yarn ports
EXPOSE 8030 8031 8032 8033 8040 8042 8088
#Other ports
-EXPOSE 2122 49707
\ No newline at end of file
+EXPOSE 2122 49707
diff --git a/examples/quickstart/tutorial/hadoop/docker/bootstrap.sh b/examples/quickstart/tutorial/hadoop/docker/bootstrap.sh
index d1fa493d4ea..4bf3ba63bef 100644
--- a/examples/quickstart/tutorial/hadoop/docker/bootstrap.sh
+++ b/examples/quickstart/tutorial/hadoop/docker/bootstrap.sh
@@ -27,11 +27,25 @@ cd $HADOOP_HOME/share/hadoop/common ; for cp in ${ACP//,/ }; do echo == $cp; cu
sed s/HOSTNAME/$HOSTNAME/ /usr/local/hadoop/etc/hadoop/core-site.xml.template > /usr/local/hadoop/etc/hadoop/core-site.xml
+export PATH+=:$HADOOP_HOME/bin
+export PATH+=:$HADOOP_HOME/sbin
+
start_sshd
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
+
+if [ ! -e .inited ] ; then
+ echo " * initialize hdfs for first run"
+ wait-for-port 9000
+ hdfs dfs -mkdir -p /druid/segments /quickstart /user
+ hdfs dfs -chmod -R 777 /druid /quickstart /user /tmp
+  hdfs dfs -put wikiticker-2015-09-12-sampled.json.gz /quickstart/wikiticker-2015-09-12-sampled.json.gz
+ tar -C $HADOOP_HOME/etc/hadoop -czf /shared/hadoop-conf.tgz .
+ touch .inited
+fi
+
if [[ $1 == "-d" ]]; then
while true; do sleep 1000; done
fi
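
The `.inited` sentinel added to bootstrap.sh is a standard idempotent-init guard: the one-time work (HDFS mkdirs, the sample-data put, packing the conf archive) runs only on first start, and container restarts skip it. A standalone sketch of just the guard — the names here are illustrative, not from the repo:

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"        # throwaway working directory for the sentinel
init_count=0

bootstrap_once() {
  if [ ! -e .inited ]; then
    # Stands in for the expensive first-run HDFS setup in bootstrap.sh.
    init_count=$((init_count + 1))
    touch .inited
  fi
}

bootstrap_once   # first container start: initialization runs
bootstrap_once   # simulated restart: sentinel present, init skipped
echo "$init_count"   # prints 1
```

Keeping the sentinel in the container filesystem (as bootstrap.sh does) means a fresh `docker run` re-initializes, while restarting the same container does not.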
diff --git a/examples/quickstart/tutorial/wikipedia-index-hadoop3.json b/examples/quickstart/tutorial/wikipedia-index-hadoop3.json
index 28db64e3268..30b60ffbb15 100644
--- a/examples/quickstart/tutorial/wikipedia-index-hadoop3.json
+++ b/examples/quickstart/tutorial/wikipedia-index-hadoop3.json
@@ -72,9 +72,10 @@
"mapreduce.reduce.java.opts" : "-Duser.timezone=UTC -Dfile.encoding=UTF-8",
"mapreduce.map.memory.mb" : 1024,
"mapreduce.reduce.memory.mb" : 1024,
+        "yarn.app.mapreduce.am.command-opts" : "--add-opens=java.base/java.lang=ALL-UNNAMED",
"mapreduce.job.classloader" : "true"
}
}
},
-  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client-api:3.3.1","org.apache.hadoop:hadoop-client-runtime:3.3.1"]
+  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client-api:3.3.6","org.apache.hadoop:hadoop-client-runtime:3.3.6"]
}
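
The spec change above bumps `hadoopDependencyCoordinates` from 3.3.1 to 3.3.6 so it matches the Hadoop version the tutorial image runs. One way to sanity-check that kind of agreement is to grep the version out of the coordinates; a hedged sketch that writes a stub spec to a temp file rather than assuming the repo layout:

```shell
#!/bin/sh
set -e
# Stub standing in for examples/quickstart/tutorial/wikipedia-index-hadoop3.json
spec=$(mktemp)
cat > "$spec" <<'EOF'
{"hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client-api:3.3.6","org.apache.hadoop:hadoop-client-runtime:3.3.6"]}
EOF

# Pull the hadoop-client-api version out of the Maven coordinates.
version=$(grep -o 'hadoop-client-api:[0-9][0-9.]*' "$spec" | cut -d: -f2)
echo "$version"   # prints 3.3.6
```

Against a real checkout, the same grep could be pointed at the spec file and compared with the Hadoop version pinned in the Dockerfile.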
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]