This is an automated email from the ASF dual-hosted git repository.
karan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git
The following commit(s) were added to refs/heads/master by this push:
new 2396ca88189 Fix hadoop3 ingestion example (#17921)
2396ca88189 is described below
commit 2396ca881899cf4fc1f8311d7090ce8948def0d8
Author: Zoltan Haindrich <[email protected]>
AuthorDate: Tue Apr 15 17:32:15 2025 +0200
Fix hadoop3 ingestion example (#17921)
* updates to docs/dockerfile/etc
* spellfix
---
docs/tutorials/tutorial-batch-hadoop.md | 75 +++++++++-------------
.../quickstart/tutorial/hadoop/docker/Dockerfile | 29 +++++----
.../quickstart/tutorial/hadoop/docker/bootstrap.sh | 14 ++++
.../tutorial/wikipedia-index-hadoop3.json | 3 +-
4 files changed, 63 insertions(+), 58 deletions(-)
diff --git a/docs/tutorials/tutorial-batch-hadoop.md b/docs/tutorials/tutorial-batch-hadoop.md
index a0575036004..a71823544af 100644
--- a/docs/tutorials/tutorial-batch-hadoop.md
+++ b/docs/tutorials/tutorial-batch-hadoop.md
@@ -61,7 +61,6 @@ Let's create some folders under `/tmp`, we will use these later when starting th
```bash
mkdir -p /tmp/shared
-mkdir -p /tmp/shared/hadoop_xml
```
### Configure /etc/hosts
@@ -77,24 +76,27 @@ On the host machine, add the following entry to `/etc/hosts`:
Once the `/tmp/shared` folder has been created and the `etc/hosts` entry has been added, run the following command to start the Hadoop container.
```bash
-docker run -it -h druid-hadoop-demo --name druid-hadoop-demo -p 2049:2049 -p 2122:2122 -p 8020:8020 -p 8021:8021 -p 8030:8030 -p 8031:8031 -p 8032:8032 -p 8033:8033 -p 8040:8040 -p 8042:8042 -p 8088:8088 -p 8443:8443 -p 9000:9000 -p 10020:10020 -p 19888:19888 -p 34455:34455 -p 49707:49707 -p 50010:50010 -p 50020:50020 -p 50030:50030 -p 50060:50060 -p 50070:50070 -p 50075:50075 -p 50090:50090 -p 51111:51111 -v /tmp/shared:/shared druid-hadoop-demo:3.3.6 /etc/bootstrap.sh -bash
+docker run -it -h druid-hadoop-demo --name druid-hadoop-demo -p 2049:2049 -p 2122:2122 -p 8020-8042:8020-8042 -p 8088:8088 -p 8443:8443 -p 9000:9000 -p 9820:9820 -p 9860-9880:9860-9880 -p 10020:10020 -p 19888:19888 -p 34455:34455 -p 49707:49707 -p 50010:50010 -p 50020:50020 -p 50030:50030 -p 50060:50060 -p 50070:50070 -p 50075:50075 -p 50090:50090 -p 51111:51111 -v /tmp/shared:/shared druid-hadoop-demo:3.3.6 /etc/bootstrap.sh -bash
```
Once the container is started, your terminal will attach to a bash shell running inside the container:
```bash
-Starting sshd: [ OK ]
-18/07/26 17:27:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [druid-hadoop-demo]
-druid-hadoop-demo: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-druid-hadoop-demo.out
-localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-druid-hadoop-demo.out
-Starting secondary namenodes [0.0.0.0]
-0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-druid-hadoop-demo.out
-18/07/26 17:27:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-starting yarn daemons
-starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-druid-hadoop-demo.out
-localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-druid-hadoop-demo.out
-starting historyserver, logging to /usr/local/hadoop/logs/mapred--historyserver-druid-hadoop-demo.out
+Starting datanodes
+Starting secondary namenodes [druid-hadoop-demo]
+WARNING: YARN_CONF_DIR has been replaced by HADOOP_CONF_DIR. Using value of YARN_CONF_DIR.
+WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
+Starting resourcemanager
+WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
+Starting nodemanagers
+WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
+localhost: WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
+WARNING: YARN_CONF_DIR has been replaced by HADOOP_CONF_DIR. Using value of YARN_CONF_DIR.
+WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
+WARNING: Use of this script to start the MR JobHistory daemon is deprecated.
+WARNING: Attempting to execute replacement "mapred --daemon start" instead.
+ * initialize hdfs for first run
bash-4.1#
```
@@ -108,39 +110,18 @@ To open another shell to the Hadoop container, run the following command:
docker exec -it druid-hadoop-demo bash
```
-### Copy input data to the Hadoop container
+### Test data
-From the `apache-druid-{{DRUIDVERSION}}` package root on the host, copy the `quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz` sample data to the shared folder:
-
-```bash
-cp quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz /tmp/shared/wikiticker-2015-09-12-sampled.json.gz
-```
-
-### Setup HDFS directories
-
-In the Hadoop container's shell, run the following commands to setup the HDFS directories needed by this tutorial and copy the input data to HDFS.
-
-```bash
-cd /usr/local/hadoop/bin
-./hdfs dfs -mkdir /druid
-./hdfs dfs -mkdir /druid/segments
-./hdfs dfs -mkdir /quickstart
-./hdfs dfs -mkdir /user
-./hdfs dfs -chmod 777 /druid
-./hdfs dfs -chmod 777 /druid/segments
-./hdfs dfs -chmod 777 /quickstart
-./hdfs dfs -chmod -R 777 /tmp
-./hdfs dfs -chmod -R 777 /user
-./hdfs dfs -put /shared/wikiticker-2015-09-12-sampled.json.gz /quickstart/wikiticker-2015-09-12-sampled.json.gz
-```
-
-If you encounter namenode errors when running this command, the Hadoop container may not be finished initializing. When this occurs, wait a couple of minutes and retry the commands.
+The startup script `bootstrap.sh`:
+* creates the necessary directories
+* loads an input file to HDFS
+* places the hadoop configuration into the shared volume as `hadoop-conf.tgz`
## Configure Druid to use Hadoop
Some additional steps are needed to configure the Druid cluster for Hadoop batch indexing.
-### Copy Hadoop configuration to Druid classpath
+### Provide Hadoop configuration for Druid
From the Hadoop container's shell, run the following command to copy the Hadoop .xml configuration files to the shared folder:
@@ -151,13 +132,14 @@ cp /usr/local/hadoop/etc/hadoop/*.xml /shared/hadoop_xml
From the host machine, run the following, where `PATH_TO_DRUID` is replaced by the path to the Druid package.
```bash
-mkdir -p PATH_TO_DRUID/conf/druid/single-server/micro-quickstart/_common/hadoop-xml
-cp /tmp/shared/hadoop_xml/*.xml PATH_TO_DRUID/conf/druid/single-server/micro-quickstart/_common/hadoop-xml/
+cd $PATH_TO_DRUID
+mkdir -p conf/druid/single-server/micro-quickstart/_common/hadoop-xml
+tar xzf /tmp/shared/hadoop-conf.tgz -C conf/druid/single-server/micro-quickstart/_common/hadoop-xml
```
### Update Druid segment and log storage
-In your favorite text editor, open `conf/druid/auto/_common/common.runtime.properties`, and make the following edits:
+In your favorite text editor, open `conf/druid/single-server/micro-quickstart/_common/common.runtime.properties`, and make the following edits:
#### Disable local deep storage and enable HDFS deep storage
@@ -197,7 +179,10 @@ druid.indexer.logs.directory=/druid/indexing-logs
Once the Hadoop .xml files have been copied to the Druid cluster and the segment/log storage configuration has been updated to use HDFS, the Druid cluster needs to be restarted for the new configurations to take effect.
-If the cluster is still running, CTRL-C to terminate the `bin/start-druid` script, and re-run it to bring the Druid services back up.
+If the cluster is still running, use CTRL-C to terminate the `bin/start-druid` script, then start it again with:
+```
+bin/start-druid -c conf/druid/single-server/micro-quickstart
+```
## Load batch data
@@ -222,7 +207,7 @@ This tutorial is only meant to be used together with the [query tutorial](../tut
If you wish to go through any of the other tutorials, you will need to:
* Shut down the cluster and reset the cluster state by removing the contents of the `var` directory under the druid package.
-* Revert the deep storage and task storage config back to local types in `conf/druid/auto/_common/common.runtime.properties`
+* Revert the deep storage and task storage config back to local types in `conf/druid/single-server/micro-quickstart/_common/common.runtime.properties`
* Restart the cluster
This is necessary because the other ingestion tutorials will write to the same "wikipedia" datasource, and later tutorials expect the cluster to use local deep storage.
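
The config hand-off this commit introduces — bootstrap.sh packs `$HADOOP_HOME/etc/hadoop` into `hadoop-conf.tgz` on the shared volume, and the tutorial unpacks it into Druid's `hadoop-xml` directory — is a plain `tar -C … -czf` / `tar xzf … -C` round trip. A minimal sketch, using throwaway temp directories as stand-ins for the real container and Druid paths:

```shell
#!/bin/sh
set -e
# Stand-ins: $src ~ $HADOOP_HOME/etc/hadoop (container side),
# $shared ~ /tmp/shared (shared volume), $dst ~ conf/druid/.../hadoop-xml (host side).
src=$(mktemp -d); shared=$(mktemp -d); dst=$(mktemp -d)
touch "$src/core-site.xml" "$src/hdfs-site.xml"

# Container side (bootstrap.sh): archive the *contents* of the conf directory.
tar -C "$src" -czf "$shared/hadoop-conf.tgz" .

# Host side (tutorial): extract straight into Druid's hadoop-xml directory.
tar xzf "$shared/hadoop-conf.tgz" -C "$dst"
ls "$dst"
```

Archiving with `-C "$src" .` keeps the paths inside the tarball relative, so extraction lands the .xml files directly in the target directory rather than under a nested folder.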
diff --git a/examples/quickstart/tutorial/hadoop/docker/Dockerfile b/examples/quickstart/tutorial/hadoop/docker/Dockerfile
index dd527781469..1a0ecab589d 100644
--- a/examples/quickstart/tutorial/hadoop/docker/Dockerfile
+++ b/examples/quickstart/tutorial/hadoop/docker/Dockerfile
@@ -53,26 +53,29 @@ RUN rpm --import http://repos.azulsystems.com/RPM-GPG-KEY-azulsystems && \
yum -q -y update && \
yum -q -y upgrade && \
yum -q -y install zulu17-jdk && \
+ yum -q -y install nano net-tools telnet less unzip wget && \
yum clean all && \
rm -rf /var/cache/yum zulu-repo_${ZULU_REPO_VER}.noarch.rpm
ENV JAVA_HOME=/usr/lib/jvm/zulu17
-ENV PATH $PATH:$JAVA_HOME/bin
+ENV PATH=$PATH:$JAVA_HOME/bin
# hadoop
ARG APACHE_ARCHIVE_MIRROR_HOST=https://downloads.apache.org
RUN curl -s ${APACHE_ARCHIVE_MIRROR_HOST}/hadoop/core/hadoop-3.3.6/hadoop-3.3.6.tar.gz | tar -xz -C /usr/local/
RUN cd /usr/local && ln -s ./hadoop-3.3.6 hadoop
-ENV HADOOP_HOME /usr/local/hadoop
-ENV HADOOP_COMMON_HOME /usr/local/hadoop
-ENV HADOOP_HDFS_HOME /usr/local/hadoop
-ENV HADOOP_MAPRED_HOME /usr/local/hadoop
-ENV HADOOP_YARN_HOME /usr/local/hadoop
-ENV HADOOP_CONF_DIR /usr/local/hadoop/etc/hadoop
-ENV YARN_CONF_DIR $HADOOP_HOME/etc/hadoop
+ENV HADOOP_HOME=/usr/local/hadoop
+ENV HADOOP_COMMON_HOME=/usr/local/hadoop
+ENV HADOOP_HDFS_HOME=/usr/local/hadoop
+ENV HADOOP_MAPRED_HOME=/usr/local/hadoop
+ENV HADOOP_YARN_HOME=/usr/local/hadoop
+ENV HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
+ENV YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
# in hadoop 3 the example file is nearly empty so we can just append stuff
+RUN cat << EOT >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh
+
RUN sed -i '$ a export JAVA_HOME=/usr/lib/jvm/zulu17' $HADOOP_HOME/etc/hadoop/hadoop-env.sh
RUN sed -i '$ a export HADOOP_HOME=/usr/local/hadoop' $HADOOP_HOME/etc/hadoop/hadoop-env.sh
RUN sed -i '$ a export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop/' $HADOOP_HOME/etc/hadoop/hadoop-env.sh
@@ -81,8 +84,7 @@ RUN sed -i '$ a export HDFS_DATANODE_USER=root' $HADOOP_HOME/etc/hadoop/hadoop-e
RUN sed -i '$ a export HDFS_SECONDARYNAMENODE_USER=root' $HADOOP_HOME/etc/hadoop/hadoop-env.sh
RUN sed -i '$ a export YARN_RESOURCEMANAGER_USER=root' $HADOOP_HOME/etc/hadoop/hadoop-env.sh
RUN sed -i '$ a export YARN_NODEMANAGER_USER=root' $HADOOP_HOME/etc/hadoop/hadoop-env.sh
-
-RUN cat $HADOOP_HOME/etc/hadoop/hadoop-env.sh
+RUN sed -i '$ a export YARN_OPTS+=" --add-opens=java.base/java.lang=ALL-UNNAMED"' $HADOOP_HOME/etc/hadoop/hadoop-env.sh
RUN mkdir $HADOOP_HOME/input
RUN cp $HADOOP_HOME/etc/hadoop/*.xml $HADOOP_HOME/input
@@ -108,11 +110,14 @@ RUN chown root:root /root/.ssh/config
#
# ADD supervisord.conf /etc/supervisord.conf
+RUN wget -nv https://github.com/bitnami/wait-for-port/releases/download/v1.0/wait-for-port.zip && unzip wait-for-port.zip && mv wait-for-port /usr/bin && rm wait-for-port.zip
+RUN wget -nv https://github.com/apache/druid/raw/refs/heads/33.0.0/examples/quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz
+
ADD bootstrap.sh /etc/bootstrap.sh
RUN chown root:root /etc/bootstrap.sh
RUN chmod 700 /etc/bootstrap.sh
-ENV BOOTSTRAP /etc/bootstrap.sh
+ENV BOOTSTRAP=/etc/bootstrap.sh
# workingaround docker.io build error
RUN ls -la /usr/local/hadoop/etc/hadoop/*-env.sh
@@ -145,4 +150,4 @@ EXPOSE 10020 19888
#Yarn ports
EXPOSE 8030 8031 8032 8033 8040 8042 8088
#Other ports
-EXPOSE 2122 49707
\ No newline at end of file
+EXPOSE 2122 49707
diff --git a/examples/quickstart/tutorial/hadoop/docker/bootstrap.sh b/examples/quickstart/tutorial/hadoop/docker/bootstrap.sh
index d1fa493d4ea..4bf3ba63bef 100644
--- a/examples/quickstart/tutorial/hadoop/docker/bootstrap.sh
+++ b/examples/quickstart/tutorial/hadoop/docker/bootstrap.sh
@@ -27,11 +27,25 @@ cd $HADOOP_HOME/share/hadoop/common ; for cp in ${ACP//,/ }; do echo == $cp; cu
sed s/HOSTNAME/$HOSTNAME/ /usr/local/hadoop/etc/hadoop/core-site.xml.template > /usr/local/hadoop/etc/hadoop/core-site.xml
+export PATH+=:$HADOOP_HOME/bin
+export PATH+=:$HADOOP_HOME/sbin
+
start_sshd
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
+
+if [ ! -e .inited ] ; then
+ echo " * initialize hdfs for first run"
+ wait-for-port 9000
+ hdfs dfs -mkdir -p /druid/segments /quickstart /user
+ hdfs dfs -chmod -R 777 /druid /quickstart /user /tmp
+  hdfs dfs -put wikiticker-2015-09-12-sampled.json.gz /quickstart/wikiticker-2015-09-12-sampled.json.gz
+ tar -C $HADOOP_HOME/etc/hadoop -czf /shared/hadoop-conf.tgz .
+ touch .inited
+fi
+
if [[ $1 == "-d" ]]; then
while true; do sleep 1000; done
fi
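
The `.inited` sentinel added to bootstrap.sh is a standard idempotent-init guard: the one-time work (HDFS mkdirs, the sample-data put, packing the conf archive) runs only on first start, and container restarts skip it. A standalone sketch of just the guard — the names here are illustrative, not from the repo:

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"        # throwaway working directory for the sentinel
init_count=0

bootstrap_once() {
  if [ ! -e .inited ]; then
    # Stands in for the expensive first-run HDFS setup in bootstrap.sh.
    init_count=$((init_count + 1))
    touch .inited
  fi
}

bootstrap_once   # first container start: initialization runs
bootstrap_once   # simulated restart: sentinel present, init skipped
echo "$init_count"   # prints 1
```

Keeping the sentinel in the container filesystem (as bootstrap.sh does) means a fresh `docker run` re-initializes, while restarting the same container does not.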
diff --git a/examples/quickstart/tutorial/wikipedia-index-hadoop3.json b/examples/quickstart/tutorial/wikipedia-index-hadoop3.json
index 28db64e3268..30b60ffbb15 100644
--- a/examples/quickstart/tutorial/wikipedia-index-hadoop3.json
+++ b/examples/quickstart/tutorial/wikipedia-index-hadoop3.json
@@ -72,9 +72,10 @@
"mapreduce.reduce.java.opts" : "-Duser.timezone=UTC -Dfile.encoding=UTF-8",
"mapreduce.map.memory.mb" : 1024,
"mapreduce.reduce.memory.mb" : 1024,
+        "yarn.app.mapreduce.am.command-opts" : "--add-opens=java.base/java.lang=ALL-UNNAMED",
"mapreduce.job.classloader" : "true"
}
}
},
-  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client-api:3.3.1","org.apache.hadoop:hadoop-client-runtime:3.3.1"]
+  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client-api:3.3.6","org.apache.hadoop:hadoop-client-runtime:3.3.6"]
}
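
The spec change above bumps `hadoopDependencyCoordinates` from 3.3.1 to 3.3.6 so it matches the Hadoop version the tutorial image runs. One way to sanity-check that kind of agreement is to grep the version out of the coordinates; a hedged sketch that writes a stub spec to a temp file rather than assuming the repo layout:

```shell
#!/bin/sh
set -e
# Stub standing in for examples/quickstart/tutorial/wikipedia-index-hadoop3.json
spec=$(mktemp)
cat > "$spec" <<'EOF'
{"hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client-api:3.3.6","org.apache.hadoop:hadoop-client-runtime:3.3.6"]}
EOF

# Pull the hadoop-client-api version out of the Maven coordinates.
version=$(grep -o 'hadoop-client-api:[0-9][0-9.]*' "$spec" | cut -d: -f2)
echo "$version"   # prints 3.3.6
```

Against a real checkout, the same grep could be pointed at the spec file and compared with the Hadoop version pinned in the Dockerfile.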
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]