IMPALA-6070: Parallel data load.

This commit loads functional-query, TPC-H, and TPC-DS data in
parallel. In parallel, these take about 37 minutes, dominated by
functional-query. Serially, these take about 30 minutes more, namely the
13 minutes of tpch and 16 minutes of tpcds. This works out nicely
because CPU usage during data load is very low in aggregate. (We don't
sustain more than 1 CPU of load, whereas build machines are likely to
have many CPUs.)

To do this, I added two functions to run-step.sh:
run-step-backgroundable, which starts a step in the background, and
run-step-wait-all, which waits for all background steps and fails if
any of them failed.
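
As a minimal sketch of the background-and-wait pattern (not the code
from this commit): start_bg and wait_all are hypothetical names, and
sleep, true, and false stand in for real load steps. Each task is
started with & and later joined by pid with wait, so that a failure in
any one task is reported:

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the pattern; not the committed code.
declare -a PIDS=()
declare -a NAMES=()

# Start a command in the background and remember its pid and label.
start_bg() {
  local name="$1"
  shift
  "$@" &
  PIDS+=("$!")
  NAMES+=("$name")
}

# Wait for every recorded pid. Return 1 if any task failed.
wait_all() {
  local ret=0 idx
  for idx in "${!PIDS[@]}"; do
    if ! wait "${PIDS[$idx]}"; then
      echo "Background task ${NAMES[$idx]} (pid ${PIDS[$idx]}) failed."
      ret=1
    fi
  done
  PIDS=()
  NAMES=()
  return $ret
}

# Stand-in tasks: one slow, one fast, one that fails.
start_bg "slow step" sleep 1
start_bg "fast step" true
start_bg "failing step" false

if wait_all; then
  echo "All background tasks succeeded."
else
  echo "At least one background task failed."
fi
```

Waiting on each pid individually, rather than calling a bare wait, is
what lets the caller see each task's exit status.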

I also increased the heap size of our HiveServer2 server to 512MB. When
datasets were being loaded in parallel, we ran out of memory at 256MB of
heap.
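
The diff sets the heap size by prefixing the hive command with a
variable assignment. As a small illustration of how such a prefix is
scoped in bash (DEMO_HEAPSIZE is a hypothetical variable, used here so
that nothing depends on a Hive installation), the value is visible to
the one child process and not to the invoking shell:

```shell
#!/usr/bin/env bash
# DEMO_HEAPSIZE is a hypothetical variable used only for this illustration.
unset DEMO_HEAPSIZE

# The assignment prefix applies to this one child process only.
child=$(DEMO_HEAPSIZE="512" bash -c 'echo "${DEMO_HEAPSIZE:-unset}"')
echo "child sees:  $child"

# The invoking shell's environment is left untouched.
echo "parent sees: ${DEMO_HEAPSIZE:-unset}"
```

Run as written, the child reports 512 and the parent reports unset.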

The resulting log output is currently like so (but without the
timestamps):

15:58:04  Started Loading functional-query data in background; pid 8105.
15:58:04  Started Loading TPC-H data in background; pid 8106.
15:58:04  Loading functional-query data (logging to /home/impdev/Impala/logs/data_loading/load-functional-query.log)...
15:58:04  Started Loading TPC-DS data in background; pid 8107.
15:58:04  Loading TPC-H data (logging to /home/impdev/Impala/logs/data_loading/load-tpch.log)...
15:58:04  Loading TPC-DS data (logging to /home/impdev/Impala/logs/data_loading/load-tpcds.log)...
16:11:31    Loading workload 'tpch' using exploration strategy 'core' OK (Took: 13 min 27 sec)
16:14:33    Loading workload 'tpcds' using exploration strategy 'core' OK (Took: 16 min 29 sec)
16:35:08    Loading workload 'functional-query' using exploration strategy 'exhaustive' OK (Took: 37 min 4 sec)
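
The "about 30 minutes" saving can be checked against the "Took:" values
in the log above. This is a rough sketch that assumes each step would
take the same time when run on its own, which the log does not measure:

```shell
#!/usr/bin/env bash
# Durations copied from the "Took:" lines in the log, in seconds.
tpch=$((13 * 60 + 27))
tpcds=$((16 * 60 + 29))
functional_query=$((37 * 60 + 4))

# Assumption: each step would take the same time when run alone.
serial=$((tpch + tpcds + functional_query))

# In parallel, the wall-clock time is that of the slowest step.
parallel=0
for t in "$tpch" "$tpcds" "$functional_query"; do
  if [ "$t" -gt "$parallel" ]; then parallel=$t; fi
done

# Format a number of seconds the way run-step.sh prints elapsed time.
fmt() { echo "$(($1 / 60)) min $(($1 % 60)) sec"; }

echo "serial:   $(fmt "$serial")"
echo "parallel: $(fmt "$parallel")"
echo "saved:    $(fmt $((serial - parallel)))"
```

Under that assumption the serial total is 67 min 0 sec, the parallel
total is 37 min 4 sec, and the difference is 29 min 56 sec.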

I tested dataloading with the following command on an 8-core, 32GB
machine. I saw 19GB of available memory during my run:
  ./buildall.sh -testdata -build_shared_libs -start_minicluster -start_impala_cluster -format

Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
Reviewed-on: http://gerrit.cloudera.org:8080/8320
Reviewed-by: Jim Apple <[email protected]>
Reviewed-by: Michael Brown <[email protected]>
Reviewed-by: Alex Behm <[email protected]>
Tested-by: Impala Public Jenkins


Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/e020c371
Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/e020c371
Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/e020c371

Branch: refs/heads/master
Commit: e020c37106383be5416f882cbe11fc25efad8968
Parents: 77e010a
Author: Philip Zeyliger <[email protected]>
Authored: Tue Oct 17 17:10:49 2017 -0700
Committer: Impala Public Jenkins <[email protected]>
Committed: Wed Oct 25 00:00:25 2017 +0000

----------------------------------------------------------------------
 testdata/bin/create-load-data.sh | 11 ++++++++---
 testdata/bin/run-hive-server.sh  |  2 +-
 testdata/bin/run-step.sh         | 36 ++++++++++++++++++++++++++++++++++-
 3 files changed, 44 insertions(+), 5 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/e020c371/testdata/bin/create-load-data.sh
----------------------------------------------------------------------
diff --git a/testdata/bin/create-load-data.sh b/testdata/bin/create-load-data.sh
index 97787c2..099fe59 100755
--- a/testdata/bin/create-load-data.sh
+++ b/testdata/bin/create-load-data.sh
@@ -477,9 +477,15 @@ fi
 
 if [ $SKIP_METADATA_LOAD -eq 0 ]; then
   run-step "Loading custom schemas" load-custom-schemas.log load-custom-schemas
-  run-step "Loading functional-query data" load-functional-query.log \
+  # Run some steps in parallel, with run-step-backgroundable / run-step-wait-all.
+  # This is effective on steps that take a long time and don't depend on each
+  # other. Functional-query takes about ~35 minutes, and TPC-H and TPC-DS can
+  # finish while functional-query is running.
+  run-step-backgroundable "Loading functional-query data" load-functional-query.log \
       load-data "functional-query" "exhaustive"
-  run-step "Loading TPC-H data" load-tpch.log load-data "tpch" "core"
+  run-step-backgroundable "Loading TPC-H data" load-tpch.log load-data "tpch" "core"
+  run-step-backgroundable "Loading TPC-DS data" load-tpcds.log load-data "tpcds" "core"
+  run-step-wait-all
   # Load tpch nested data.
   # TODO: Hacky and introduces more complexity into the system, but it is expedient.
   if [[ -n "$CM_HOST" ]]; then
@@ -487,7 +493,6 @@ if [ $SKIP_METADATA_LOAD -eq 0 ]; then
   fi
   run-step "Loading nested data" load-nested.log \
     ${IMPALA_HOME}/testdata/bin/load_nested.py ${LOAD_NESTED_ARGS:-}
-  run-step "Loading TPC-DS data" load-tpcds.log load-data "tpcds" "core"
   run-step "Loading auxiliary workloads" load-aux-workloads.log load-aux-workloads
   run-step "Loading dependent tables" copy-and-load-dependent-tables.log \
       copy-and-load-dependent-tables

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/e020c371/testdata/bin/run-hive-server.sh
----------------------------------------------------------------------
diff --git a/testdata/bin/run-hive-server.sh b/testdata/bin/run-hive-server.sh
index 530b804..42d95b5 100755
--- a/testdata/bin/run-hive-server.sh
+++ b/testdata/bin/run-hive-server.sh
@@ -72,7 +72,7 @@ ${CLUSTER_BIN}/wait-for-metastore.py --transport=${METASTORE_TRANSPORT}
 if [ ${ONLY_METASTORE} -eq 0 ]; then
   # Starts a HiveServer2 instance on the port specified by the HIVE_SERVER2_THRIFT_PORT
   # environment variable.
-  hive --service hiveserver2 > ${LOGDIR}/hive-server2.out 2>&1 &
+  HADOOP_HEAPSIZE="512" hive --service hiveserver2 > ${LOGDIR}/hive-server2.out 2>&1 &
 
   # Wait for the HiveServer2 service to come up because callers of this script
   # may rely on it being available.

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/e020c371/testdata/bin/run-step.sh
----------------------------------------------------------------------
diff --git a/testdata/bin/run-step.sh b/testdata/bin/run-step.sh
index 45c5774..9943013 100755
--- a/testdata/bin/run-step.sh
+++ b/testdata/bin/run-step.sh
@@ -48,5 +48,39 @@ function run-step {
     return 1
   fi
   ELAPSED_TIME=$(($SECONDS - $START_TIME))
-  echo "    OK (Took: $(($ELAPSED_TIME/60)) min $(($ELAPSED_TIME%60)) sec)"
+  echo "  ${MSG} OK (Took: $(($ELAPSED_TIME/60)) min $(($ELAPSED_TIME%60)) sec)"
+}
+
+# Array to manage background tasks.
+declare -a RUN_STEP_PIDS
+declare -a RUN_STEP_MSGS
+
+# Runs the given step in the background. Many tasks may be started in the
+# background, and all of them must be joined together with run-step-wait-all.
+# No dependency management or maximums on number of tasks are provided.
+function run-step-backgroundable {
+  MSG="$1"
+  run-step "$@" &
+  local pid=$!
+  echo "Started ${MSG} in background; pid $pid."
+  RUN_STEP_PIDS+=($pid)
+  RUN_STEP_MSGS+=("${MSG}")
+}
+
+# Wait for all tasks that were run with run-step-backgroundable.
+# Fails if any of the background tasks has failed. Clears $RUN_STEP_PIDS.
+function run-step-wait-all {
+  local ret=0
+  for idx in "${!RUN_STEP_PIDS[@]}"; do
+    pid="${RUN_STEP_PIDS[$idx]}"
+    msg="${RUN_STEP_MSGS[$idx]}"
+
+    if ! wait $pid; then
+      ret=1
+      echo "Background task $msg (pid $pid) failed."
+    fi
+  done
+  RUN_STEP_PIDS=()
+  RUN_STEP_MSGS=()
+  return $ret
 }
