impala git commit: IMPALA-6972: Disable parallel dataload on MINICLUSTER_PROFILE=2

joemcdonnell Thu, 10 May 2018 14:49:11 -0700

Repository: impala
Updated Branches:
  refs/heads/2.x 59ccbbd70 -> 3bfee642d



IMPALA-6972: Disable parallel dataload on MINICLUSTER_PROFILE=2

Hive 1.1.0 has a bug that can cause a
NullPointerException during parallel Hive
operations (see IMPALA-6532). Since IMPALA-6372,
dataload runs its Hive loads in parallel, so it
can hit this error on Hive 1.1.0 (i.e.
IMPALA_MINICLUSTER_PROFILE=2). This causes
intermittent build failures on the 2.x branch.

This disables parallel dataload for IMPALA_MINICLUSTER_PROFILE=2.

IMPALA_MINICLUSTER_PROFILE=3 uses a newer version
of Hive that has a fix for this bug, so parallel
dataload stays enabled for that profile.

Parallelism can be reenabled when Hive 1.1.0 gets the
fix from Hive 2.1.1.

Change-Id: I90a0f2b3756d7192fa7db2958031b8c88eb606e6
Reviewed-on: http://gerrit.cloudera.org:8080/10306
Reviewed-by: Philip Zeyliger <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-on: http://gerrit.cloudera.org:8080/10367


Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/3bfee642
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/3bfee642
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/3bfee642

Branch: refs/heads/2.x
Commit: 3bfee642dd8a8134c173a2b622ac2b265d309027
Parents: 59ccbbd
Author: Joe McDonnell <[email protected]>
Authored: Thu May 3 16:56:05 2018 -0700
Committer: Impala Public Jenkins <[email protected]>
Committed: Thu May 10 21:48:11 2018 +0000

----------------------------------------------------------------------
 bin/impala-config.sh             | 5 +++++
 bin/load-data.py                 | 2 +-
 testdata/bin/create-load-data.sh | 6 ++++++
 3 files changed, 12 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/impala/blob/3bfee642/bin/impala-config.sh
----------------------------------------------------------------------
diff --git a/bin/impala-config.sh b/bin/impala-config.sh
index 6dff3d7..e3bd591 100755
--- a/bin/impala-config.sh
+++ b/bin/impala-config.sh
@@ -177,6 +177,11 @@ export KUDU_JAVA_VERSION=1.8.0-cdh5.16.0-SNAPSHOT
 # ------------------------------------------
 export CDH_MAJOR_VERSION=5
 export IMPALA_HADOOP_VERSION=2.6.0-cdh5.16.0-SNAPSHOT
+export IMPALA_MINICLUSTER_PROFILE=2
+# IMPALA-6972: Temporarily disable Hive parallelism during dataload
+# The Hive version used for IMPALA_MINICLUSTER_PROFILE=2 has a concurrency issue
+# that intermittently fails parallel dataload.
+export IMPALA_SERIAL_DATALOAD=1
 unset IMPALA_HADOOP_URL
 export IMPALA_HBASE_VERSION=1.2.0-cdh5.16.0-SNAPSHOT
 unset IMPALA_HBASE_URL

http://git-wip-us.apache.org/repos/asf/impala/blob/3bfee642/bin/load-data.py
----------------------------------------------------------------------
diff --git a/bin/load-data.py b/bin/load-data.py
index 28a504f..2b9e05c 100755
--- a/bin/load-data.py
+++ b/bin/load-data.py
@@ -78,7 +78,7 @@ parser.add_option("--use_kerberos", action="store_true", default=False,
                   help="Load data on a kerberized cluster.")
 parser.add_option("--principal", default=None, dest="principal",
                   help="Kerberos service principal, required if --use_kerberos is set")
-parser.add_option("--num_processes", default=multiprocessing.cpu_count(),
+parser.add_option("--num_processes", type="int", default=multiprocessing.cpu_count(),
                   dest="num_processes", help="Number of parallel processes to use.")
 
 options, args = parser.parse_args()
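
The type="int" addition above matters once create-load-data.sh starts passing
"--num_processes 1" on the command line. The following is a minimal standalone
sketch, not part of the commit, of how optparse treats the option with and
without a declared type. The build_parser helper and its "typed" parameter are
hypothetical names used only for this illustration.

```python
import multiprocessing
from optparse import OptionParser


def build_parser(typed):
    """Build a parser with --num_processes, optionally declared as an int.

    Hypothetical helper for illustration; load-data.py defines its
    options at module level instead.
    """
    parser = OptionParser()
    extra = {"type": "int"} if typed else {}
    parser.add_option("--num_processes", default=multiprocessing.cpu_count(),
                      dest="num_processes",
                      help="Number of parallel processes to use.", **extra)
    return parser


argv = ["--num_processes", "1"]

# Without a declared type, optparse stores the command-line value as a string.
untyped_opts, _ = build_parser(typed=False).parse_args(argv)

# With type="int", optparse converts the value before storing it.
typed_opts, _ = build_parser(typed=True).parse_args(argv)

print(type(untyped_opts.num_processes).__name__, repr(untyped_opts.num_processes))
print(type(typed_opts.num_processes).__name__, repr(typed_opts.num_processes))
```

Without the declared type, the default (an integer from
multiprocessing.cpu_count()) and a value supplied on the command line (a
string) would have different types, which code expecting an integer process
count may not handle.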

http://git-wip-us.apache.org/repos/asf/impala/blob/3bfee642/testdata/bin/create-load-data.sh
----------------------------------------------------------------------
diff --git a/testdata/bin/create-load-data.sh b/testdata/bin/create-load-data.sh
index fcb7e69..a97a2e2 100755
--- a/testdata/bin/create-load-data.sh
+++ b/testdata/bin/create-load-data.sh
@@ -41,6 +41,7 @@ trap 'echo Error in $0 at line $LINENO: $(cd "'$PWD'" && awk "NR == $LINENO" $0)
 : ${IMPALAD=localhost:21000}
 : ${REMOTE_LOAD=}
 : ${CM_HOST=}
+: ${IMPALA_SERIAL_DATALOAD=}
 
 SKIP_METADATA_LOAD=0
 SKIP_SNAPSHOT_LOAD=0
@@ -208,6 +209,11 @@ function load-data {
   ARGS+=("--hive_hs2_hostport ${HS2_HOST_PORT}")
   ARGS+=("--hdfs_namenode ${HDFS_NN}")
 
+  # Disable parallelism for dataload if IMPALA_SERIAL_DATALOAD is set
+  if [[ "${IMPALA_SERIAL_DATALOAD}" -eq 1 ]]; then
+    ARGS+=("--num_processes 1")
+  fi
+
   if [[ -n ${TABLE_FORMATS} ]]; then
     # TBL_FMT_STR replaces slashes with underscores,
     # e.g., kudu/none/none -> kudu_none_none
