Repository: impala
Updated Branches:
  refs/heads/2.x 59ccbbd70 -> 3bfee642d


IMPALA-6972: Disable parallel dataload on MINICLUSTER_PROFILE=2

There is a bug in Hive 1.1.0 that can result in a NullPointerException
when doing parallel Hive operations (see IMPALA-6532). Since dataload
runs Hive loads in parallel starting with IMPALA-6372, dataload can hit
this error on Hive 1.1.0 (i.e. IMPALA_MINICLUSTER_PROFILE=2). This is
impacting builds on the 2.x branch.

This disables parallel dataload for IMPALA_MINICLUSTER_PROFILE=2.
IMPALA_MINICLUSTER_PROFILE=3 uses a newer version of Hive that has a
fix for this, so parallel dataload stays enabled for that case.
Parallelism can be reenabled when Hive 1.1.0 gets the fix from
Hive 2.1.1.

Change-Id: I90a0f2b3756d7192fa7db2958031b8c88eb606e6
Reviewed-on: http://gerrit.cloudera.org:8080/10306
Reviewed-by: Philip Zeyliger <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-on: http://gerrit.cloudera.org:8080/10367


Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/3bfee642
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/3bfee642
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/3bfee642

Branch: refs/heads/2.x
Commit: 3bfee642dd8a8134c173a2b622ac2b265d309027
Parents: 59ccbbd
Author: Joe McDonnell <[email protected]>
Authored: Thu May 3 16:56:05 2018 -0700
Committer: Impala Public Jenkins <[email protected]>
Committed: Thu May 10 21:48:11 2018 +0000

----------------------------------------------------------------------
 bin/impala-config.sh             | 5 +++++
 bin/load-data.py                 | 2 +-
 testdata/bin/create-load-data.sh | 6 ++++++
 3 files changed, 12 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/impala/blob/3bfee642/bin/impala-config.sh
----------------------------------------------------------------------
diff --git a/bin/impala-config.sh b/bin/impala-config.sh
index 6dff3d7..e3bd591 100755
--- a/bin/impala-config.sh
+++ b/bin/impala-config.sh
@@ -177,6 +177,11 @@ export KUDU_JAVA_VERSION=1.8.0-cdh5.16.0-SNAPSHOT
 # ------------------------------------------
 export CDH_MAJOR_VERSION=5
 export IMPALA_HADOOP_VERSION=2.6.0-cdh5.16.0-SNAPSHOT
+export IMPALA_MINICLUSTER_PROFILE=2
+# IMPALA-6972: Temporarily disable Hive parallelism during dataload
+# The Hive version used for IMPALA_MINICLUSTER_PROFILE=2 has a concurrency issue
+# that intermittently fails parallel dataload.
+export IMPALA_SERIAL_DATALOAD=1
 unset IMPALA_HADOOP_URL
 export IMPALA_HBASE_VERSION=1.2.0-cdh5.16.0-SNAPSHOT
 unset IMPALA_HBASE_URL

http://git-wip-us.apache.org/repos/asf/impala/blob/3bfee642/bin/load-data.py
----------------------------------------------------------------------
diff --git a/bin/load-data.py b/bin/load-data.py
index 28a504f..2b9e05c 100755
--- a/bin/load-data.py
+++ b/bin/load-data.py
@@ -78,7 +78,7 @@ parser.add_option("--use_kerberos", action="store_true", default=False,
                   help="Load data on a kerberized cluster.")
 parser.add_option("--principal", default=None, dest="principal",
                   help="Kerberos service principal, required if --use_kerberos is set")
-parser.add_option("--num_processes", default=multiprocessing.cpu_count(),
+parser.add_option("--num_processes", type="int", default=multiprocessing.cpu_count(),
                   dest="num_processes", help="Number of parallel processes to use.")

 options, args = parser.parse_args()

http://git-wip-us.apache.org/repos/asf/impala/blob/3bfee642/testdata/bin/create-load-data.sh
----------------------------------------------------------------------
diff --git a/testdata/bin/create-load-data.sh b/testdata/bin/create-load-data.sh
index fcb7e69..a97a2e2 100755
--- a/testdata/bin/create-load-data.sh
+++ b/testdata/bin/create-load-data.sh
@@ -41,6 +41,7 @@ trap 'echo Error in $0 at line $LINENO: $(cd "'$PWD'" && awk "NR == $LINENO" $0)
 : ${IMPALAD=localhost:21000}
 : ${REMOTE_LOAD=}
 : ${CM_HOST=}
+: ${IMPALA_SERIAL_DATALOAD=}

 SKIP_METADATA_LOAD=0
 SKIP_SNAPSHOT_LOAD=0
@@ -208,6 +209,11 @@ function load-data {
   ARGS+=("--hive_hs2_hostport ${HS2_HOST_PORT}")
   ARGS+=("--hdfs_namenode ${HDFS_NN}")

+  # Disable parallelism for dataload if IMPALA_SERIAL_DATALOAD is set
+  if [[ "${IMPALA_SERIAL_DATALOAD}" -eq 1 ]]; then
+    ARGS+=("--num_processes 1")
+  fi
+
   if [[ -n ${TABLE_FORMATS} ]]; then
     # TBL_FMT_STR replaces slashes with underscores,
     # e.g., kudu/none/none -> kudu_none_none
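
Note on the load-data.py change: before this commit, nothing in the diff passed --num_processes on the command line, so only the integer default from multiprocessing.cpu_count() was used. Once create-load-data.sh passes "--num_processes 1", the option value arrives through optparse, which stores values as strings unless a type is declared. The following is a minimal standalone sketch of that behavior, not the actual load-data.py; the helper names build_parser and parse_num_processes are hypothetical and exist only for illustration.

```python
import multiprocessing
from optparse import OptionParser


def build_parser(typed):
    """Build a parser with --num_processes, optionally declared as type="int".

    Hypothetical helper: it mirrors only the single option touched by the diff.
    """
    parser = OptionParser()
    # Passing type="int" is the one-line change made by the commit.
    kwargs = {"type": "int"} if typed else {}
    parser.add_option("--num_processes", default=multiprocessing.cpu_count(),
                      dest="num_processes",
                      help="Number of parallel processes to use.", **kwargs)
    return parser


def parse_num_processes(typed, argv):
    """Parse argv and return the resulting num_processes value."""
    options, _ = build_parser(typed).parse_args(argv)
    return options.num_processes


if __name__ == "__main__":
    argv = ["--num_processes", "1"]
    # Without a declared type, optparse keeps the raw string from argv.
    print(repr(parse_num_processes(False, argv)))
    # With type="int", optparse converts the value before storing it.
    print(repr(parse_num_processes(True, argv)))
```

Without the type declaration, code that expects an integer process count would receive the string "1" whenever the flag is given explicitly, while the default path would still yield an integer. Declaring type="int" makes both paths produce an integer.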
