Repository: incubator-impala
Updated Branches:
  refs/heads/master 9b923a1a2 -> 768fc0ea2

IMPALA-4593,IMPALA-4635: fix some python build issues

Build C/C++ packages with toolchain GCC to avoid ABI compatibility
issues. This requires a multi-step bootstrapping process:
1. install basic non-C/C++ packages into the virtualenv
2. use Python 2.7 from the virtualenv to bootstrap the toolchain
3. use toolchain gcc to build C/C++ packages
4. build the kudu-python package with toolchain gcc and Cython

To avoid potentially pulling in cached versions of packages built with
a different compiler, this patch also disables pip's caching. This
should not have a significant effect on performance since we've enabled
ccache and cache downloaded packages in infra/python/deps.

Improve bootstrapping time significantly by using ccache and by
parallelising the numpy build - the most expensive part of the
install process. On a system with a warmed-up ccache, bootstrapping
after deleting infra/python/env takes 1m16s. Previously it could take
over 5m.

Testing:
Tested manually on Ubuntu 16.04 to confirm that it fixes the ABI problem
mentioned in IMPALA-4593. Initially "import kudu" failed in my dev
environment. After deleting infra/python/env and re-bootstrapping,
"import kudu" succeeded.

Also ran the standard test suite on CentOS 6 and built Impala on a range
of platforms (CentOS 5,6,7; SLES 11,12; Debian 6,7;
Ubuntu12.04,14.04,16.04) to make sure nothing broke.

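The staged behaviour described in the commit message (complete as many steps as the available dependencies allow, then stop) can be sketched roughly as follows. This is an illustration under assumptions, not code from the patch: run_bootstrap(), demo() and the step names are hypothetical, and each step is reduced to a callable that reports whether it could complete.

```python
# Illustrative sketch only. run_bootstrap(), demo() and the step names are
# hypothetical and do not appear in bootstrap_virtualenv.py.

def run_bootstrap(steps):
    """Run (name, step) pairs in order, stopping at the first step that cannot
    complete yet. Each step is a callable that returns True once its packages
    are installed and False if a prerequisite is still missing.
    Returns the names of the steps that completed."""
    completed = []
    for name, step in steps:
        if not step():
            break
        completed.append(name)
    return completed


def demo(toolchain_available):
    """Simulate one run of the script with or without a bootstrapped toolchain."""
    steps = [
        ("basic", lambda: True),                    # pure-Python packages
        ("compiled", lambda: toolchain_available),  # needs toolchain gcc
        ("kudu", lambda: toolchain_available),      # needs gcc and Cython
    ]
    return run_bootstrap(steps)


if __name__ == "__main__":
    print(demo(toolchain_available=False))
    print(demo(toolchain_available=True))
```

Running the script a second time once the toolchain exists picks up the remaining steps, which is why every step has to be safe to re-run.
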
Change-Id: I9e807510eddeb354069e0478363f649a1c1b75cf
Reviewed-on: http://gerrit.cloudera.org:8080/6218
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins


Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/c8e15e48
Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/c8e15e48
Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/c8e15e48

Branch: refs/heads/master
Commit: c8e15e484ce7400da0c36a603b72514c1f41cb00
Parents: 9b923a1
Author: Tim Armstrong <[email protected]>
Authored: Tue Feb 28 17:18:30 2017 -0800
Committer: Impala Public Jenkins <[email protected]>
Committed: Tue Mar 7 02:56:18 2017 +0000

----------------------------------------------------------------------
 infra/python/README                         |   3 +-
 infra/python/bootstrap_virtualenv.py        | 188 +++++++++++++++--------
 infra/python/deps/compiled-requirements.txt |  38 +++++
 infra/python/deps/download_requirements     |   7 +-
 infra/python/deps/kudu-requirements.txt     |  22 +++
 infra/python/deps/pip_download.py           |  10 +-
 infra/python/deps/requirements.txt          |  27 +---
 7 files changed, 198 insertions(+), 97 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/c8e15e48/infra/python/README
----------------------------------------------------------------------
diff --git a/infra/python/README b/infra/python/README
index f63b198..9713dad 100644
--- a/infra/python/README
+++ b/infra/python/README
@@ -1,6 +1,7 @@
 To install new packages:
 
-1) Add your package to deps/requirements.txt. You should specify the version number
+1) Add your package to deps/requirements.txt, or deps/compiled-requirements.txt if the
+   the package needs a C/C++ compiler to build . You should specify the version number
    using the "foo == x.y.z" notation so future upgrades can be done automatically.
 2) Run deps/download_requirements, it will download the package to the deps dir.
 3) Run the "impala-python" command, this should detect that requirements.txt changed and

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/c8e15e48/infra/python/bootstrap_virtualenv.py
----------------------------------------------------------------------
diff --git a/infra/python/bootstrap_virtualenv.py b/infra/python/bootstrap_virtualenv.py
index 9c5fa13..0d6b3c8 100644
--- a/infra/python/bootstrap_virtualenv.py
+++ b/infra/python/bootstrap_virtualenv.py
@@ -15,9 +15,18 @@
 # specific language governing permissions and limitations
 # under the License.
 
-# This module will create a python virtual env and install external dependencies. If
-# the virtualenv already exists and the list of dependencies matches the list of
-# installed dependencies, nothing will be done.
+# This module will create a python virtual env and install external dependencies. If the
+# virtualenv already exists and it contains all the expected packages, nothing is done.
+#
+# A multi-step bootstrapping process is required to build and install all of the
+# dependencies:
+# 1. install basic non-C/C++ packages into the virtualenv
+# 2. use the virtualenv Python to bootstrap the toolchain
+# 3. use toolchain gcc to build C/C++ packages
+# 4. build the kudu-python package with toolchain gcc and Cython
+#
+# Every time this script is run, it completes as many of the bootstrapping steps as
+# possible with the available dependencies.
 #
 # This module can be run with python >= 2.4 but python >= 2.6 must be installed on the
 # system. If the default 'python' command refers to < 2.6, python 2.6 will be used
@@ -43,10 +52,13 @@ ENV_DIR = os.path.join(os.path.dirname(__file__), "env")
 # Generated using "pip install --download <DIR> -r requirements.txt"
 REQS_PATH = os.path.join(DEPS_DIR, "requirements.txt")
 
-# After installing, the requirements.txt will be copied into the virtualenv to
-# record what was installed.
-INSTALLED_REQS_PATH = os.path.join(ENV_DIR, "installed-requirements.txt")
+# Requirements for the next bootstrapping step that builds compiled requirements
+# with toolchain gcc.
+COMPILED_REQS_PATH = os.path.join(DEPS_DIR, "compiled-requirements.txt")
 
+# Requirements for the Kudu bootstrapping step, which depends on Cython being installed
+# by the compiled requirements step.
+KUDU_REQS_PATH = os.path.join(DEPS_DIR, "kudu-requirements.txt")
 
 def delete_virtualenv_if_exist():
   if os.path.exists(ENV_DIR):
@@ -80,17 +92,53 @@ def exec_cmd(args, **kwargs):
         % (args, output))
   return output
 
+def use_ccache():
+  '''Returns true if ccache is available and should be used'''
+  if 'DISABLE_CCACHE' in os.environ: return False
+  try:
+    exec_cmd(['ccache', '-V'])
+    return True
+  except:
+    return False
+
+def select_cc():
+  '''Return the C compiler command that should be used as a string or None if the
+  compiler is not available '''
+  # Use toolchain gcc for ABI compatibility with other toolchain packages, e.g.
+  # Kudu/kudu-python
+  if not have_toolchain(): return None
+  toolchain_gcc_dir = toolchain_pkg_dir("gcc")
+  cc = os.path.join(toolchain_gcc_dir, "bin/gcc")
+  if not os.path.exists(cc): return None
+  if use_ccache(): cc = "ccache %s" % cc
+  return cc
+
+def exec_pip_install(args, cc="no-cc-available", env=None):
+  '''Executes "pip install" with the provided command line arguments. If 'cc' is set,
+  it is used as the C compiler. Otherwise compilation of C/C++ code is disabled by
+  setting the CC environment variable to a bogus value.
+  Other environment vars can optionally be set with the 'env' argument. By default the
+  current process's command line arguments are inherited.'''
+  if not env: env = dict(os.environ)
+  env["CC"] = cc
+
+  # Parallelize the slow numpy build.
+  # Use getconf instead of nproc because it is supported more widely, e.g. on older
+  # linux distributions.
+  env["NPY_NUM_BUILD_JOBS"] = exec_cmd(["getconf", "_NPROCESSORS_ONLN"]).strip()
 
-def exec_pip_install(args, **popen_kwargs):
   # Don't call the virtualenv pip directly, it uses a hashbang to to call the python
   # virtualenv using an absolute path. If the path to the virtualenv is very long, the
   # hashbang won't work.
   #
   # Passes --no-binary for IMPALA-3767: without this, Cython (and
   # several other packages) fail download.
+  #
+  # --no-cache-dir is used to prevent caching of compiled artifacts, which may be built
+  # with different compilers or settings.
   exec_cmd([os.path.join(ENV_DIR, "bin", "python"), os.path.join(ENV_DIR, "bin", "pip"),
-    "install", "--no-binary", "--no-index", "--find-links",
-    "file://%s" % urllib.pathname2url(os.path.abspath(DEPS_DIR))] + args, **popen_kwargs)
+    "install", "--no-binary", "--no-index", "--no-cache-dir", "--find-links",
+    "file://%s" % urllib.pathname2url(os.path.abspath(DEPS_DIR))] + args, env=env)
 
 
 def find_file(*paths):
@@ -128,60 +176,62 @@ def detect_python_cmd():
 def install_deps():
   LOG.info("Installing packages into the virtualenv")
   exec_pip_install(["-r", REQS_PATH])
-  shutil.copyfile(REQS_PATH, INSTALLED_REQS_PATH)
+  mark_reqs_installed(REQS_PATH)
+
+def have_toolchain():
+  '''Return true if the Impala toolchain is available'''
+  return "IMPALA_TOOLCHAIN" in os.environ
+
+def toolchain_pkg_dir(pkg_name):
+  '''Return the path to the toolchain package'''
+  pkg_version = os.environ["IMPALA_" + pkg_name.upper() + "_VERSION"]
+  return os.path.join(os.environ["IMPALA_TOOLCHAIN"], pkg_name + "-" + pkg_version)
+
+def install_compiled_deps_if_possible():
+  '''Install dependencies that require compilation with toolchain GCC, if the toolchain
+  is available. Returns true if the deps are installed'''
+  if reqs_are_installed(COMPILED_REQS_PATH):
+    LOG.debug("Skipping compiled deps: matching compiled-installed-requirements.txt found")
+    return True
+  cc = select_cc()
+  if cc is None:
+    LOG.debug("Skipping compiled deps: cc not available yet")
+    return False
+
+  env = dict(os.environ)
+  # Compilation of pycrypto fails on CentOS 5 with newer GCC versions because of a
+  # problem with inline declarations in older libc headers. Setting -fgnu89-inline is a
+  # workaround.
+  distro_version = ''.join(exec_cmd(["lsb_release", "-irs"]).lower().split())
+  print distro_version
+  if distro_version.startswith("centos5."):
+    env["CFLAGS"] = "-fgnu89-inline"
+
+  LOG.info("Installing compiled requirements into the virtualenv")
+  exec_pip_install(["-r", COMPILED_REQS_PATH], cc=cc, env=env)
+  mark_reqs_installed(COMPILED_REQS_PATH)
+  return True
 
 
 def install_kudu_client_if_possible():
-  """Installs the Kudu python module if possible. The Kudu module is the only one that
-  requires the toolchain. If the toolchain isn't in use or hasn't been populated
-  yet, nothing will be done. Also nothing will be done if the Kudu client lib required
-  by the module isn't available (as determined by KUDU_IS_SUPPORTED).
-  """
+  '''Installs the Kudu python module if possible, which depends on the toolchain and
+  the compiled requirements in compiled-requirements.txt. If the toolchain isn't
+  available, nothing will be done. Also nothing will be done if the Kudu client lib
+  required by the module isn't available (as determined by KUDU_IS_SUPPORTED)'''
+  if reqs_are_installed(KUDU_REQS_PATH):
+    LOG.debug("Skipping Kudu: matching kudu-installed-requirements.txt found")
+    return
   if os.environ["KUDU_IS_SUPPORTED"] != "true":
     LOG.debug("Skipping Kudu: Kudu is not supported")
     return
-  impala_toolchain_dir = os.environ.get("IMPALA_TOOLCHAIN")
-  if not impala_toolchain_dir:
+  if not have_toolchain():
     LOG.debug("Skipping Kudu: IMPALA_TOOLCHAIN not set")
     return
-  toolchain_kudu_dir = os.path.join(
-      impala_toolchain_dir, "kudu-" + os.environ["IMPALA_KUDU_VERSION"])
+  toolchain_kudu_dir = toolchain_pkg_dir("kudu")
   if not os.path.exists(toolchain_kudu_dir):
     LOG.debug("Skipping Kudu: %s doesn't exist" % toolchain_kudu_dir)
     return
 
-  # The "pip" command could be used to provide the version of Kudu installed (if any)
-  # but it's a little too slow. Running the virtualenv python to detect the installed
-  # version is faster.
-  actual_version_string = exec_cmd([os.path.join(ENV_DIR, "bin", "python"), "-c",
-      textwrap.dedent("""
-      try:
-        import kudu
-        print kudu.__version__
-      except ImportError:
-        pass""")]).strip()
-  actual_version = [int(v) for v in actual_version_string.split(".") if v]
-
-  reqs_file = open(REQS_PATH)
-  try:
-    for line in reqs_file:
-      if not line.startswith("# kudu-python=="):
-        continue
-      expected_version_string = line.split()[1].split("==")[1]
-      break
-    else:
-      raise Exception("Unable to find kudu-python version in requirements file")
-  finally:
-    reqs_file.close()
-  expected_version = [int(v) for v in expected_version_string.split(".")]
-
-  if actual_version and actual_version == expected_version:
-    LOG.debug("Skipping Kudu: Installed %s == required %s"
-        % (actual_version_string, expected_version_string))
-    return
-  LOG.debug("Kudu installation required. Actual version %s. Required version %s.",
-      actual_version, expected_version)
-
   LOG.info("Installing Kudu into the virtualenv")
   # The installation requires that KUDU_HOME/build/latest exists. An empty directory
   # structure will be made to satisfy that. The Kudu client headers and lib will be made
@@ -191,14 +241,16 @@ def install_kudu_client_if_possible():
     artifact_dir = os.path.join(fake_kudu_build_dir, "build", "latest")
     if not os.path.exists(artifact_dir):
       os.makedirs(artifact_dir)
+    cc = select_cc()
+    assert cc is not None
     env = dict(os.environ)
     env["KUDU_HOME"] = fake_kudu_build_dir
     kudu_client_dir = find_kudu_client_install_dir()
     env["CPLUS_INCLUDE_PATH"] = os.path.join(kudu_client_dir, "include")
     env["LIBRARY_PATH"] = os.path.pathsep.join([os.path.join(kudu_client_dir, 'lib'),
                                                 os.path.join(kudu_client_dir, 'lib64')])
-
-    exec_pip_install(["kudu-python==" + expected_version_string], env=env)
+    exec_pip_install(["-r", KUDU_REQS_PATH], cc=cc, env=env)
+    mark_reqs_installed(KUDU_REQS_PATH)
   finally:
     try:
       shutil.rmtree(fake_kudu_build_dir)
@@ -238,31 +290,38 @@ def error_if_kudu_client_not_found(install_dir):
       return
   raise Exception("%s not found at %s" % (kudu_client_lib, lib_dir))
 
-
-def deps_are_installed():
-  if not os.path.exists(INSTALLED_REQS_PATH):
+def mark_reqs_installed(reqs_path):
+  '''Mark that the requirements from the given file are installed by copying it into the root
+  directory of the virtualenv.'''
+  installed_reqs_path = os.path.join(ENV_DIR, os.path.basename(reqs_path))
+  shutil.copyfile(reqs_path, installed_reqs_path)
+
+def reqs_are_installed(reqs_path):
+  '''Check if the requirements from the given file are installed in the virtualenv by
+  looking for a matching requirements file in the root directory of the virtualenv.'''
+  installed_reqs_path = os.path.join(ENV_DIR, os.path.basename(reqs_path))
+  if not os.path.exists(installed_reqs_path):
     return False
-  installed_reqs_file = open(INSTALLED_REQS_PATH)
+  installed_reqs_file = open(installed_reqs_path)
   try:
-    reqs_file = open(REQS_PATH)
+    reqs_file = open(reqs_path)
     try:
       if reqs_file.read() == installed_reqs_file.read():
         return True
       else:
-        LOG.info("Virtualenv upgrade needed")
+        LOG.debug("Virtualenv upgrade needed")
         return False
     finally:
       reqs_file.close()
   finally:
     installed_reqs_file.close()
 
-
 def setup_virtualenv_if_not_exists():
-  if not deps_are_installed():
+  if not reqs_are_installed(REQS_PATH):
     delete_virtualenv_if_exist()
     create_virtualenv()
     install_deps()
-    LOG.info("Virtualenv setup complete")
+    LOG.debug("Virtualenv setup complete")
 
 
 if __name__ == "__main__":
@@ -284,5 +343,8 @@ if __name__ == "__main__":
   logging.basicConfig(level=getattr(logging, options.log_level))
   if options.rebuild:
     delete_virtualenv_if_exist()
+
+  # Complete as many bootstrap steps as possible (see file comment for the steps).
   setup_virtualenv_if_not_exists()
-  install_kudu_client_if_possible()
+  if install_compiled_deps_if_possible():
+    install_kudu_client_if_possible()

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/c8e15e48/infra/python/deps/compiled-requirements.txt
----------------------------------------------------------------------
diff --git a/infra/python/deps/compiled-requirements.txt b/infra/python/deps/compiled-requirements.txt
new file mode 100644
index 0000000..945e3f6
--- /dev/null
+++ b/infra/python/deps/compiled-requirements.txt
@@ -0,0 +1,38 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Requirements that require a C/C++ compiler to build, which may not be available until
+# after the toolchain is bootstrapped. Installed after requirements.txt
+
+argparse == 1.4.0
+Fabric == 1.10.2
+  paramiko == 1.15.2
+    ecdsa == 0.13
+    pycrypto == 2.6.1
+impyla == 0.14.0
+  bitarray == 0.8.1
+  sasl == 0.1.3
+  six == 1.9.0
+  # Thrift usually comes from the thirdparty dir but in case the virtualenv is needed
+  # before thirdparty is built thrift will be installed anyways.
+  thrift == 0.9.0
+  thrift_sasl == 0.1.0
+psutil == 0.7.1
+
+# Required for Kudu:
+  Cython == 0.23.4
+  numpy == 1.10.4

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/c8e15e48/infra/python/deps/download_requirements
----------------------------------------------------------------------
diff --git a/infra/python/deps/download_requirements b/infra/python/deps/download_requirements
index f610cab..a1035b3 100755
--- a/infra/python/deps/download_requirements
+++ b/infra/python/deps/download_requirements
@@ -23,11 +23,6 @@ DIR="$(dirname "$0")"
 
 pushd "$DIR"
 PY26="$(./find_py26.py)"
-# Directly download packages listed in requirements.txt, but don't install them.
+# Directly download packages listed in *requirements.txt, but don't install them.
 "$PY26" pip_download.py
-# For virtualenv, other scripts rely on the .tar.gz package (not a .whl package).
-"$PY26" pip_download.py virtualenv 13.1.0
-# kudu-python is downloaded separately because pip install attempts to execute a
-# setup.py subcommand for kudu-python that can fail even if the download succeeds.
-"$PY26" pip_download.py kudu-python 1.2.0
 popd

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/c8e15e48/infra/python/deps/kudu-requirements.txt
----------------------------------------------------------------------
diff --git a/infra/python/deps/kudu-requirements.txt b/infra/python/deps/kudu-requirements.txt
new file mode 100644
index 0000000..6dd4ada
--- /dev/null
+++ b/infra/python/deps/kudu-requirements.txt
@@ -0,0 +1,22 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# kudu-python depends on the Kudu client library and compilers provided by the toolchain,
+# and also depends on Cython being installed into the virtualenv, so it must be installed
+# after the toolchain is bootstrapped and all requirements in requirements.txt and
+# compiled-requirements.txt are installed into the virtualenv.
+kudu-python==1.2.0

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/c8e15e48/infra/python/deps/pip_download.py
----------------------------------------------------------------------
diff --git a/infra/python/deps/pip_download.py b/infra/python/deps/pip_download.py
index a3c6a09..85def64 100755
--- a/infra/python/deps/pip_download.py
+++ b/infra/python/deps/pip_download.py
@@ -33,6 +33,10 @@ NUM_TRIES = 3
 
 PYPI_MIRROR = os.environ.get("PYPI_MIRROR", "https://pypi.python.org")
 
+# The requirement files that list all of the required packages and versions.
+REQUIREMENTS_FILES = ['requirements.txt', 'compiled-requirements.txt',
+                      'kudu-requirements.txt']
+
 def check_md5sum(filename, expected_md5):
   actual_md5 = md5(open(filename).read()).hexdigest()
   return actual_md5 == expected_md5
@@ -87,10 +91,12 @@ def main():
   if len(sys.argv) > 1:
     _, pkg_name, pkg_version = sys.argv
     download_package(pkg_name, pkg_version)
-  else:
+    return
+
+  for requirements_file in REQUIREMENTS_FILES:
     # If the package name and version are not specified in the command line arguments,
     # download the packages that in requirements.txt.
-    f = open("requirements.txt", 'r')
+    f = open(requirements_file, 'r')
     try:
       # requirements.txt follows the standard pip grammar.
       for line in f:

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/c8e15e48/infra/python/deps/requirements.txt
----------------------------------------------------------------------
diff --git a/infra/python/deps/requirements.txt b/infra/python/deps/requirements.txt
index 1fa5a28..7d9d484 100644
--- a/infra/python/deps/requirements.txt
+++ b/infra/python/deps/requirements.txt
@@ -23,7 +23,6 @@
 # multiple times (though maybe they could be).
 
 allpairs == 2.0.1
-argparse == 1.4.0
 boto3 == 1.2.3
   simplejson == 3.3.0 # For python version 2.6
   botocore == 1.3.30
@@ -34,10 +33,6 @@ boto3 == 1.2.3
 cm-api == 10.0.0
   # Already available as part of python on Linux.
   readline == 6.2.4.1; sys_platform == 'darwin'
-Fabric == 1.10.2
-  paramiko == 1.15.2
-    ecdsa == 0.13
-    pycrypto == 2.6.1
 Flask == 0.10.1
   Jinja2 == 2.8
   MarkupSafe == 0.23
@@ -46,21 +41,12 @@ Flask == 0.10.1
 hdfs == 2.0.2
   docopt == 0.6.2
   execnet == 1.4.0
-impyla == 0.14.0
-  bitarray == 0.8.1
-  sasl == 0.1.3
-  six == 1.9.0
-  # Thrift usually comes from the thirdparty dir but in case the virtualenv is needed
-  # before thirdparty is built thrift will be installed anyways.
-  thrift == 0.9.0
-  thrift_sasl == 0.1.0
 kazoo == 2.2.1
 monkeypatch == 0.1rc3
 ordereddict == 1.1
 pexpect == 3.3
 pg8000 == 1.10.2
 prettytable == 0.7.2
-psutil == 0.7.1
 pyelftools == 0.23
 pyparsing == 2.0.3
 pytest == 2.9.2
@@ -75,17 +61,8 @@ sh == 1.11
 sqlparse == 0.1.15
 texttable == 0.8.3
 
-# kudu-python is needed but cannot be listed as usual. The Kudu client lib (.so file)
-# is needed for compilation/installation but the client lib is provided by the toolchain.
-# The virtualenv may need to be functional even if the toolchain isn't present. The
-# bootstap_virtualenv.py script special-cases kudu-python, the line below is actually
-# functional and determines the expected kudu-python version. The version must be listed
-# in the format below including # and spacing. Keep this formatting! The kudu-python
-# version in download_requirements must be kept in sync with this version.
-# kudu-python==1.2.0
-  Cython == 0.23.4
-  numpy == 1.10.4
-
 # For dev purposes, not used in scripting. Version 1.2.1 is the latest that supports 2.6.
 ipython == 1.2.1
   apipkg == 1.4
+
+virtualenv == 13.1.0
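
The marker-file check that mark_reqs_installed() and reqs_are_installed() introduce in the bootstrap_virtualenv.py diff can be tried in isolation with a sketch like the one below. It is an approximation under assumptions rather than the patch's code: the env_dir parameter is hypothetical (the patch uses the module-level ENV_DIR), and the file handling uses "with" blocks, which the original avoids in order to stay runnable on very old Python versions.

```python
import os
import shutil


def mark_reqs_installed(reqs_path, env_dir):
    """Record that reqs_path was installed by copying it into env_dir.
    env_dir is a hypothetical parameter standing in for the patch's ENV_DIR."""
    shutil.copyfile(reqs_path, os.path.join(env_dir, os.path.basename(reqs_path)))


def reqs_are_installed(reqs_path, env_dir):
    """Return True if env_dir holds a copy of reqs_path with identical contents."""
    marker_path = os.path.join(env_dir, os.path.basename(reqs_path))
    if not os.path.exists(marker_path):
        return False
    with open(reqs_path, "rb") as reqs_file, open(marker_path, "rb") as marker_file:
        return reqs_file.read() == marker_file.read()


if __name__ == "__main__":
    import tempfile

    work_dir = tempfile.mkdtemp()
    try:
        env_dir = os.path.join(work_dir, "env")
        os.mkdir(env_dir)
        reqs_path = os.path.join(work_dir, "requirements.txt")
        with open(reqs_path, "w") as f:
            f.write("allpairs == 2.0.1\n")
        print(reqs_are_installed(reqs_path, env_dir))  # no marker copied yet
        mark_reqs_installed(reqs_path, env_dir)
        print(reqs_are_installed(reqs_path, env_dir))  # marker matches the file
        with open(reqs_path, "a") as f:
            f.write("boto3 == 1.2.3\n")
        print(reqs_are_installed(reqs_path, env_dir))  # file edited after install
    finally:
        shutil.rmtree(work_dir)
```

Because the comparison is on the whole file, any edit to a requirements file, including a comment change, makes the corresponding step run again.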
