This is an automated email from the ASF dual-hosted git repository.
joemcdonnell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git
The following commit(s) were added to refs/heads/master by this push:
new 10c19b1a5 IMPALA-11511: Add build options for reducing binary sizes
10c19b1a5 is described below
commit 10c19b1a5730a898e17cc653be6bd19f0dc3340e
Author: Joe McDonnell <[email protected]>
AuthorDate: Fri Sep 9 16:40:05 2022 -0700
IMPALA-11511: Add build options for reducing binary sizes
Impala's build produces dozens of C++ binaries
that link in all Impala libraries. Each binary is
hundreds of megabytes, so a build consumes tens of gigabytes
of disk space. A large proportion of this (~80%) is debug
information, and the amount of debug information grows with
newer versions of GCC such as GCC 10.
This introduces two options for reducing the size
of debug information:
- IMPALA_MINIMAL_DEBUG_INFO=true builds Impala with
minimal debug information (-g1). This contains line tables
and can resolve backtraces, but it does not contain
variable information and restricts further debugging.
- IMPALA_COMPRESSED_DEBUG_INFO=true builds Impala with
compressed debug information (-gz). This does not change
the debug information included, but the compression saves
significant disk space. gdb is known to work with
compressed debug information, but other tools may not
support it. The dump_breakpad_symbols.py script has been
adjusted to handle these binaries.
These are disabled by default.
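As a rough illustration of how the two options map to compiler flags, here is a minimal Python sketch. It only mirrors the environment-variable checks described above; the function name extra_debug_flags is hypothetical and is not part of this change, and the real logic lives in be/CMakeLists.txt.

```python
import os


def extra_debug_flags(env=None):
    """Return the extra compiler flags implied by the two options.

    Hypothetical helper for illustration only. Both options default to
    "false", matching the defaults exported by bin/impala-config.sh.
    """
    env = os.environ if env is None else env
    flags = []
    # Minimal debug info: line tables only, no variable information.
    if env.get("IMPALA_MINIMAL_DEBUG_INFO", "false") == "true":
        flags.append("-g1")
    # Compressed debug info: same content, compressed sections.
    if env.get("IMPALA_COMPRESSED_DEBUG_INFO", "false") == "true":
        flags.append("-gz")
    return flags
```

The two options are independent, so setting both yields both flags.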
Release impalad binary sizes:
Configuration                  | Size (bytes) | % reduction over base
Base                           | 707834808    | N/A
Stripped                       | 83351664     | 88%
Minimal debuginfo              | 215924096    | 69%
Compressed debuginfo           | 301619286    | 57%
Minimal + compressed debuginfo | 120886705    | 83%
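The reduction percentages can be reproduced from the byte counts in the table. The following is a small sketch; the sizes are copied from the table, while the names SIZES and percent_reduction are only illustrative.

```python
# Release impalad sizes in bytes, copied from the table above.
BASE_SIZE = 707834808
SIZES = {
    "Stripped": 83351664,
    "Minimal debuginfo": 215924096,
    "Compressed debuginfo": 301619286,
    "Minimal + compressed debuginfo": 120886705,
}


def percent_reduction(size, base=BASE_SIZE):
    """Percentage by which `size` is smaller than `base`, rounded to an integer."""
    return round(100.0 * (1.0 - size / base))


if __name__ == "__main__":
    for name, size in SIZES.items():
        print("%s: %d%%" % (name, percent_reduction(size)))
```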
Testing:
- Generated minidumps and resolved them
- Verified this is disabled by default
Change-Id: I04a20258a86053d8f3972b9c7c81cd5bec1bbb66
Reviewed-on: http://gerrit.cloudera.org:8080/18962
Reviewed-by: Michael Smith <[email protected]>
Reviewed-by: Wenzhe Zhou <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
---
be/CMakeLists.txt | 27 +++++++++++++++++++
bin/dump_breakpad_symbols.py | 63 ++++++++++++++++++++++++++++++++++++++++----
bin/impala-config.sh | 23 ++++++++++++++++
3 files changed, 108 insertions(+), 5 deletions(-)
diff --git a/be/CMakeLists.txt b/be/CMakeLists.txt
index 77d5c3050..ffd050b2f 100644
--- a/be/CMakeLists.txt
+++ b/be/CMakeLists.txt
@@ -221,6 +221,33 @@ SET(CMAKE_CXX_FLAGS "${CXX_COMMON_FLAGS} ${CMAKE_CXX_FLAGS}")
SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fverbose-asm")
SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${LLVM_CFLAGS}")
+# The IMPALA_MINIMAL_DEBUG_INFO option saves disk space by reducing the debug info
+# in binaries to the minimal level that can do backtraces. The "-g1" option
+# keeps line number tables, but it does not keep variable information. This
+# can reduce the size of binaries by >60%. This is appended to the end of arguments
+# so that it overrides other "-g" arguments.
+if ($ENV{IMPALA_MINIMAL_DEBUG_INFO} STREQUAL "true")
+ SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -g1")
+ # The choice of CMAKE_BUILD_TYPE specifies a set of flags that are added
+ # after the flags in CMAKE_CXX_FLAGS. CMAKE_BUILD_TYPE=Debug adds "-g", which
+ # overrides our "-g1" because it is later in the argument list. To fix this,
+ # this overrides CMake's flags for CMAKE_BUILD_TYPE=Debug to use "-g1" rather
+ # than "-g". CMAKE_BUILD_TYPE=Release and other CMAKE_BUILD_TYPEs that we use
+ # don't include a "-g" flag, so they don't need similar treatment.
+ SET(CMAKE_CXX_FLAGS_DEBUG "-g1")
+endif()
+
+# The IMPALA_COMPRESSED_DEBUG_INFO option saves disk space by compressing the
+# debug info in the executable. This can reduce the size of binaries by >50%
+# without changing the amount of debug information. gdb is known to work
+# with compressed debug info, but other tools may not know how to use it.
+# TODO: The current version of Clang does not handle this flag correctly and
+# simply produces binaries with uncompressed debug info. This needs further
+# debugging.
+if ($ENV{IMPALA_COMPRESSED_DEBUG_INFO} STREQUAL "true")
+ SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -gz")
+endif()
+
# Use ccache when found and not explicitly disabled by setting the DISABLE_CCACHE envvar.
find_program(CCACHE ccache)
set(RULE_LAUNCH_PREFIX)
diff --git a/bin/dump_breakpad_symbols.py b/bin/dump_breakpad_symbols.py
index 3a6f02e21..b4edff494 100755
--- a/bin/dump_breakpad_symbols.py
+++ b/bin/dump_breakpad_symbols.py
@@ -97,6 +97,27 @@ def find_dump_syms_binary():
return ''
+def find_objcopy_binary():
+ """Locate the 'objcopy' binary from Binutils.
+
+ We try to locate the package in the Impala toolchain folder.
+ TODO: Fall back to finding objcopy in the system path.
+ """
+ toolchain_packages_home = os.environ.get('IMPALA_TOOLCHAIN_PACKAGES_HOME')
+ if toolchain_packages_home:
+ if not os.path.isdir(toolchain_packages_home):
+ die('Could not find toolchain packages directory')
+ binutils_version = os.environ.get('IMPALA_BINUTILS_VERSION')
+ if not binutils_version:
+ die('Could not determine binutils version from toolchain')
+ binutils_dir = 'binutils-%s' % binutils_version
+ objcopy = os.path.join(toolchain_packages_home, binutils_dir, 'bin', 'objcopy')
+ if not os.path.isfile(objcopy):
+ die('Could not find objcopy executable at %s' % objcopy)
+ return objcopy
+ return ''
+
+
def parse_args():
"""Parse command line arguments and perform sanity checks."""
parser = ArgumentParser()
@@ -114,6 +135,7 @@ def parse_args():
to process, use with -s""")
parser.add_argument('-s', '--symbol_pkg', '--debuginfo_rpm', help="""RPM/DEB file
containing the debug symbols matching the binaries in -r""")
+ parser.add_argument('--objcopy', help='Path to the objcopy binary from Binutils')
args = parser.parse_args()
# Post processing checks
@@ -237,7 +259,7 @@ def enumerate_binaries(args):
die('No input method provided')
-def process_binary(dump_syms, binary, out_dir):
+def process_binary(dump_syms, objcopy, binary, out_dir):
"""Dump symbols of a single binary file and move the result.
Symbols will be extracted to a temporary file and moved into place afterwards. Required
@@ -249,14 +271,40 @@ def process_binary(dump_syms, binary, out_dir):
# destroyed.
tmp_fd, tmp_file = tempfile.mkstemp(dir=out_dir, suffix='.sym')
try:
- # Run dump_syms on the binary.
- args = [dump_syms, binary.path]
+ # Create a temporary directory used for decompressing debug info
+ tempdir = tempfile.mkdtemp()
+
+ # Binaries can contain compressed debug symbols. Breakpad currently
+ # does not support dumping symbols for binaries with compressed debug
+ # symbols.
+ #
+ # As a workaround, this uses objcopy to create a copy of the binary with
+ # the debug symbols decompressed. If the debug symbols are not compressed
+ # in the original binary, objcopy simply makes a copy of the binary.
+ # Breakpad is able to read symbols from the decompressed binary, and
+ # those symbols work correctly in resolving a minidump from the original
+ # compressed binary.
+ # TODO: In theory, this could work with the binary.debug_path.
+ binary_basename = os.path.basename(binary.path)
+ decompressed_binary = os.path.join(tempdir, binary_basename)
+ objcopy_retcode = subprocess.call([objcopy, "--decompress-debug-sections",
+ binary.path, decompressed_binary])
+
+ # Run dump_syms on the binary
+ # If objcopy failed for some reason, fall back to running dump_syms
+ # directly on the original binary. This is unlikely to work, but it is a way of
+ # guaranteeing that objcopy is not the problem.
+ args = [dump_syms, decompressed_binary]
+ if objcopy_retcode != 0:
+ sys.stderr.write('objcopy failed. Trying to run dump_syms directly.\n')
+ args = [dump_syms, binary.path]
+
if binary.debug_path:
args.append(binary.debug_path)
proc = subprocess.Popen(args, stdout=os.fdopen(tmp_fd, 'wb'),
stderr=subprocess.PIPE)
_, stderr = proc.communicate()
if proc.returncode != 0:
- sys.stderr.write('Failed to dump symbols from %s, return code %s\n' %
+ sys.stderr.write('dump_syms: Failed to dump symbols from %s, return code %s\n' %
(binary.path, proc.returncode))
sys.stderr.write(stderr)
os.remove(tmp_file)
@@ -277,6 +325,9 @@ def process_binary(dump_syms, binary, out_dir):
except EnvironmentError:
pass
raise e
+ finally:
+ # Cleanup temporary directory
+ shutil.rmtree(tempdir)
return True
@@ -285,10 +336,12 @@ def main():
args = parse_args()
dump_syms = args.dump_syms or find_dump_syms_binary()
assert dump_syms
+ objcopy = args.objcopy or find_objcopy_binary()
+ assert objcopy
status = 0
ensure_dir_exists(args.dest_dir)
for binary in enumerate_binaries(args):
- if not process_binary(dump_syms, binary, args.dest_dir):
+ if not process_binary(dump_syms, objcopy, binary, args.dest_dir):
status = 1
sys.exit(status)
diff --git a/bin/impala-config.sh b/bin/impala-config.sh
index 34034137c..7d32d770e 100755
--- a/bin/impala-config.sh
+++ b/bin/impala-config.sh
@@ -408,6 +408,29 @@ export IMPALA_MAVEN_OPTIONS=${IMPALA_MAVEN_OPTIONS-}
# If enabled, debug symbols are added to cross-compiled IR.
export ENABLE_IMPALA_IR_DEBUG_INFO=${ENABLE_IMPALA_IR_DEBUG_INFO-false}
+# Impala has dozens of binaries that link in all the Impala libraries.
+# Each binary is hundreds of megabytes, and they end up taking 10s of GBs of
+# disk space for a developer environment. A large amount of the binary
+# size is due to debug information.
+#
+# These are two options for reducing the binary size and disk space
+# usage.
+# - IMPALA_MINIMAL_DEBUG_INFO=true changes the build to produce only
+# minimal debuginfo (i.e. -g1). This has line tables and can do backtraces,
+# but it doesn't include variable information and limits further
+# debuggability. This option reduces the size of binaries by 60+%.
+# - IMPALA_COMPRESSED_DEBUG_INFO=true changes the build to compress the
+# debug info with gzip. This significantly reduces the size of the
+# binary without changing the quantity of debug information. The catch
+# is that tools need to support it. gdb is known to support it and
+# the Breakpad scripts have been modified to handle it, but there may
+# be other tools that do not know how to use it. This reduces the size
+# of binaries by 50+%.
+# Both of these are disabled by default.
+# TODO: Explore enabling IMPALA_COMPRESSED_DEBUG_INFO by default.
+export IMPALA_MINIMAL_DEBUG_INFO=${IMPALA_MINIMAL_DEBUG_INFO-false}
+export IMPALA_COMPRESSED_DEBUG_INFO=${IMPALA_COMPRESSED_DEBUG_INFO-false}
+
# Download and use the CDH components from S3. It can be useful to set this to false if
# building against a custom local build using HIVE_SRC_DIR_OVERRIDE,
# HADOOP_INCLUDE_DIR_OVERRIDE, and HADOOP_LIB_DIR_OVERRIDE.