[
https://issues.apache.org/jira/browse/IMPALA-9371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039634#comment-17039634
]
Tim Armstrong commented on IMPALA-9371:
---------------------------------------
Based on the trial run, I think there's some high impact stuff here for compile
times. Some of it is just mechanically applying recommendations. E.g.
timestamp-value.h is included a lot of places and pulls in a lot of large boost
headers. Here are the recommendations:
{code}
be/src/runtime/timestamp-value.h should add these lines:
#include <bits/types/time_t.h> // for time_t
#include <gflags/gflags_declare.h> // for DECLARE_...
#include <string.h> // for memcpy
#include <boost/date_time/date.hpp> // for date
#include <boost/date_time/gregorian/greg_date.hpp> // for date
#include <boost/date_time/posix_time/posix_time_config.hpp> // for time_dur...
#include <boost/date_time/posix_time/ptime.hpp> // for ptime
#include <boost/date_time/special_defs.hpp> // for special_...
#include <boost/date_time/time_duration.hpp> // for time_dur...
#include <boost/operators.hpp> // for operator>
#include <cstdint> // for int64_t
#include <iosfwd> // for ostream
#include "common/compiler-util.h" // for UNLIKELY
#include "common/logging.h" // for COMPACT_...
#include "gen-cpp/common.pb.h" // for ColumnVa...
be/src/runtime/timestamp-value.h should remove these lines:
- #include <gflags/gflags.h> // lines 25-25
- #include <boost/date_time/compiler_config.hpp> // lines 22-22
- #include <boost/date_time/gregorian/gregorian.hpp> // lines 23-23
- #include <boost/date_time/local_time/local_time.hpp> // lines 24-24
- #include "gen-cpp/Data_types.h" // lines 29-29
- #include "gen-cpp/data_stream_service.pb.h" // lines 30-30
The full include-list for be/src/runtime/timestamp-value.h:
#include <bits/types/time_t.h> // for time_t
#include <gflags/gflags_declare.h> // for DECLARE_...
#include <string.h> // for memcpy
#include <boost/date_time/date.hpp> // for date
#include <boost/date_time/gregorian/greg_date.hpp> // for date
#include <boost/date_time/posix_time/posix_time_config.hpp> // for time_dur...
#include <boost/date_time/posix_time/ptime.hpp> // for ptime
#include <boost/date_time/special_defs.hpp> // for special_...
#include <boost/date_time/time_duration.hpp> // for time_dur...
#include <boost/operators.hpp> // for operator>
#include <cstdint> // for int64_t
#include <iosfwd> // for ostream
#include <string> // for string
#include "common/compiler-util.h" // for UNLIKELY
#include "common/global-types.h" // for Timezone
#include "common/logging.h" // for COMPACT_...
#include "gen-cpp/common.pb.h" // for ColumnVa...
#include "udf/udf.h" // for Timestam...
#include "util/hash-util.h" // for HashUtil
namespace impala { namespace datetime_parse_util { struct
DateTimeFormatContext; } } // lines 42-42
{code}
The output also made some weird dependencies more obvious to me. E.g.
TmpFileMgr is a lower-level subsystem that the vast majority of compilation
units never call into. Yet it's included in query-state.h and buffer-pool.h
because they need access to the nested class definition of FileGroup. If we
lift FileGroup out of TmpFileMgr, we should be able to forward declare it:
{code}
be/src/runtime/query-state.h should add these lines:
#include <cstdint> // for int64_t, int32_t
#include <string> // for string
#include "common/compiler-util.h" // for discard_result
#include "common/logging.h" // for DCHECK_GT
#include "common/status.h" // for Status, WARN_UN...
#include "gen-cpp/control_service.pb.h" // for ExecQueryFInsta...
#include "gutil/macros.h" // for DISALLOW_COPY_A...
#include "util/spinlock.h" // for SpinLock
namespace impala { class DescriptorTbl; }
namespace impala { class PublishFilterParamsPB; }
namespace impala { class RuntimeProfile; }
namespace impala { class TRuntimeProfileForest; }
be/src/runtime/query-state.h should remove these lines:
- #include <mutex> // lines 22-22
- #include "gen-cpp/data_stream_service.pb.h" // lines 29-29
- #include "util/container-util.h" // lines 32-32
- #include "util/uid-util.h" // lines 34-34
- namespace impala { class ReportExecStatusRequestPB; } // lines 49-49
- namespace impala { class ThriftSerializer; } // lines 53-53
The full include-list for be/src/runtime/query-state.h:
#include <cstdint> // for int64_t, int32_t
#include <memory> // for unique_ptr
#include <string> // for string
#include <unordered_map> // for unordered_map
#include "common/atomic.h" // for AtomicInt32
#include "common/compiler-util.h" // for discard_result
#include "common/logging.h" // for DCHECK_GT
#include "common/object-pool.h" // for ObjectPool
#include "common/status.h" // for Status, WARN_UN...
#include "gen-cpp/ImpalaInternalService_types.h" // for TQueryCtx, TCli...
#include "gen-cpp/Types_types.h" // for TUniqueId
#include "gen-cpp/control_service.pb.h" // for ExecQueryFInsta...
#include "gutil/macros.h" // for DISALLOW_COPY_A...
#include "gutil/threading/thread_collision_warner.h" // for DFAKE_MUTEX
#include "runtime/tmp-file-mgr.h" // for TmpFileMgr
#include "util/counting-barrier.h" // for CountingBarrier
#include "util/spinlock.h" // for SpinLock
namespace impala { class ControlServiceProxy; } // lines 44-44
namespace impala { class DescriptorTbl; }
namespace impala { class FragmentInstanceState; } // lines 46-46
namespace impala { class InitialReservations; } // lines 47-47
namespace impala { class MemTracker; } // lines 48-48
namespace impala { class PublishFilterParamsPB; }
namespace impala { class ReservationTracker; } // lines 50-50
namespace impala { class RuntimeFilterBank; } // lines 45-45
namespace impala { class RuntimeProfile; }
namespace impala { class RuntimeState; } // lines 51-51
namespace impala { class ScannerMemLimiter; } // lines 52-52
namespace impala { class TRuntimeProfileForest; }
namespace kudu { namespace rpc { class RpcContext; } } // lines 38-38
{code}
A similar situation exists with statestore/statestore-subscriber.h because of
StatestoreSubscriber::TopicDeltaMap being referenced in various headers.
> Investigate include-what-you-use to improve compile times
> ---------------------------------------------------------
>
> Key: IMPALA-9371
> URL: https://issues.apache.org/jira/browse/IMPALA-9371
> Project: IMPALA
> Issue Type: Improvement
> Components: Infrastructure
> Reporter: Tim Armstrong
> Priority: Major
> Attachments: iwyu.sh, iwyu2.txt, iwyu4.txt, iwyu_mappings.imp
>
>
> Impala compile times are longer than necessary because .cc files pull in a
> lot of headers, especially from the standard C++ library and boost. I looked
> at the preprocessor output for a couple of files, and it included ~400,000
> lines of code, almost entirely from headers.
> include-what-you-use is a tool that can identify cases where header includes
> can be avoided or moved from .h to .cc files to reduce the number of
> extraneous headers pulled into each compile unit.
> Kudu has some tooling to run IWYU in a dev environment and also in their
> gerrit precommit.
> I was able to run iwyu against Impala as follows:
> {code:bash}
> # Clone repo and check out branch to match Impala toolchain's clang version
> git clone https://github.com/include-what-you-use/include-what-you-use.git
> cd include-what-you-use
> git checkout clang_5.0
> # Build IWYU against clang from toolchain
> cmake -G "Unix Makefiles"
> -DIWYU_LLVM_ROOT_PATH=/opt/Impala-Toolchain/llvm-5.0.1-p2/
> -DCMAKE_CXX_FLAGS=-D_GLIBCXX_USE_CXX11_ABI=0 ..
> make -j $(nproc)
> IWYU_BUILD_DIR=$(pwd)
> # Run IWYU on the impala codebase.
> cd $IMPALA_HOME
> # Do an ASAN build to force Impala to generate compile_commands.json with
> compile commands suitable for clang.
> ./buildall.sh -noclean -notests -asan
> # Placeholder for additional arguments to the iwyu tool
> IWYU_ARGS=""
> PATH=$IWYU_BUILD_DIR:$PATH ~/repos/include-what-you-use/iwyu_tool.py -p . -j
> $(nproc) -- $IWYU_ARGS > iwyu.txt
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]