[ 
https://issues.apache.org/jira/browse/IMPALA-9371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039634#comment-17039634
 ] 

Tim Armstrong commented on IMPALA-9371:
---------------------------------------

Based on the trial run, I think there's some high impact stuff here for compile 
times. Some of it is just mechanically applying recommendations. E.g. 
timestamp-value.h is included a lot of places and pulls in a lot of large boost 
headers. Here are the recommendations:
{code}
be/src/runtime/timestamp-value.h should add these lines:
#include <bits/types/time_t.h>                               // for time_t
#include <gflags/gflags_declare.h>                           // for DECLARE_...
#include <string.h>                                          // for memcpy
#include <boost/date_time/date.hpp>                          // for date
#include <boost/date_time/gregorian/greg_date.hpp>           // for date
#include <boost/date_time/posix_time/posix_time_config.hpp>  // for time_dur...
#include <boost/date_time/posix_time/ptime.hpp>              // for ptime
#include <boost/date_time/special_defs.hpp>                  // for special_...
#include <boost/date_time/time_duration.hpp>                 // for time_dur...
#include <boost/operators.hpp>                               // for operator>
#include <cstdint>                                           // for int64_t
#include <iosfwd>                                            // for ostream
#include "common/compiler-util.h"                            // for UNLIKELY
#include "common/logging.h"                                  // for COMPACT_...
#include "gen-cpp/common.pb.h"                               // for ColumnVa...

be/src/runtime/timestamp-value.h should remove these lines:
- #include <gflags/gflags.h>  // lines 25-25
- #include <boost/date_time/compiler_config.hpp>  // lines 22-22
- #include <boost/date_time/gregorian/gregorian.hpp>  // lines 23-23
- #include <boost/date_time/local_time/local_time.hpp>  // lines 24-24
- #include "gen-cpp/Data_types.h"  // lines 29-29
- #include "gen-cpp/data_stream_service.pb.h"  // lines 30-30

The full include-list for be/src/runtime/timestamp-value.h:
#include <bits/types/time_t.h>                               // for time_t
#include <gflags/gflags_declare.h>                           // for DECLARE_...
#include <string.h>                                          // for memcpy
#include <boost/date_time/date.hpp>                          // for date
#include <boost/date_time/gregorian/greg_date.hpp>           // for date
#include <boost/date_time/posix_time/posix_time_config.hpp>  // for time_dur...
#include <boost/date_time/posix_time/ptime.hpp>              // for ptime
#include <boost/date_time/special_defs.hpp>                  // for special_...
#include <boost/date_time/time_duration.hpp>                 // for time_dur...
#include <boost/operators.hpp>                               // for operator>
#include <cstdint>                                           // for int64_t
#include <iosfwd>                                            // for ostream
#include <string>                                            // for string
#include "common/compiler-util.h"                            // for UNLIKELY
#include "common/global-types.h"                             // for Timezone
#include "common/logging.h"                                  // for COMPACT_...
#include "gen-cpp/common.pb.h"                               // for ColumnVa...
#include "udf/udf.h"                                         // for Timestam...
#include "util/hash-util.h"                                  // for HashUtil
namespace impala { namespace datetime_parse_util { struct 
DateTimeFormatContext; } }  // lines 42-42
{code}

The output also made some weird dependencies more obvious to me. E.g. 
TmpFileMgr is a lower-level subsystem that the vast majority of compilation 
units never call into. Yet it's included in query-state.h and buffer-pool.h 
because they need access to the nested class definition of FileGroup. If we 
lift FileGroup out of TmpFileMgr, we should be able to forward declare it:
{code}
be/src/runtime/query-state.h should add these lines:
#include <cstdint>                                    // for int64_t, int32_t
#include <string>                                     // for string
#include "common/compiler-util.h"                     // for discard_result
#include "common/logging.h"                           // for DCHECK_GT
#include "common/status.h"                            // for Status, WARN_UN...
#include "gen-cpp/control_service.pb.h"               // for ExecQueryFInsta...
#include "gutil/macros.h"                             // for DISALLOW_COPY_A...
#include "util/spinlock.h"                            // for SpinLock
namespace impala { class DescriptorTbl; }
namespace impala { class PublishFilterParamsPB; }
namespace impala { class RuntimeProfile; }
namespace impala { class TRuntimeProfileForest; }

be/src/runtime/query-state.h should remove these lines:
- #include <mutex>  // lines 22-22
- #include "gen-cpp/data_stream_service.pb.h"  // lines 29-29
- #include "util/container-util.h"  // lines 32-32
- #include "util/uid-util.h"  // lines 34-34
- namespace impala { class ReportExecStatusRequestPB; }  // lines 49-49
- namespace impala { class ThriftSerializer; }  // lines 53-53

The full include-list for be/src/runtime/query-state.h:
#include <cstdint>                                    // for int64_t, int32_t
#include <memory>                                     // for unique_ptr
#include <string>                                     // for string
#include <unordered_map>                              // for unordered_map
#include "common/atomic.h"                            // for AtomicInt32
#include "common/compiler-util.h"                     // for discard_result
#include "common/logging.h"                           // for DCHECK_GT
#include "common/object-pool.h"                       // for ObjectPool
#include "common/status.h"                            // for Status, WARN_UN...
#include "gen-cpp/ImpalaInternalService_types.h"      // for TQueryCtx, TCli...
#include "gen-cpp/Types_types.h"                      // for TUniqueId
#include "gen-cpp/control_service.pb.h"               // for ExecQueryFInsta...
#include "gutil/macros.h"                             // for DISALLOW_COPY_A...
#include "gutil/threading/thread_collision_warner.h"  // for DFAKE_MUTEX
#include "runtime/tmp-file-mgr.h"                     // for TmpFileMgr
#include "util/counting-barrier.h"                    // for CountingBarrier
#include "util/spinlock.h"                            // for SpinLock
namespace impala { class ControlServiceProxy; }  // lines 44-44
namespace impala { class DescriptorTbl; }
namespace impala { class FragmentInstanceState; }  // lines 46-46
namespace impala { class InitialReservations; }  // lines 47-47
namespace impala { class MemTracker; }  // lines 48-48
namespace impala { class PublishFilterParamsPB; }
namespace impala { class ReservationTracker; }  // lines 50-50
namespace impala { class RuntimeFilterBank; }  // lines 45-45
namespace impala { class RuntimeProfile; }
namespace impala { class RuntimeState; }  // lines 51-51
namespace impala { class ScannerMemLimiter; }  // lines 52-52
namespace impala { class TRuntimeProfileForest; }
namespace kudu { namespace rpc { class RpcContext; } }  // lines 38-38
{code}
A similar situation exists with statestore/statestore-subscriber.h because of 
StatestoreSubscriber::TopicDeltaMap being referenced in various headers.

> Investigate include-what-you-use to improve compile times
> ---------------------------------------------------------
>
>                 Key: IMPALA-9371
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9371
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Infrastructure
>            Reporter: Tim Armstrong
>            Priority: Major
>         Attachments: iwyu.sh, iwyu2.txt, iwyu4.txt, iwyu_mappings.imp
>
>
> Impala compile times are longer than necessary because .cc files pull in a 
> lot of headers, especially from the standard C++ library and boost. I looked 
> at the preprocessor output for a couple of files, and it included ~400,000 
> lines of code, almost entirely from headers.
> include-what-you-use is a tool that can identify cases where header includes 
> can be avoided or moved from .h to .cc files to reduce the number of 
> extraneous headers pulled into each compile unit.
> Kudu has some tooling to run IWYU in a dev environment and also in their 
> gerrit precommit.
> I was able to run iwyu against Impala as follows:
> {code:bash}
> # Clone repo and check out branch to match Impala toolchain's clang version
> git clone https://github.com/include-what-you-use/include-what-you-use.git
> cd include-what-you-use
> git checkout clang_5.0
> # Build IWYU against clang from toolchain
> cmake -G "Unix Makefiles" 
> -DIWYU_LLVM_ROOT_PATH=/opt/Impala-Toolchain/llvm-5.0.1-p2/ 
> -DCMAKE_CXX_FLAGS=-D_GLIBCXX_USE_CXX11_ABI=0 ..
> make -j $(nproc)
> IWYU_BUILD_DIR=$(pwd)
> # Run IWYU on the impala codebase.
> cd $IMPALA_HOME
> # Do an ASAN build to force Impala to generate compile_commands.json with 
> compile commands suitable for clang.
> ./buildall.sh -noclean -notests -asan
> # Placeholder for additional arguments to the iwyu tool
> IWYU_ARGS=""
> PATH=$IWYU_BUILD_DIR:$PATH ~/repos/include-what-you-use/iwyu_tool.py -p . -j 
> $(nproc) -- $IWYU_ARGS > iwyu.txt
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to