Re: [lldb-dev] [cfe-dev] Plans for module debugging

Adrian Prantl Mon, 01 Dec 2014 11:09:06 -0800

> On Dec 1, 2014, at 10:50 AM, Ben Langmuir <blangm...@apple.com> wrote:
> 
> 
>> On Dec 1, 2014, at 10:41 AM, Adrian Prantl <apra...@apple.com> wrote:
>> 
>>> 
>>> On Dec 1, 2014, at 10:32 AM, Adrian Prantl <apra...@apple.com> wrote:
>>> 
>>> 
>>>> On Dec 1, 2014, at 10:27 AM, Ben Langmuir <blangm...@apple.com> wrote:
>>>> 
>>>> 
>>>>> On Nov 25, 2014, at 5:25 PM, Adrian Prantl <apra...@apple.com> wrote:
>>>>> 
>>>>>> 
>>>>>> On Nov 24, 2014, at 4:55 PM, Richard Smith <rich...@metafoo.co.uk> wrote:
>>>>>> 
>>>>>> On Fri, Nov 21, 2014 at 5:52 PM, Adrian Prantl <apra...@apple.com> wrote:
>>>>>> Plans for module debugging
>>>>>> ==========================
>>>>>> 
>>>>>> I recently had a chat with Eric Christopher and David Blaikie to discuss 
>>>>>> ideas for debug info for Clang modules and this is what we came up with.
>>>>>> 
>>>>>> Goals
>>>>>> -----
>>>>>> 
>>>>>> Clang modules [1] (and their siblings, C++ modules and precompiled
>>>>>> header files) are a method for improving compile time by making the
>>>>>> serialized AST for commonly used header files directly available to the
>>>>>> compiler.
>>>>>> 
>>>>>> Currently debug info is totally oblivious to this: when the developer
>>>>>> compiles a file that uses a type from a module, clang simply emits a 
>>>>>> copy of the full definition (some exceptions apply for C++) of this type 
>>>>>> in DWARF into the debug info section of the resulting object file. 
>>>>>> That's a lot of copies.
>>>>>> 
>>>>>> The key idea is to emit DWARF for types defined in modules only once, 
>>>>>> and then only emit references to these types in all the individual 
>>>>>> compile units that import this module. We are going to build on the 
>>>>>> split DWARF and type unit facilities provided by DWARF for this. DWARF 
>>>>>> consumers can follow the type references into the module debug info
>>>>>> section, much as they resolve types in external type units today.
>>>>>> Additionally, the format will allow consumers that support clang modules 
>>>>>> natively (such as LLDB) to directly look up types in the module, without 
>>>>>> having to go through the usual translation from AST to DWARF and back to 
>>>>>> AST.
>>>>>> 
>>>>>> The primary benefit from doing all this is performance. This change is 
>>>>>> expected to reduce the size of the debug info in object files 
>>>>>> significantly by
>>>>>> - emitting only references to the full types and thus
>>>>>> - implicitly uniquing types that are defined in modules.
>>>>>> The smaller object files will result in faster compile times and faster 
>>>>>> llvm::Module load times when doing LTO. The type uniquing will also 
>>>>>> result in significantly smaller debug info for the finished executables, 
>>>>>> especially for C and Objective-C, which do not support ODR-based type 
>>>>>> uniquing. This comes at the price of longer initial module build times, 
>>>>>> as debug info is emitted alongside the module.
>>>>>> 
>>>>>> Design
>>>>>> ------
>>>>>> 
>>>>>> Clang modules are designed to be ephemeral build artifacts that live in 
>>>>>> a shared module cache. Compiling a source file that imports `MyModule`
>>>>>> results in `MyModule.pcm` being generated in the module cache directory,
>>>>>> which contains the serialized AST of the declarations found in the 
>>>>>> header files that comprise the module.
>>>>>> 
>>>>>> We will change the binary clang module format to become a container
>>>>>> (ELF, Mach-O, depending on the platform). Inside the container there 
>>>>>> will be multiple sections: one containing the serialized AST, and ones 
>>>>>> containing DWARF5 split debug type information for all types defined in 
>>>>>> the module that can be encoded in DWARF. By virtue of using type units, 
>>>>>> each type is emitted into its own type unit which can be identified via 
>>>>>> a unique type signature. DWARF consumers can use the type signatures to 
>>>>>> look up type definitions in the module debug info section. For 
>>>>>> module-aware consumers (LLDB), we will add an index that maps type 
>>>>>> signatures directly to an offset in the AST section.
>>>>>> 
>>>>>> For an object file that was built using modules, we need to record the 
>>>>>> fact that a module has been imported. To this end, we add a 
>>>>>> DW_TAG_compile_unit into a COMDAT .debug_info.dwo section that 
>>>>>> references the split DWARF inside the module. Similar to split DWARF 
>>>>>> objects, the module will be identified by its filename and a checksum. 
>>>>>> The imported unit also contains a couple of extra attributes holding all 
>>>>>> the information necessary to recreate the module in case the module 
>>>>>> cache has been flushed.
>>>>>> 
>>>>>> How does the debugging experience work in this case? When do you trigger 
>>>>>> the (possibly-lengthy) rebuild of the source in order to recreate the 
>>>>>> DWARF for the module (is it possible to delay that until the information 
>>>>>> is needed)?
>>>>> 
>>>>> The module debugging scenario is primarily aimed at providing a 
>>>>> better/faster edit-compile-debug cycle. In this scenario, the module 
>>>>> would most likely still be in the cache. In a case where the binary was
>>>>> built so long ago that the module cache has since been flushed, it is
>>>>> generally more likely that the user also used a DWARF linking step (such
>>>>> as dsymutil on Darwin, and maybe dwz on Linux?) because they did a
>>>>> release/archive build, which would just copy the DWARF out of the module
>>>>> and store it alongside the binary. For this reason I’m not very concerned
>>>>> about the time necessary for rebuilding the module. But this is all very 
>>>>> platform-specific, and different platforms may need different defaults.
>>>> 
>>>> This description is in terms of building a module that has gone missing, 
>>>> but just to be clear: a modules-aware debugger probably also needs to 
>>>> rebuild modules that have gone out of date, such as when one of their 
>>>> headers is modified.
>>> 
>>> In the case where the module is out of date, the debugger should probably
>>> fall back to the DWARF types, because it cannot guarantee that the 
>>> modifications to the header files did not change the types it wants to look 
>>> up.
>> 
>> Sorry, I just realized that this doesn’t make any sense if the DWARF is 
>> stored in the module. The behavior should be:
>> 1. If the module is missing, recreate the module.
>> 2. If the module signature does not match the signature in the .o file, 
>> either print a large warning that types from that module may be bogus, or 
>> categorically refuse to use them.
> 
> Maybe this is described elsewhere, but what is the “signature” being used 
> here?  Assuming it depends on the detailed contents of the serialized AST: 
> currently ASTWriter output is nondeterministic and things like the ID#s for 
> identifiers, types, etc. will change every time you build the module; until 
> that gets fixed, we would always hit case (2).


I was actually hoping that we could rely on deterministic output from clang. If 
it is infeasible to make ASTWriter output deterministic, we can fall back to
something like the DWARF dwo_id signature here.
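
As a rough sketch of the policy from the quoted list (recreate a missing
module; warn about or refuse a module whose signature does not match the one
in the .o file), assuming the signature is a plain integer in the style of a
dwo_id. The names ModuleRef, Action, and decide are hypothetical and exist in
neither clang nor LLDB:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    USE_MODULE = "use types from the cached module"
    REBUILD = "module missing: recreate it"
    REJECT = "signature mismatch: warn loudly or refuse"


@dataclass(frozen=True)
class ModuleRef:
    """What the .o file records about an imported module (hypothetical layout)."""
    dwo_name: str  # path of the module in the module cache
    dwo_id: int    # signature recorded when the .o file was built


def decide(ref: ModuleRef, cached_id: Optional[int]) -> Action:
    """Pick an action given the signature of the module found in the cache.

    cached_id is None when no module exists at ref.dwo_name.
    """
    if cached_id is None:
        return Action.REBUILD      # case 1: the module is missing
    if cached_id != ref.dwo_id:
        return Action.REJECT       # case 2: the signatures disagree
    return Action.USE_MODULE       # signatures match


ref = ModuleRef("/path/to/module-cache/MyModule.pcm", 0x1234)
print(decide(ref, None).name)
print(decide(ref, 0x9999).name)
print(decide(ref, 0x1234).name)
```

With nondeterministic ASTWriter output and a signature derived from the
serialized AST, a rebuilt module would always land in the REJECT branch, which
is the concern raised above.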

-- adrian

> 
>> 
>> For long-term debugging users are expected to use a DWARF linker (dsymutil, 
>> dwz), which archives all types in a future-proof format (DWARF).
>> 
>> -- adrian
>> 
>>> 
>>>> 
>>>>> Delaying the module DWARF output until needed (maybe even by the 
>>>>> debugger!) is an interesting idea. We should definitely measure how 
>>>>> expensive it is to emit DWARF for an entire module of types to see
>>>>> if this is worthwhile.
>>>>> 
>>>>>> How much knowledge does the debugger have/need of Clang's modules to do 
>>>>>> this? Are we just embedding an arbitrary command that can be run to 
>>>>>> rebuild the .dwo if it's missing? And if so, how do we make that safe 
>>>>>> when (say) root attaches a debugger to an arbitrary process?
>>>>> 
>>>>> I think it is reasonable to assume that a consumer that can make use of 
>>>>> clang modules also knows how to rebuild clang modules, which is why the 
>>>>> example only contained the name of the module, sysroot, include path, and 
>>>>> defines; not an arbitrary command. On platforms where the debugger does
>>>>> not understand clang modules, the whole problem can be dodged by treating 
>>>>> the modules as explicit build artifacts.
>>>> 
>>>> You are probably already aware, but you will need a bunch more information 
>>>> (language options, target options, header search options) to rebuild a 
>>>> module.
>>> 
>>> Thanks, language options and target options were absent from the list 
>>> previously!
>>> 
>>> -- adrian
>>>> 
>>>>> 
>>>>>> 
>>>>>> Platforms that treat modules as an explicit build artifact do not have 
>>>>>> this problem. In the .debug_info section all types that are defined in 
>>>>>> the module are referenced via their unique type signature using 
>>>>>> DW_FORM_ref_sig8, just as they would be if these were types from a
>>>>>> regular DWARF type unit.
>>>>>> 
>>>>>> Example
>>>>>> -------
>>>>>> 
>>>>>> Let's say we have a module `MyModule` that defines a type `MyStruct`::
>>>>>>  $ cat foo.c
>>>>>>  #include <MyModule.h>
>>>>>>  MyStruct x;
>>>>>> 
>>>>>> when compiling `foo.c` like this::
>>>>>>  clang -fmodules -gmodules foo.c -c
>>>>>> 
>>>>>> clang produces `foo.o` and an ELF or Mach-O container for the module::
>>>>>>  /path/to/module-cache/MyModule.pcm
>>>>>> 
>>>>>> In the module container, we have a section for the serialized AST and
>>>>>> split DWARF sections for the debug type info. The exact format is likely
>>>>>> still going to evolve a little, but this should give a rough idea::
>>>>>> 
>>>>>>  MyModule.pcm:
>>>>>>   .debug_info.dwo:
>>>>>>     DW_TAG_compile_unit
>>>>>>       DW_AT_dwo_name ("/path/to/MyModule.pcm")
>>>>>>       DW_AT_dwo_id   ([unique AST signature])
>>>>>> 
>>>>>>     DW_TAG_type_unit ([hash for MyStruct])
>>>>>>        DW_TAG_structure_type
>>>>>>           DW_AT_signature ([hash for MyStruct])
>>>>>>           DW_AT_name "MyStruct"
>>>>>>           ...
>>>>>> 
>>>>>>   .debug_abbrev.dwo:
>>>>>>     // abbrevs referenced by .debug_info.dwo
>>>>>>   .debug_line.dwo:
>>>>>>     // filenames referenced by .debug_info.dwo
>>>>>>   .debug_str.dwo:
>>>>>>     // strings referenced by .debug_info.dwo
>>>>>> 
>>>>>>   .ast
>>>>>>     // Index at the top of the AST section sorted by hash value.
>>>>>>     [hash for MyStruct] -> [offset for MyStruct in this section]
>>>>>>     ...
>>>>>>     // Serialized AST follows
>>>>>>     ...
>>>>>> 
>>>>>> The debug info in foo.o will look like this::
>>>>>> 
>>>>>>  .debug_info.dwo
>>>>>>    DW_TAG_compile_unit
>>>>>>       // For DWARF consumers
>>>>>>       DW_AT_dwo_name ("/path/to/module-cache/MyModule.pcm")
>>>>>>       DW_AT_dwo_id   ([unique AST signature])
>>>>>> 
>>>>>>       // For LLDB / dsymutil so they can recreate the module
>>>>>>       DW_AT_name "MyModule"
>>>>>>       DW_AT_LLVM_system_root "/"
>>>>>>       DW_AT_LLVM_preprocessor_defines  "-DNDEBUG"
>>>>>>       DW_AT_LLVM_include_path "/path/to/MyModule.map"
>>>>>> 
>>>>>>  .debug_info
>>>>>>    DW_TAG_compile_unit
>>>>>>      DW_TAG_variable
>>>>>>        DW_AT_name "x"
>>>>>>        DW_AT_type (DW_FORM_ref_sig8) ([hash for MyStruct])
>>>>>> 
>>>>>> 
>>>>>> Type signatures
>>>>>> ---------------
>>>>>> 
>>>>>> We are going to deviate from the DWARF spec by using a more efficient 
>>>>>> hashing function that uses the type's unique mangled name and the name 
>>>>>> of the module as input.
>>>>>> 
>>>>>> Why do you need/want the name of the module here? Modules are not a 
>>>>>> namespacing mechanism. How would you compute this name when the same 
>>>>>> type is defined in multiple imported modules?
>>>>> 
>>>>> Great point! I’m mostly concerned about non-ODR languages ...
>>>>>> 
>>>>>> For languages that do not have mangled type names or an ODR,
>>>>>> 
>>>>>> The people working on C modules have expressed an intent to apply the 
>>>>>> ODR there too, so it's not clear that Clang modules will support any 
>>>>>> such language in the longer term.
>>>>> 
>>>>> ... and this may be the answer to the question!
>>>>> 
>>>>> +Doug: do Objective-C modules have an ODR?
>>>>> 
>>>>>> 
>>>>>> we will use the unique identifiers produced by the clang indexer (USRs)
>>>>>> as input instead.
>>>>>> 
>>>>>> Extension: Replacing type units with a more efficient storage format
>>>>>> --------------------------------------------------------------------
>>>>>> 
>>>>>> As an extension to this proposal, we are thinking of replacing the type 
>>>>>> units within the module debug info with a more efficient format: Instead 
>>>>>> of emitting each type into its own type unit (complete with its entire 
>>>>>> declcontext), it would be much more efficient to emit one large bag
>>>>>> of DWARF together with an index that maps hash values (type signatures) 
>>>>>> to DIE offsets.
>>>>>> 
>>>>>> Next steps
>>>>>> ----------
>>>>>> 
>>>>>> In order to implement this, the next steps would be as follows:
>>>>>> 1. Change the clang module format to be an ELF/Mach-O container.
>>>>>> 2. Teach clang to emit debug info for module types (e.g., by passing an 
>>>>>> empty compile unit with retained types to LLVM) into the module 
>>>>>> container.
>>>>>> 3a. Add a -gmodules switch to clang that triggers the emission of type 
>>>>>> signatures for types coming from a module.
>>>>>> 
>>>>>> Can you clarify what this flag would do? Does this turn on adding DWARF 
>>>>>> to the .pcm file? Does it turn off generating DWARF for imported modules 
>>>>>> in the current IR module? Both?
>>>>> 
>>>>> It would emit references to the type from imported modules instead of the 
>>>>> types themselves.
>>>>> Since the module cache is shared, we could, depending on just how
>>>>> expensive this is, turn on DWARF generation for .pcm files by default.
>>>>> I’d like to measure this first, though.
>>>>> 
>>>>>> 
>>>>>> I assume this means that the default remains that we build debug 
>>>>>> information for modules as if we didn't have modules (that is, put 
>>>>>> complete DWARF with the object code). Do you think that's the right 
>>>>>> long-term default? I think it's possibly not.
>>>>> 
>>>>> I think you’re absolutely right about the long term. In the short term, 
>>>>> it may be better to have compatibility by default, but I don’t know what 
>>>>> the official LLVM policy on new features is, if there is one.
>>>>> 
>>>>>> 
>>>>>> How does this interact with explicit module builds? Can I use a module 
>>>>>> built without -g in a compile that uses -g? And if I do, do I get 
>>>>>> complete debug information, or debug info just for the parts that aren't 
>>>>>> in the module? Does -gmodules let me choose between these?
>>>>> 
>>>>> Personally I would expect old-style (full copy of the types) debug 
>>>>> information if I build against a module that does not have embedded debug
>>>>> information.
>>>>> 
>>>>> thanks,
>>>>> adrian
>>>>>> 
>>>>>> 3b. Implement type-signature-based lookup in llvm-dsymutil and lldb.
>>>>>> 4a. Emit an index that maps type signatures to AST section offsets into 
>>>>>> the module container.
>>>>>> 4b. Implement direct loading of AST types in lldb.
>>>>>> 5a. Improve the efficiency by replacing type units in the module debug
>>>>>> info with a lookup table that maps type signatures to DIE offsets.
>>>>>> 5b. Support this format in lldb and llvm-dsymutil.
>>>>>> 
>>>>>> Let me know what you think!
>>>>>> 
>>>>>> cheers,
>>>>>> Adrian
>>>>>> 
>>>>>> [1] For more details about clang modules see
>>>>>> http://clang.llvm.org/docs/Modules.html and
>>>>>> http://clang.llvm.org/docs/PCHInternals.html
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> cfe-dev mailing list
>>>>>> cfe-...@cs.uiuc.edu <mailto:cfe-...@cs.uiuc.edu>
>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev 
>>>>>> <http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev>

_______________________________________________
lldb-dev mailing list
lldb-dev@cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/lldb-dev
