Yeah, I did notice that, but I think it makes more sense from our point of view to worry about supporting what compilers generate today, rather than waiting for a standard that doesn't even exist yet.

Jason Molenda wrote:
It's worth noting that there is a proposal in progress in the DWARF standards 
committee to add these new accelerator tables to the next version of the 
standard.  There is widespread agreement that the current accelerator tables 
are not sufficient for what a debugger needs -- or even specified well enough 
that debuggers can know what will be in them from different compilers -- so 
hopefully some form of these tables may be accepted into the standard.

On Aug 30, 2013, at 4:08 PM, Greg Clayton <[email protected]> wrote:

At Apple we have our own accelerator tables, and we only partially parse everything. We know our accelerator
tables are good and actually useful. The DWARF tables like ".debug_pubnames" only show public
names, which makes them useless. DWARF ".debug_aranges" is often incomplete or missing entirely, like
you said below, and the DWARF "accelerator" tables (yes, I put the quotes on there for a reason) aren't
really great because they force you to parse them, and the data contained within them is randomly ordered, so
to make them useful you must parse them and sort them. So we at Apple made accelerator
tables that actually work. The format of these accelerator tables is well defined and available as a
specification in the LLVM/Clang sources. They work well and stop us from having to manually index (like all
targets except Darwin) the DWARF to make sure we have the truth.

I want to clarify a few things from what was stated... When we parse DIEs, we make
a DIE array per compile unit, though each DIE is very small: 4 uint32_t
objects per DIE. I have a way to reduce this even further to just 2 uint32_t
objects. So we are only storing the DIE offsets (32 bits each) and any data
necessary to be able to find the first child, sibling, or parent DIE offset
from the current DIE. So this data isn't all that huge, and it can then be
used to navigate the DWARF.


On Aug 30, 2013, at 1:27 PM, Richard Mitton <[email protected]> wrote:

DWARF parsing is currently very slow. On my machine, loading the 'clang' binary 
into lldb takes 14 seconds (vs. gdb's 29 seconds). The actual I/O cost of 
reading that much data is only around 2 seconds.

The DWARF parser has to read the entire .debug_info section into memory and 
parse it. It would be great if we didn't have to do this. OS X has the 
.apple_names section et al, which allow lldb to automatically have an index 
without having to parse anything.

However, this is an Apple extension and does not exist on other platforms. 
There are a bunch of accelerator tables that the DWARF spec allows for, but 
they're all unusable. pubnames is either absent or incomplete, aranges might 
not be present, and if they were generated by the compiler rather than the 
linker, then they may not cover all the object files in the binary (causing 
lldb to miss symbols). So we end up in the situation we have now, where we 
cannot use or trust the accelerators, and have to parse everything anyway to 
build our own index.

I believe lldb does this the wrong way. Performing just a simple "b main" test, 
it will touch the entire debug_info section *4 times* (at least), which on the clang 
binary example is 600MB of data each time:

- pass 1 (extract DIEs)
- pass 2 (index DIEs)
- pass 3 (extract DIEs)
- pass 4 (build arange tables)

I believe the key problems of the current design are:
   1) lldb tries to build its own DIE array copy, rather than just referring
to the existing data in-place. This adds a significant management overhead to 
all functions accessing this data.
We do refer to the data in place: our DIE is actually just the DIE offset of
the DIE itself within the .debug_info section (32 bits), parent and sibling
indexes (32 bits each), and some state that is stored in each DIE because the
compiler will pad the DIE anyway. So the total for each DIE is 128 bits. We do
not store _anything_ else. When we need to access the data in each DIE, we go
to the DIE offset of the DIE, start parsing the information we need, then
we throw that information away.

   2) lldb goes to great efforts to avoid reading the entire debug information 
(even though it will ultimately need to anyway) and to avoid keeping it in 
memory. This in fact causes it to *reload* it several times, as each further 
operation performs lazy initialization and causes a re-parse.
This can be easily fixed by making sure that, when we scan it the first time, we
generate any needed accelerator tables right then, to avoid any re-parsing.
Clang and its 600MB is nothing compared to debugging WebKit with 1.8GB of debug 
info, or debugging Xcode with 200 shared libraries that all have debug info. So 
we need to free up this memory, or LLDB ends up taking up too much memory when 
debugging applications with lots of debug info.

If we just accepted that we are forced to load all the data once, it would 
actually be faster. My suggestion therefore is to write an optimized 
single-pass DWARF indexer to replace the current DWARF loader, with the 
following properties:

- Always read debug info exactly once upon module load (unless we can guarantee
the Apple extensions are used).
That sounds nice in theory, but when you parse the .debug_info, what would you 
store in order to be able to navigate the DWARF (parent, child, sibling)? At 
least you will need to store the DIE offset of each DIE in an array. That is at 
least 32 bits. I have a solution that can get the DWARFDebugInfoEntry down to
two 32-bit values: the DIE offset and the depth. Then each DIE is only 64 bits. I
haven't ported this into LLDB yet, but I can do this.

- Use the entire debug_info section in-place, without trying to build a copy.
We do use the entire .debug_info section in place. We need the DIE offsets at 
the very least for all DIEs in a compile unit. If you think we don't, I would 
like to hear how you plan to do this.

Not having separate stages for extraction and indexing will allow efficient 
data traversal.
This can be fixed in the current situation by determining whether we need to
generate the accelerator tables and making sure we only traverse the DWARF
once. On MacOSX, with the accelerator tables, we don't need to traverse any
DWARF, except for the compile units we find through our accelerator table
lookups.

- Make use of the abbreviation tables to pre-build decoder structures for each
DWARF tag type. In most cases we can know the size of each DIE the moment we
read in its abbreviation code, and can skip it in one operation without
having to parse the elements. Because we run in one pass, we never even have to
look at DIEs we don't need.
I added this feature a while back thinking it was going to change the world and
make parsing much, much faster, and it didn't on MacOSX, so I never checked it
in. It is a good thing to do though, as about 60-70% of DIEs have fixed sizes,
and it might make non-MacOSX parsing faster when you do need to parse
everything.

- Track the parent scope as we go, on a stack, so we don't have to keep doing 
lookups which walk the DIE tree. The current parser walks up the tree to find 
what scope it's in, even though it already parsed the parent scope container.
Walking the tree is done all over the place, so any solution needs to be able to do the same thing.
Walking the tree is not expensive and is very fast right now: each DIE knows it is stored in a
std::vector<> and can calculate its parent very efficiently by returning "this -
m_parent_idx", and its sibling by doing "this + m_sibling_idx" (there is a check to make
sure the indexes are valid).

I don't see how you would make navigating DWARF efficient by always just 
reading from the .debug_info section with nothing else being stored.

- Build arange tables automatically as we go, ignoring any that might already 
be present. We have already touched and extracted the range data anyway, it 
would be trivial to build an accelerator table for free.
That can just be combined with the initial parsing, but only if needed.

- For strings, we should pre-pool the DWARF string table once up-front, to 
avoid repeatedly pooling strings for each DIE.
This can be done, but we often don't make all DWARF strings into
lldb_private::ConstString values unless we actually parse the information. This will
definitely bloat memory, but we would need to see by how much before we can say it
works. Then we need a map from 32-bit string offsets to ConstString
values. I am sure we can use llvm::DenseMap<>.
With this approach we use the DIE data as-is in memory, without having to make 
our own copy.
Again, we don't make a copy. What are we making a copy of? Each DIE is 4 32-bit
values, and I can reduce it to 2 32-bit values. I worry navigation will be slow if
we don't store anything for DIEs, but I would be happy to see your
implementation.

Parent chains should ideally only be used during parsing. If parents/siblings 
are really needed after the initial parse, one easy solution would be to just 
store that in a separate hash table.
If you just store the following for each DIE:

class DWARFDebugInfoEntry {
    uint32_t m_offset;
    uint32_t m_depth;
};

Then you can reduce memory usage at the expense of having navigations (get 
parent/child/sibling) be slower.

I welcome discussion on this. I think it's important for lldb to not have any 
delays on loading programs, and as we cannot control what the compilers will 
supply to us, we have to address this on our end.
I think a good start is to generate the accelerator tables during the initial
parse, making sure we only touch the .debug_info once to index by
name and address, then throw it all away (to save memory) and only partially parse what
we need, when we need it, after that. This would get us some performance
benefit.

Any solution _must_ be able to do the following:
- Not hog memory
- Quickly navigate parent/child/sibling
- Quickly find any DIE by DIE offset
- Support partial parsing

The solution we have now relies heavily on the Apple DWARF accelerator
tables to get its performance. Back before we had these tables we did index
everything, and we ran extensive memory performance tools on LLDB to stop it
from using too much memory.

I am all for improving performance, but we don't need all the DWARF parsed up front. The
Apple accelerator tables work great, and they can easily be enabled in non-Darwin builds of
clang, and in other compilers. The Apple accelerator tables are being proposed as
the standard DWARF accelerator tables for DWARF 5. They are up against the GDB index
tables, which to us are not that useful, as the table data says "foo" is in
compile unit 0x12232000 but doesn't specify the actual DIE within that compile unit...

So wrapping things up:
1 - We should fix the multiple .debug_info pass issue when we need to generate
the index tables, as this is lame. We need to generate the accelerator tables, only
when they actually need to be generated, during the initial parse. That initial parse
doesn't need to happen at all if we have the Apple accelerator tables. We also need to
parse all the stuff needed for addresses and names in this initial pass, again only
if needed.
2 - Feel free to add code to DWARFAbbreviationDeclaration to ask it for its
byte size; if it returns non-zero, we can just skip the entire DIE
data during the initial parse, unless of course we need to index. If it
returns zero, the DIE will need to be parsed one attribute/form at a time.

_______________________________________________
lldb-dev mailing list
[email protected]
http://lists.cs.uiuc.edu/mailman/listinfo/lldb-dev
