At Apple we have our own accelerator tables and we partially parse everything. We know our accelerator
tables are good and actually useful. The DWARF ".debug_pubnames" table only shows public
names, which makes it useless. DWARF ".debug_aranges" is often incomplete or missing entirely, as
you said below, and the DWARF "accelerator" tables (yes, I put the quotes on there for a reason) aren't
really great because they force you to parse them, and the data contained within them is randomly ordered, so
to make them useful you must parse them and sort them. So we at Apple made accelerator
tables that actually work. The format of these accelerator tables is well defined and available as a
specification in the LLVM/Clang sources. They work well and save us from having to manually index
the DWARF (as all targets except Darwin must) to make sure we have the truth.
I want to clarify a few things from what was stated... When we parse DIEs, we make
a DIE array per compile unit, though each DIE is very small: four uint32_t
values per DIE. I have a way to reduce this even further to just two uint32_t
values. So we are only storing the DIE offsets (32 bits each) and any data
necessary to find the first child, sibling, or parent DIE offset from the
current DIE. This data isn't all that large, and it can then be used to
navigate the DWARF.
On Aug 30, 2013, at 1:27 PM, Richard Mitton <[email protected]> wrote:
DWARF parsing is currently very slow. On my machine, loading the 'clang' binary
into lldb takes 14 seconds (vs. gdb's 29 seconds). The actual I/O cost of
reading that much data is only around 2 seconds.
The DWARF parser has to read the entire .debug_info section into memory and
parse it. It would be great if we didn't have to do this. OS X has the
.apple_names section et al, which allow lldb to automatically have an index
without having to parse anything.
However, this is an Apple extension and does not exist on other platforms.
There are a bunch of accelerator tables that the DWARF spec allows for, but
they're all unusable. pubnames is either absent or incomplete, aranges might
not be present, and if they were generated by the compiler rather than the
linker, then they may not cover all the object files in the binary (causing
lldb to miss symbols). So we end up in the situation we have now, where we
cannot use or trust the accelerators, and have to parse everything anyway to
build our own index.
I believe lldb does this the wrong way. Performing just a simple "b main" test,
it will touch the entire debug_info section *4 times* (at least), which on the clang
binary example is 600MB of data each time:
- pass 1 (extract DIEs)
- pass 2 (index DIEs)
- pass 3 (extract DIEs)
- pass 4 (build arange tables)
I believe the key problems of the current design are:
1) lldb tries to build its own DIE array copy, rather than just referring
to the existing data in place. This adds significant management overhead to
all functions accessing this data.
We do refer to the data in place. Our DIE is actually just the DIE offset of
the DIE itself within the .debug_info section (32 bits), parent and sibling
indexes (32 bits each), and some state that is stored in each DIE because the
compiler would pad the struct anyway. So the total for each DIE is 128 bits, and
we do not store _anything_ else. When we need to access the data in a DIE, we go
to its DIE offset, start parsing the information we need, then
throw that information away.
2) lldb goes to great efforts to avoid reading the entire debug information
(even though it will ultimately need to anyway) and to avoid keeping it in
memory. This in fact causes it to *reload* it several times, as each further
operation performs lazy initialization and causes a re-parse.
This can be easily fixed by making sure that when we scan it the first time, we
generate any needed accelerator tables at that time, to avoid any re-parsing.
Clang and its 600MB is nothing compared to debugging WebKit with 1.8GB of debug
info, or debugging Xcode with 200 shared libraries that all have debug info. So
we need to free up this memory, or LLDB ends up taking up too much memory when
debugging applications with lots of debug info.
If we just accepted that we are forced to load all the data once, it would
actually be faster. My suggestion therefore is to write an optimized
single-pass DWARF indexer to replace the current DWARF loader, with the
following properties:
- Always read debug info exactly once upon module load (unless we can guarantee
apple extensions are used).
That sounds nice in theory, but when you parse the .debug_info, what would you
store in order to be able to navigate the DWARF (parent, child, sibling)? At
the very least you will need to store the DIE offset of each DIE in an array;
that is at least 32 bits per DIE. I have a solution that can get the
DWARFDebugInfoEntry down to two 32-bit values: the DIE offset and the depth.
Then each DIE is only 64 bits. I haven't ported this into LLDB yet, but I can do so.
- Use the entire debug_info section in-place, without trying to build a copy.
We do use the entire .debug_info section in place. We need the DIE offsets at
the very least for all DIEs in a compile unit. If you think we don't, I would
like to hear how you plan to do this.
Not having separate stages for extraction and indexing will allow efficient
data traversal.
This can be fixed in the current situation by determining whether we need to
generate the accelerator tables and making sure we only traverse the DWARF
once. On MacOSX, with the accelerator tables, we don't need to traverse any
DWARF at all, except for the compile units we find through our accelerator
table lookups.
- Make use of the abbreviation tables to pre-build decoder structures for each
DWARF tag type. In most cases we can know the size of each DIE the moment we
read its abbreviation code, and can skip it in one operation if needed without
having to parse the elements. Because we run in one pass, we never have to even
look at DIEs we don't need.
I added this feature a while back thinking it was going to change the world and
make parsing much, much faster, and it didn't on MacOSX. So I never checked it
in. It is still a good thing to do, though, as about 60-70% of DIEs have fixed
sizes, and it might make non-MacOSX parsing faster when you do need to parse
everything.
- Track the parent scope as we go, on a stack, so we don't have to keep doing
lookups which walk the DIE tree. The current parser walks up the tree to find
what scope it's in, even though it already parsed the parent scope container.
Walking the tree is done all over the place, so any solution needs to be able to do the same thing.
Walking the tree is not expensive and is very fast right now: each DIE knows it is stored in a
std::vector<>, so it can find its parent very efficiently by returning "this -
m_parent_idx" and its sibling by returning "this + m_sibling_idx" (there is a check to make
sure the indexes are valid).
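A minimal sketch of that relative-index navigation (member names mirror the description above but the struct itself is illustrative, not LLDB's actual class):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Each DIE lives in a flat std::vector<> per compile unit, so the parent is
// "this - m_parent_idx" and the next sibling is "this + m_sibling_idx":
// pure pointer arithmetic, no tree walk. A zero index means "none".
struct DWARFDebugInfoEntrySketch {
    uint32_t m_offset;       // DIE offset within .debug_info
    uint32_t m_parent_idx;   // 0 means no parent (the compile unit DIE)
    uint32_t m_sibling_idx;  // 0 means no next sibling
    uint32_t m_state;        // abbrev code + flags (fits in struct padding)

    const DWARFDebugInfoEntrySketch *GetParent() const {
        return m_parent_idx ? this - m_parent_idx : nullptr;
    }
    const DWARFDebugInfoEntrySketch *GetSibling() const {
        return m_sibling_idx ? this + m_sibling_idx : nullptr;
    }
};
```

The real code also validates that the computed index stays inside the vector; the sketch only shows the arithmetic.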
I don't see how you would make navigating DWARF efficient by always just
reading from the .debug_info section with nothing else being stored.
- Build arange tables automatically as we go, ignoring any that might already
be present. We have already touched and extracted the range data anyway, it
would be trivial to build an accelerator table for free.
That can just be combined with the initial parsing, but only if needed.
- For strings, we should pre-pool the DWARF string table once up-front, to
avoid repeatedly pooling strings for each DIE.
This can be done, but we often don't turn all DWARF strings into
lldb_private::ConstString values unless we actually parse the information. This will
definitely bloat memory, and we would need to measure by how much before we could say it
is worth it. Then we would need a map from 32-bit string offsets to ConstString
values; I am sure we could use llvm::DenseMap<>.
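A rough sketch of that offset-to-string pool, with std::unordered_map and std::string standing in for llvm::DenseMap and lldb_private::ConstString so the idea is runnable on its own (the class and method names are hypothetical):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Pool strings from .debug_str keyed by their 32-bit section offset, so
// each string is uniqued once instead of once per DIE that references it.
class DebugStrPool {
public:
    DebugStrPool(const char *debug_str, size_t size)
        : m_data(debug_str), m_size(size) {}

    // Return the pooled string for a .debug_str offset, pooling on first use.
    const std::string &GetString(uint32_t offset) {
        auto it = m_pool.find(offset);
        if (it == m_pool.end()) {
            const char *cstr = (offset < m_size) ? m_data + offset : "";
            it = m_pool.emplace(offset, std::string(cstr)).first;
        }
        return it->second;
    }

private:
    const char *m_data;  // contents of .debug_str (NUL-separated strings)
    size_t m_size;
    std::unordered_map<uint32_t, std::string> m_pool;
};
```

The memory cost to measure is the map itself plus one uniqued copy per distinct string, traded against repeated per-DIE pooling.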
With this approach we use the DIE data as-is in memory, without having to make
our own copy.
Again, we don't make a copy. What are we making a copy of? Each DIE is four
32-bit values, and I can reduce it to two 32-bit values. I worry that navigation
will be slow if we don't store anything for DIEs, but I would be happy to see
your implementation.
Parent chains should ideally only be used during parsing. If parents/siblings
are really needed after the initial parse, one easy solution would be to just
store that in a separate hash table.
If you just store the following for each DIE:
class DWARFDebugInfoEntry {
    uint32_t m_offset;
    uint32_t m_depth;
};
then you can reduce memory usage at the expense of slower navigation (getting
the parent, child, or sibling).
I welcome discussion on this. I think it's important for lldb to not have any
delays on loading programs, and as we cannot control what the compilers will
supply to us, we have to address this on our end.
I think a good start is to just generate the accelerator tables during the
initial parse, making sure we only touch the .debug_info once to index by
name and address, then throw it all away (to save memory) and only partially
parse what we need, when we need it, after that. This would get us some
performance benefit.
Any solution _must_ be able to do the following:
- Not hog memory
- Quickly navigate parent/child/sibling
- Quickly find any DIE by DIE offset
- Partially parse only what is needed
The solution we have now relies heavily on the Apple DWARF accelerator
tables to get its performance. Back before we had these tables we did index
everything, and we ran extensive memory performance tools on LLDB to stop it
from using too much memory.
I am all for improving performance, but we don't need all DWARF parsed up front. The
Apple accelerator tables work great, and they can easily be enabled in non-Darwin builds of
clang, and in other compilers. The Apple accelerator tables are being proposed to become
the standard DWARF accelerator tables for DWARF 5. They are up against the GDB index
tables, which to us are not that useful, as the table data says "foo" is in
compile unit 0x12232000 but doesn't specify the actual DIE within that compile unit...
So, wrapping things up:
1 - We should fix the multiple-pass .debug_info issue when we need to generate
the index tables, as this is lame. We should generate the accelerator tables,
only if they are needed, during the initial parse. This initial parse doesn't
need to happen at all if we have the Apple accelerator tables. We also need to
parse everything needed for address and name lookups in this initial pass,
again only if needed.
2 - Feel free to add code to DWARFAbbreviationDeclaration to ask it for its
byte size; if it returns non-zero, we can just skip the entire DIE
data during the initial parse, unless of course we need to index. If it
returns zero, the DIE will need to be parsed one attribute/form at a time.
_______________________________________________
lldb-dev mailing list
[email protected]
http://lists.cs.uiuc.edu/mailman/listinfo/lldb-dev