I know there are two points of view here, so I will give my two cents:

If it comes down to something as easy as "always check for NULL before doing 
something", or anything similarly cheap, I would say just code around it. 
Don't "assert(...)" or "llvm_unreachable(...)". Things that fall into this 
category: switch statements with defaults that catch unexpected values, 
checking a var/ivar for NULL, checking whether a collection is empty, and 
bounds checking. 

One example: LLDB has its own object file readers for mach-o and ELF. Mach-o 
has load commands in the file header, and each load command starts with a 
"cmd" uint32_t whose value is an enum, followed by a "cmdsize" uint32_t giving 
the length of the command's data. A while back, before changes were made, the 
LLVM mach-o file parser wanted to assert or llvm_unreachable for any load 
command that it didn't recognize. That kind of thing can't happen in a 
debugger. We will load files in the future that weren't made with the exact 
tool versions LLDB was built against, and we can't assert and die saying "I 
don't know what this blob of stuff is". This is a very easy call to make: if 
we don't understand the command, we can skip it, because we know its length. 
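
Here is a rough sketch of that skip-by-length idea, assuming a raw in-memory 
buffer rather than LLDB's actual ObjectFileMachO code (the struct layout 
follows mach-o's load_command header; the helper name and error handling are 
illustrative):

#include <cstddef>
#include <cstdint>
#include <cstring>

// Every mach-o load command starts with a 32-bit "cmd" and a 32-bit
// "cmdsize", so an unrecognized cmd can be stepped over by advancing
// cmdsize bytes instead of asserting.
struct LoadCommandHeader {
  uint32_t cmd;
  uint32_t cmdsize;
};

bool ParseLoadCommands(const uint8_t *data, std::size_t size, uint32_t ncmds) {
  std::size_t offset = 0;
  for (uint32_t i = 0; i < ncmds; ++i) {
    if (offset + sizeof(LoadCommandHeader) > size)
      return false; // truncated file: report an error, don't crash
    LoadCommandHeader lc;
    std::memcpy(&lc, data + offset, sizeof(lc));
    if (lc.cmdsize < sizeof(LoadCommandHeader) || offset + lc.cmdsize > size)
      return false; // malformed cmdsize: again an error, not an assert
    switch (lc.cmd) {
    case 0x19: /* LC_SEGMENT_64: parse it */ break;
    case 0x02: /* LC_SYMTAB: parse it */ break;
    default:
      // Unknown load command: no assert or llvm_unreachable, just step
      // over it using its length.
      break;
    }
    offset += lc.cmdsize;
  }
  return true;
}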

Same goes for DWARF: we can't assert if we don't recognize a DWARF tag or 
attribute. DW_FORM_XXXX values that the debugger doesn't recognize can 
currently cause us to crash, since they stop us from being able to parse the 
file. I would say this shouldn't crash us. We should report "I don't know how 
to parse this debug information" and move on; there might be many other debug 
files we can parse.
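
A minimal sketch of the same idea for forms, not LLDB's actual DWARFFormValue 
API: map the DW_FORM_* values we know to their fixed sizes and turn anything 
unknown into an error the caller can report.

#include <cstddef>
#include <cstdint>

// Returns false for forms we don't understand so the caller can say
// "I don't know how to parse this debug information" instead of asserting.
bool GetFixedFormSize(uint16_t form, uint8_t address_size, std::size_t &size) {
  switch (form) {
  case 0x01: size = address_size; return true; // DW_FORM_addr
  case 0x0b: size = 1;            return true; // DW_FORM_data1
  case 0x05: size = 2;            return true; // DW_FORM_data2
  case 0x06: size = 4;            return true; // DW_FORM_data4
  case 0x07: size = 8;            return true; // DW_FORM_data8
  default:
    return false; // unknown or variable-length form: error, not an assert
  }
}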

Just because Xcode runs LLDB out of process doesn't mean that CLion, Nuclide, 
AppCode and the many other IDEs that link against a native LLDB.framework or 
liblldb.so don't matter. When you run in that environment, you don't control 
your heap: the LLDB shared library lives in the same heap as all of the other 
code in the current application. The heap can get messed up sometimes. Once 
your heap gets messed up, it is nice to try and survive if you can and if it 
is easy. So saying "I put an assert in my implementation because I didn't want 
to check if something was NULL" is a poor excuse when the check would have 
been trivial to add. 

Also, when you put an assert in your implementation, are you assuming that 
people will read through all of the implementation code before using it? I 
usually just look at the header file, and header files often don't say "don't 
call this function with NULL". We just run into it later, when NULL happens to 
be passed to that function and we crash.

So I say make your code bulletproof when it is easy, and stop the crashes that 
aren't necessary whenever possible.


Following up on the approach to fix resuming core files from earlier:

>>> Going back to the original fix, I think most observations are very 
>>> pertinent - yes, the result is not ideal. This is because the current 
>>> logic, changing the state in Process::Resume() before the whole operation 
>>> completes successfully is ... unfortunate. So here are the options I see:
>>> 
>>> 1. Fix Process::Resume() so we only change the state late, after everything 
>>> succeeds
>>> 2. Rollback state change if anything fails while resuming
>>> 3. Patch the specific case of "resume is not supported"
>>> 4. Do nothing

Process already has "Status Process::WillResume()" for this very reason. Just 
have the Windows mini dump core file plug-in override it. 
Process::PrivateResume() already checks it:

Status Process::PrivateResume() {
  Status error(WillResume());
  // Tell the process it is about to resume before the thread list.
  if (error.Success()) {
    ...
    SetPrivateState(eStateRunning);
    SetPrivateState(eStateStopped);
  }
  return error;
}

Problem solved, no?
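
A minimal sketch of what that override might look like, assuming the mini dump 
plug-in class name and this Status API spelling (the exact names in the tree 
may differ):

Status ProcessMinidump::WillResume() {
  Status error;
  // Refuse the resume up front so PrivateResume() fails before any state
  // change happens.
  error.SetErrorString("resume is not supported for mini dump core files");
  return error;
}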


> On Sep 11, 2017, at 2:57 PM, Jim Ingham via lldb-commits 
> <lldb-commits@lists.llvm.org> wrote:
> 
> 
>> On Sep 11, 2017, at 12:24 PM, Leonard Mosescu <mose...@google.com> wrote:
>> 
>> Thanks Jim, I'll give option #1 a try and upload a new patch.
>> 
>> I think your divisions into options is too simplistic.  lldb is built of many 
>> different subsystems, and a lot of "inconsistent state" can be confined to 
>> one subsystem or other.  So for instance, if one CU's debug is unrecoverably 
>> damaged, we can just pretend that that .o file was built w/o debug info and 
>> the user can do lots of useful debugging still.  It is much better if we can 
>> say "Found irrecoverable errors in this CU, dumping debug info" and continue 
>> rather than crashing because one CU that the user may not even care about is 
>> all horked up.  Similarly a single JIT execution can get itself all balled 
>> up and yet if we can discard the work it did we can continue doing lots of 
>> other things.
>> 
>> The key here is the recoverable part - if you can isolate the state and 
>> recover (throw-away, rollback), then I agree, that can be a preferable 
>> approach. My point is about truly irrecoverable states - WTF states where 
>> you don't expect to get into, can't do anything meaningful in response and 
>> will likely cause cascading issues.
> 
> Sure, but whether a state is recoverable is determined in part by how hard 
> you are willing to work to recover.  In some tools, you won't get enough 
> benefit from making recovery possible to justify the effort.  I think the 
> compiler is probably one such tool.  But I assert that lldb is a tool where 
> there are real human beings that will thank us each time we manage to 
> side-step some little present left in our pathway by the debug info, the 
> system API's, the way flakey programs that are about to crash behave, etc...
> 
> And how much you rule out is also not a unitary thing.  For instance, when 
> swift is reading in its module files to recover swift type information, it 
> can (and still does way too often) find itself in an inconsistent state such 
> that if you try to look at the module any further, swift is likely to crash.  
> So we changed all accesses to the SwiftASTContext to check that the context 
> was still copacetic, and if it was not, we would return an error.  Then 
> Enrico actually put in a build phase parser that went through the lldb 
> sources and ensured that all functions that used AST contexts checked that 
> they were okay before touching them, so you couldn't make the mistake of not 
> checking & handling the error.  That way, even if one swift module was bad, 
> the debugging session could go on safely.
> 
> It took a fair bit of work to get this right, but it was definitely worth the 
> effort.  
> 
> BTW, that's what I meant when I said that if you assert in all the 
> subcomponents you're putting the decision of how important an error at the 
> wrong place in the stack.  For the swift compiler, not being able to read in 
> its swift module is an irrecoverable state.  For lldb it is not.
> 
>> 
>> As a side note, I would *love* a generic roll-back mechanism in LLDB. Not 
>> only that it can give us a viable alternative to failing fast, but it can 
>> also allow things like LLDB-command interruption (which is one thing I'd 
>> like to take a closer look when I get a chance)
> 
> lldb command interruption requires two parts, one is the rollback, and the 
> other is to have someone listening for interruption.  To give an instance 
> that works in lldb today, if you run the "expr" command, while we are in the 
> process of compiling there's no way to interrupt "expr" because clang doesn't 
> listen for interrupts.  But once the expression is running, we're back in the 
> lldb event loop and so we can handle interruption.  And of course if you 
> interrupt an expression while running it knows how to clean up after itself.  
> That wasn't done for interruption primarily, rather "expr" already had to 
> know that to be able to clean up from expressions that crashed or raised 
> signals while running.
> 
> I don't think we need a general "interrupt me anywhere you want" mechanism to 
> support command interruption however.  Most of the time-consuming work in 
> lldb comes in self-contained chunks (reading in symbols from shared 
> libraries, reading in the debug info index tables, etc) or happens while lldb 
> is actually waiting for something to happen (that part already is 
> interruptible at a low level).  We try to do work as lazily as possible, so 
> if the chunk boundaries and the lazy work boundaries line up you won't really 
> need to roll-back at all.   The model where you check for interrupts after 
> each chunk of work is done will I think be responsive enough for user's 
> needs, and also greatly simplify the work to handle interruption.  This is a 
> little more manual, since you have to insert appropriate "yields", but I 
> think will be good enough for our purposes and much easier to make stable.
> 
> This is something we've wanted to add to lldb since the start, so if you're 
> interested in looking into this further that would be great.
> 
> Jim
> 
>> 
>> On Mon, Sep 11, 2017 at 10:55 AM, Jim Ingham <jing...@apple.com> wrote:
>> 
>>> On Sep 11, 2017, at 10:24 AM, Leonard Mosescu <mose...@google.com> wrote:
>>> 
>>> I think everyone is in agreement that input should be sanitized, and 
>>> ill-formed input should not trigger an abrupt termination if there's a way 
>>> to recover (which is not always feasible).
>>> 
>>> The conversation started around fail-fast for inconsistent and 
>>> unrecoverable state, where I think the math is very clear in this case. 
>>> Using one of the comments about the long-lasting debugging sessions as an 
>>> example, let's say that we have a bug and the debugger state gets corrupted 
>>> after 1h of debugging:
>>> 1. debugger crashes immediately: lost 1h of debugging -> angry bug report 
>>> (or ideally automated bug report)
>>> 2. debugger limps along for a few more hours behaving erratically (not 
>>> necessarily in an obvious fashion)
>>> a. developer gives up after many hours of debugging without understanding 
>>> what's going on
>>> b. developer figures out that the debugger is unreliable
>>> 
>>> While it is choosing between two evils, I think #1 is clearly the right 
>>> choice: #2 results in more lost time and frustration, less chance of 
>>> getting a bug report and can result in a general impression that the 
>>> debugger is not to be trusted (which for a developer tool is infinitely 
>>> worse than an unstable tool)
>> 
>> I think your divisions into options is too simplistic.  lldb is built of 
>> many different subsystems, and a lot of "inconsistent state" can be confined 
>> to one subsystem or other.  So for instance, if one CU's debug is 
>> unrecoverably damaged, we can just pretend that that .o file was built w/o 
>> debug info and the user can do lots of useful debugging still.  It is much 
>> better if we can say "Found irrecoverable errors in this CU, dumping debug 
>> info" and continue rather than crashing because one CU that the user may not 
>> even care about is all horked up.  Similarly a single JIT execution can get 
>> itself all balled up and yet if we can discard the work it did we can 
>> continue doing lots of other things.
>> 
>> So I think there are lots of places where you can get into a bad state 
>> locally, but that doesn't mean #2 is the outcome.  Sometimes you even get 
>> into a state where a specific target has done something totally 
>> incomprehensible.  In that case, you may have to jettison that target, but 
>> that might not be the only target/debugger in this lldb session.  So you 
>> still wouldn't want to quit, those targets are probably fine.
>> 
>> As you note, this is a little off from the original discussion which was 
>> more about setting up and enforcing logic flows using the lldb_private 
>> API's.  If you really have to call B only after calling A, and you can 
>> clearly signal to yourself that A was called, then it is okay to put an assert 
>> there.  But in your case, CanResume seems like an optimization - and you are 
>> clearly using it as such to avoid having to fix something deeper down in 
>> lldb.  But in that case DoResume can error for other reasons, and so the 
>> system should deal with it.  So adding another hard requirement was 
>> inelegant and exiting if it wasn't met just made that more obvious...
>> 
>>> 
>>> --------------------------------------
>>> 
>>> Going back to the original fix, I think most observations are very 
>>> pertinent - yes, the result is not ideal. This is because the current 
>>> logic, changing the state in Process::Resume() before the whole operation 
>>> completes successfully is ... unfortunate. So here are the options I see:
>>> 
>>> 1. Fix Process::Resume() so we only change the state late, after everything 
>>> succeeds
>>> 2. Rollback state change if anything fails while resuming
>>> 3. Patch the specific case of "resume is not supported"
>>> 4. Do nothing
>>> 
>>> The proposed fix is #3, since it's the least intrusive (i.e. doesn't change 
>>> the core resume logic) patch for a very specific and narrow case. IMO the 
>>> "right" fix is #1, but it seems a more radical change. #2 seems tempting in 
>>> terms of code changes, but I find it less compelling overall since 
>>> rollbacks are generally fragile if the code is not designed with that in 
>>> mind.
>>> 
>>> So, LLDB experts, what do you think? I'd be happy to go with #1 if someone 
>>> with experience in this area can review the change. Anything else that I 
>>> missed?
>>> 
>> 
>> Seems like #1 is the right choice if you're willing to give it a try.  You 
>> are right that this code is tricky, but it would be worth the effort to fix 
>> it right, if for no other reason than to get some other eyes on it.  I'm 
>> happy to answer any questions as you go along.  This code has ping-ponged 
>> back and forth between Greg and me, so we should both be able to help you 
>> out.
>> 
>> Jim
>> 
>>> Thanks!
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Sun, Sep 10, 2017 at 1:52 PM, Jim Ingham via lldb-commits 
>>> <lldb-commits@lists.llvm.org> wrote:
>>> 
>>>> On Sep 9, 2017, at 7:38 PM, Zachary Turner <ztur...@google.com> wrote:
>>>> 
>>>> Sure, but reading a core file involves handling user input. Obviously that 
>>>> has to be sanitized and you can't crash on user input. I don't think 
>>>> there's ever been any disagreement about that. On the other hand, if you 
>>>> just type an expression in the debugger and expect an answer back, after 
>>>> clang says the ast is valid, the whole stack should be asserts all the way 
>>>> down, because everything is validated
>>> 
>>> Yes, but even that ends up not matching with the facts on the ground, 
>>> because the debug information for types as used by lldb has to be produced 
>>> from debug info, and the debug info can be malformed and not infrequently 
>>> is.  So at some point clang is going to ask for something that it assumes 
>>> must always be available because if it had read the types in from header 
>>> files it would have asserted at an early stage but when it is fed types 
>>> from debug info incrementally - which is the only efficient way the 
>>> debugger can do it - is going to cause something to look wrong at a point 
>>> where that “shouldn’t be possible.”  clang's rigid structure means it can’t 
>>> cope with this and will crash on something the user didn’t themselves get 
>>> wrong.
>>> 
>>> According to your classification, we should really consider debug 
>>> information “user input” as well.  But that means we’re wiring “user input” 
>>> into a pretty deep place in clang and you’d probably have to weaken your 
>>> strict assert model to handle that. And it makes me wonder if there’s 
>>> anything the debugger consumes that wouldn’t fall into this category?
>>> 
>>> For instance, we also let plugins provide threads and their stacks, and we 
>>> try pretty hard not to crash and burn if these plugins get it a little 
>>> wrong.  So process state really needs to be treated as “user data” as well. 
>>>  And since we use SPI’s for most of this info that we are pretty much the 
>>> only clients of, they can be expected to be flakey as well, again a fact 
>>> that we should not inflict on users of the debugger if we can help it.
>>> 
>>> The answer of trying to do everything dangerous out of process I think will 
>>> make a  slow and fragile architecture.  And while I agree with the llvm.org 
>>> decision not to use C++ exceptions because they make reasoning about the 
>>> program more difficult, replacing that with SIGSEGV as the exception 
>>> mechanism of choice seems really weak.  You are not the first person to 
> point out that we could use the crash-protected threads, but I remain 
>>> very unenthusiastic about this approach...
>>> 
>>> Moreover, there are a bunch of problematic areas in lldb, but I don’t think 
>>> any of them are problematic for the reasons that you are describing here.  
>>> There are areas of weak memory management in the data formatters & some 
>>> places where we don’t manage locking right in handling process stops & the 
>>> reconstruction of program state after that, and those cause the vast 
>>> majority of the infrequent but annoying crashes we see.  I don’t myself 
>>> believe that putting in this assert everywhere architecture would pay off 
>>> in reduced crashes to compensate for the brittleness it would introduce.
>>> 
>>> Finally, and most crucially, no failure of any expression evaluation and no 
>>> malformed type is worth taking down a debug session, and it’s our 
>>> responsibility working on lldb to make sure it doesn't.  If that makes our 
>>> job a little harder, so be it.  This was Jason’s point earlier, but it is 
>>> worth reiterating because it makes lldb a very different kind of program 
>>> from all the other programs in the suite collected under the llvm project.  
>>> If the compiler gets into an inconsistent state compiling a source file, 
>>> it’s fine for it to just exit.  It wasn’t going to do any more useful work 
>>> anyway.  That is definitely NOT true of lldb, and makes reasonable the 
>>> notion that general solutions that seem appropriate for the other tools are 
>>> not appropriate for lldb.
>>> 
>>> 
>>> Jim
>>> 
>>> 
>>>> On Sat, Sep 9, 2017 at 6:53 PM Jim Ingham <jing...@apple.com> wrote:
>>>> I think we are talking at cross purposes.  Seems to me you are saying “If 
>>>> we can assert that the answers to questions I ask must always be copacetic, 
>>>> then we can express that in software, and that will make things so much 
>>>> easier".  I’m saying "my experience of the data debuggers have to deal 
>>>> with is such that assuming such assertions will lead you to over-react to 
>>>> such errors.  Instead you should write your code so that if somebody gives 
>>>> you bad data you don’t fall over allowing the people who called you to 
>>>> decide how important the error was.”  Every core file that was written out 
>>>> by OS X for years had a section that was ill-formed.  Asserting when you 
>>>> get an ill-formed object file might seem a good way to ensure that you 
>>>> don’t have to make guesses that might lead you astray.  But the people who 
>>>> need to load core files on OS X are rightly unmoved by such arguments if 
>>>> lldb disappears out from under them when reading in core files.
>>>> 
>>>> Jim
>>>> 
>>>> 
>>>> 
>>>>> On Sep 9, 2017, at 1:31 PM, Zachary Turner <ztur...@google.com> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> On Sat, Sep 9, 2017 at 12:04 PM Jim Ingham <jing...@apple.com> wrote:
>>>>> 
>>>>> I disagree here.  If you can reasonably unwind from an error you should 
>>>>> do so even if you can’t figure out how you could have gotten an answer 
>>>>> you didn’t expect.  If an API is returning a pointer to something you 
>>>>> should assume it might return nullptr unless the API explicitly states 
>>>>> otherwise.
>>>>> But that's exactly what an assert is.  It's an explicit statement by the 
>>>>> API about what should happen.  Which is why by adding them liberally, 
>>>>> these assumptions can then be propagated all the way up through many 
>>>>> layers of the code, vastly simplifying the codebase.
>>>>> 
>>>>> if you have
>>>>> 
>>>>> void *foo(int x) {
>>>>>  void *result = nullptr;
>>>>>  // do some stuff
>>>>> 
>>>>>  assert(x < 0 || result != nullptr);
>>>>>  return result;
>>>>> }
>>>>> 
>>>>> Then you're documenting that if x is greater than 0, the caller doesn't 
>>>>> need to check the return value for nullptr.  Now instead of this:
>>>>> 
>>>>> void *bar(unsigned x) {
>>>>>  void *ptr = foo(x);
>>>>>  if (!ptr) {
>>>>>    // log an error
>>>>>    return nullptr;
>>>>>  }
>>>>>  return ptr;
>>>>> }
>>>>> 
>>>>> You just have
>>>>> 
>>>>> void *bar(unsigned x) {
>>>>>  void *ptr = foo(x);
>>>>>  assert(ptr);
>>>>>  return ptr;
>>>>> }
>>>>> 
>>>>> And now the caller of bar doesn't have to check either.  The code has 
>>>>> greatly reduced complexity due to the butterfly effect of propagating 
>>>>> these assumptions up.
>>>>> 
>>>>> This is a simple example but the point is that building assumptions into 
>>>>> your API is a good thing, because you can enforce them and it vastly 
>>>>> simplifies the code.
>>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 
> 
