Hi Milian, On Fri, 2017-06-02 at 17:03 +0200, Milian Wolff wrote: > Some more debugging and going after my gut feeling brings me to the following > conclusion: The real issue seems to be the on-demand reporting of the elf > file. We used to do: > > Dwarf_Addr pc; > bool isactivation; > > if (!dwfl_frame_pc(state, &pc, &isactivation)) { > pr_err("%s", dwfl_errmsg(-1)); > return DWARF_CB_ABORT; > } > > // report the module before we query for isactivation > report_module(pc, ui); > > This looks safe and fine and actually works most of the time. But passing a > non-null isactivation flag to dwfl_frame_pc potentially leads to a second > unwind step, before we got the change to report the module! I can workaround > this by instead doing > > Dwarf_Addr pc; > bool isactivation; > > if (!dwfl_frame_pc(state, &pc, NULL)) { > pr_err("%s", dwfl_errmsg(-1)); > return DWARF_CB_ABORT; > } > > // report the module before we query for isactivation > report_module(pc, ui); > > if (!dwfl_frame_pc(state, &pc, &isactivation)) { > pr_err("%s", dwfl_errmsg(-1)); > return DWARF_CB_ABORT; > } > > This fixes all the issues in my original email. So sorry for the noise - it > doesn't see as if the unwinding code in elfutils is broken - quite the > contrary! It's just our misuse of the API that is to blame.
That is some very nice detective work. The isactivation code is tricky. The logic is: isactivation = (frame is initial_frame || frame is signal_frame || unwind_frame is signal_frame) The idea being that normally a frame isn't an activation. So the given pc is really the return address, not the call address. But if it is the initial frame or a frame caused by a signal, then the pc really is where the code was when this frame was active. The tricky case is that last case. If the previous frame is a signal frame then this frame is also an activation (so you don't need to subtract 1 to get the actual address that was executing). I am not sure I fully understand anymore why that is. This must be a frame that is itself not an initial or signal frame, but that is called (activated) from a signal frame. But assuming that is the correct semantics then your original observation seems to explain what is going on. The code doesn't account for the fact that during an unwind other modules can be reported that might make it able to unwind the current frame. And it caches the failure state of the unwind of the current frame, so it doesn't retry when asked. I think we could reset the state of the current frame if any more modules are reported. But is it possible for your parser or perf to report modules as soon as it knows about them, not lazily during an unwind? Is reporting lazily at the last possible moment an optimization that really helps? I would assume that you end up reporting all modules anyway. So it actually seems easier to report everything upfront. > May I suggest though to move the isactivation code into a separate function > to > prevent this issue from arising in the future? I.e. it would be nice if the > code above could read: > > > Dwarf_Addr pc; > bool isactivation; > > if (!dwfl_frame_pc(state, &pc)) { > pr_err("%s", dwfl_errmsg(-1)); > return DWARF_CB_ABORT; > } > > // report the module before we query for isactivation > report_module(pc, ui); > > if (!dwfl_frame_is_activation(state)) { > --pc; > } That is a good suggestion in any case. Maybe also introduce dwfl_frame_is_signal_frame () and dwfl_frame_is_initial_frame () in case the user wants to know the exact details. Thanks, Mark