nickdesaulniers added a comment.

(Sorry for the wall of text)

In D74094#4646297 <https://reviews.llvm.org/D74094#4646297>, @tellenbach wrote:

> No real comment on the issue itself but on the example as a former Eigen 
> maintainer (sorry for the noise if that's all obvious for you):

Woah! Deus Ex Machina?

>   auto round (Tensor m) {
>       return (m + 0.5f).cast<int>().cast<float>();
>   }
>
> does not return a Tensor but an expression encoding the computation which is 
> evaluated during the assignment `const Tensor t3 = round(a.log().exp());` 
> using an overloaded `operator=`. With "evaluation" I mean evaluation in the 
> sense of performing the intended mathematical computation.
>
> Many things can go wrong when using `auto` in such cases, of which two seem 
> to be relevant here:
>
> 1. Eigen can (this is an implementation detail(!)) decide to evaluate 
> subexpressions into temporaries. The returned expression would then reference 
> the result of such an evaluation beyond its lifetime.
> 2. If 1. does not happen, the unevaluated expression would reference its 
> arguments. The immediate `0.5f` is copied since that's cheap, but the tensor 
> would be referenced. Your example function `round` takes its parameter 
> by-value and the returned expression would reference it. I'm unsure if the 
> lifetime would be extended in this case (again, the exact details of C++ 
> lifetime rules are not my area of expertise). I think if the lifetime would 
> be extended, ASAN complaining after applying this patch is wrong; if the 
> lifetime is not extended, ASAN should complain and the example is wrong.
>
> Btw, these issues are so common that Eigen documents the recommendation to 
> avoid using `auto` with Eigen types altogether: 
> https://eigen.tuxfamily.org/dox/TopicPitfalls.html#title3

Thanks for the info, https://eigen.tuxfamily.org/dox/TopicPitfalls.html#title3 
is quite helpful.  That helps explain why the return value was deduced to a 
`const Eigen::TensorConversionOp<....`. Sounds like this is some kind of lazy 
evaluation?

Of note, I modified the reduced test case to use a static function with `auto` 
return type. But the code as written originally used a lambda!

  auto round = [](Tensor m) {
    return (m + 0.5f).cast<int>().cast<float>();
  };

So the original author was well intentioned; generally you do want to use 
`auto` for a variable holding a lambda rather than trying to spell out its 
type.  But the author may not have been aware of the advice or hazards 
described by https://eigen.tuxfamily.org/dox/TopicPitfalls.html#title3.  The 
fix we've committed is to give the lambda an explicit trailing return type 
(`->`).

  -auto round = [](Tensor m) {
  +auto round = [](Tensor m) -> Tensor {
     return (m + 0.5f).cast<int>().cast<float>();
   };

I think that is in line with the advice in 
https://eigen.tuxfamily.org/dox/TopicPitfalls.html#title3 (i.e. force the 
expression to be evaluated so that we don't return a partially evaluated 
calculation containing references to temporaries).  So I think the code in 
question (the code we reverted this over) was just wrong.
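
To convince myself of that, here's a minimal sketch of the hazard with made-up 
`Tensor` and `AddExpr` types (not Eigen's real ones), under the assumption 
that the lazy expression object stores a reference to its operand:

  // Hypothetical stand-ins for Eigen's types, just to illustrate the shape of the bug.
  struct Tensor { float data[4]; };

  // Lazy "add a scalar" expression: it stores a *reference* to its operand
  // instead of computing the result eagerly.
  struct AddExpr {
    const Tensor &lhs;
    float rhs;
    Tensor eval() const {
      Tensor out;
      for (int i = 0; i < 4; ++i) out.data[i] = lhs.data[i] + rhs;
      return out;
    }
  };

  // Deduced return type: the caller receives an AddExpr whose `lhs` refers to
  // the by-value parameter `m`, which is dead (at the latest) once the full
  // expression containing the call ends.
  auto round_lazy = [](Tensor m) { return AddExpr{m, 0.5f}; };

  // Trailing return type: forces evaluation into a Tensor while `m` is still alive.
  auto round_eager = [](Tensor m) -> Tensor { return AddExpr{m, 0.5f}.eval(); };

  void demo(const Tensor &t) {
    Tensor ok = round_eager(t);   // fine: fully evaluated before round_eager returns
    auto expr = round_lazy(t);    // expr.lhs now refers to a dead parameter object
    Tensor bad = expr.eval();     // use-after-scope, the kind of thing ASAN flags
    (void)ok; (void)bad;
  }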

One thing that's problematic is that neither clang nor GCC does a great job 
here with `-Wreturn-stack-address` (clang) or `-Wreturn-local-addr` (GCC) 
(https://godbolt.org/z/6oq5nKnY7): clang only spots the obviously incorrect 
examples, and GCC's diagnostic depends on the optimization level.
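
Roughly, the distinction is between cases like the following hypothetical 
minimal pair (not taken verbatim from the godbolt link; exact diagnostics 
depend on compiler version and flags):

  int *obvious() {
    int local = 42;
    return &local;        // the "obvious" case: clang's -Wreturn-stack-address fires
  }

  struct wrapper { int *p; };

  wrapper hidden(int byval) {
    wrapper w;
    w.p = &byval;         // just as dangling once hidden() returns, but escaping
    return w;             // through a struct member is typically not diagnosed
  }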

I wonder if there's anything more we can do to help spot these at compile 
time, during semantic analysis.  It seems that for more complex types it could 
be quite beneficial.  The Eigen examples come to mind (could we catch those, 
for example?).

---

In D74094#4646954 <https://reviews.llvm.org/D74094#4646954>, @rjmccall wrote:

> Okay, thanks, I can see how that works, and I definitely would never have 
> figured that out from just looking at this code.  The formal lifetime of a 
> parameter of type `Tensor` is definitely in the land of 
> implementation-defined behavior, so this code seems to be at best 
> non-portable.

Right, so that was my first question here; it sounds like the optimization is 
correct with respect to the example that broke, though I still could not tell 
you precisely why.  Perhaps: "`round` returns an object, but one that contains 
references to temporary variables."

> Nick, maybe we can take a new look at this patch from that perspective.  
> You've been trying to use very tight lifetime bounds for these objects that 
> only cover the duration of call, which essentially means you're defining the 
> lifetime of parameter objects to just be the call rather than the 
> full-expression (at least, when the parameter type doesn't have a 
> destructor).  In the abstract, that's probably the right rule for Clang to 
> adopt, because I think it matches programmer intuition (it's always wrong to 
> return the address of a local, and that includes by-value parameters), and it 
> means we'd have a consistent rule for all types and targets.  It's also a 
> fairly aggressive rule that's likely to uncover a fair amount of code like 
> this that's assuming longer lifetimes.  I think it's reasonable to say that 
> these examples are all program bugs that should be fixed, but it might be 
> better to pursue that as a *refinement* on top of an initially more 
> conservative rule.  I suspect that that conservative rule will also give us a 
> large fraction of the optimization benefit that we can expect from this 
> change, because at least parameter objects from calls in different 
> full-expressions will be able to share memory.  So we can start by giving 
> these objects full-expression lifetime, chase down any program bugs that that 
> uncovers (which will all *unquestionably* be program bugs under the 
> standard), and then gradually work on landing the more aggressive rule 
> (perhaps even for non-trivial types).

Let me see if I can picture how the lifetimes are expected to work, for my own 
sake.  Let's say we have functions `foo`, `bar`, and `baz` that all take a 
`struct widget` by value and return a `struct widget`.  Then let's say we have 
code like:

  struct widget { long x, y, z, w; };
  
  widget foo(widget);
  widget bar(widget);
  widget baz(widget);
  
  void quux () {
      widget w = foo(bar(baz({ .x = 42, })));
  }

With b7f4915644844fb9f32e8763922a070f5fe4fd29 
<https://reviews.llvm.org/rGb7f4915644844fb9f32e8763922a070f5fe4fd29> reverted, 
we'd get:

  define dso_local void @_Z4quuxv() local_unnamed_addr #0 {
  entry:
    %w = alloca %struct.widget, align 8
    %agg.tmp = alloca %struct.widget, align 8
    %agg.tmp1 = alloca %struct.widget, align 8
    %agg.tmp2 = alloca %struct.widget, align 8
    call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %w) #4
    call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %agg.tmp) #4
    call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %agg.tmp1) #4
    call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %agg.tmp2) #4
    call void @llvm.memset.p0.i64(ptr noundef nonnull align 8 dereferenceable(32) %agg.tmp2, i8 0, i64 32, i1 false)
    store i64 42, ptr %agg.tmp2, align 8, !tbaa !5
    call void @_Z3baz6widget(ptr nonnull sret(%struct.widget) align 8 %agg.tmp1, ptr noundef nonnull byval(%struct.widget) align 8 %agg.tmp2)
    call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %agg.tmp2) #4
    call void @_Z3bar6widget(ptr nonnull sret(%struct.widget) align 8 %agg.tmp, ptr noundef nonnull byval(%struct.widget) align 8 %agg.tmp1)
    call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %agg.tmp1) #4
    call void @_Z3foo6widget(ptr nonnull sret(%struct.widget) align 8 %w, ptr noundef nonnull byval(%struct.widget) align 8 %agg.tmp)
    call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %agg.tmp) #4
    call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %w) #4
    ret void
  }

which //is// very aggressive; each temporary is marked dead immediately after 
the call that consumes it, before the next non-intrinsic `call`. This looks 
like it would be optimal.  IIUC, @rjmccall's proposal is instead to do 
something along the lines of:

  define dso_local void @_Z4quuxv() local_unnamed_addr #0 {
  entry:
    %w = alloca %struct.widget, align 8
    %agg.tmp = alloca %struct.widget, align 8
    %agg.tmp1 = alloca %struct.widget, align 8
    %agg.tmp2 = alloca %struct.widget, align 8
    call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %w) #4
    call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %agg.tmp) #4
    call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %agg.tmp1) #4
    call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %agg.tmp2) #4
    call void @llvm.memset.p0.i64(ptr noundef nonnull align 8 dereferenceable(32) %agg.tmp2, i8 0, i64 32, i1 false)
    store i64 42, ptr %agg.tmp2, align 8, !tbaa !5
    call void @_Z3baz6widget(ptr nonnull sret(%struct.widget) align 8 %agg.tmp1, ptr noundef nonnull byval(%struct.widget) align 8 %agg.tmp2)
    call void @_Z3bar6widget(ptr nonnull sret(%struct.widget) align 8 %agg.tmp, ptr noundef nonnull byval(%struct.widget) align 8 %agg.tmp1)
    call void @_Z3foo6widget(ptr nonnull sret(%struct.widget) align 8 %w, ptr noundef nonnull byval(%struct.widget) align 8 %agg.tmp)
    call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %agg.tmp2) #4
    call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %agg.tmp1) #4
    call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %agg.tmp) #4
    call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %w) #4
    ret void
  }

i.e. a less aggressive optimization where the lifetimes extend to the end of 
the full expression, so the temporaries within a single full expression cannot 
share stack slots.  That's better than the status quo (no lifetime markers at 
all) but still suboptimal, and I don't think it will solve the particular case 
I care about (a Linux kernel driver written in C that passes many aggregates 
by value).
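
To make that concrete for myself (hypothetical example reusing the `widget` 
declarations above): under the full-expression rule, the three nested 
temporaries in `quux` all stay live until the end of the statement, but 
temporaries in separate statements could still share a slot:

  void sequential() {
    // Two separate full expressions: the argument temporary for the first call
    // is dead before the second call's temporary is created, so under the
    // conservative rule the two can still share one stack slot.
    widget a = foo(widget{1});
    widget b = foo(widget{2});
  }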

Do I want to create a flag for that?  Not particularly; it means supporting 
BOTH codegen patterns, and I'd rather have a flag that controls whether we do 
the optimization at all than support two different optimizations.  I'm also 
curious whether you'd expect the default value of the flag to be on (more 
aggressive) or off (less aggressive)?  I don't want to have to add `-Xclang` 
(or `-mllvm`) flags to my build system to opt into secret optimizations.  If 
folks find that this optimization breaks their incorrect code, they can use 
the already existing `-Xclang -disable-lifetime-markers` to opt out (though 
that disables more optimizations than just the one introduced by this change). 
I like that pattern better, since using such flags is itself a red flag to be 
scrutinized during code review.  Then again, GCC has `-fconserve-stack`, which 
clang does not; we could put more aggressive stack-saving optimizations under 
that, though IME in GCC that flag limits inline substitution to a magic 
per-target-architecture number of bytes. Or perhaps hide it under `-O3`?

At the least, if we were to reland this, a release note acknowledging the 
change would be an improvement, something like: "you're not crazy; clang-18 is 
now more precise about minimizing the lifetime of temporaries (since the C++ 
spec says it's implementation defined) in order to reduce stack usage. ASAN is 
your friend. If the code works under `-Xclang -disable-lifetime-markers`, then 
this change is likely the culprit."

Personally, my feeling is: "we found a buggy example of code we broke, and we 
fixed it; let's reland this and tell people to be careful when upgrading."  
But I haven't been around long enough to know what the precedent is here for 
"aggressive optimizations."  Do you feel strongly that we should not just 
reland this, @rjmccall (or anyone else)?

---

Converting the above `widget` example to C, I noticed that we're also missing 
the lifetime markers on compound literals!

  struct widget { long x, y, z, w; };
  
  struct widget foo(struct widget);
  struct widget bar(struct widget);
  struct widget baz(struct widget);
  
  void quux () {
      struct widget w = foo(bar(baz((struct widget){ .x = 42, })));
  }

->

  define dso_local void @quux() local_unnamed_addr #0 {
  entry:
    %w = alloca %struct.widget, align 8
    %agg.tmp = alloca %struct.widget, align 8
    %agg.tmp1 = alloca %struct.widget, align 8
    %.compoundliteral = alloca %struct.widget, align 8    ; <--- NO LIFETIME MARKERS :(
    call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %w) #4
    call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %agg.tmp) #4
    call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %agg.tmp1) #4
    call void @llvm.memset.p0.i64(ptr noundef nonnull align 8 dereferenceable(32) %.compoundliteral, i8 0, i64 32, i1 false)
    store i64 42, ptr %.compoundliteral, align 8, !tbaa !5
    call void @baz(ptr nonnull sret(%struct.widget) align 8 %agg.tmp1, ptr noundef nonnull byval(%struct.widget) align 8 %.compoundliteral) #4
    call void @bar(ptr nonnull sret(%struct.widget) align 8 %agg.tmp, ptr noundef nonnull byval(%struct.widget) align 8 %agg.tmp1) #4
    call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %agg.tmp1) #4
    call void @foo(ptr nonnull sret(%struct.widget) align 8 %w, ptr noundef nonnull byval(%struct.widget) align 8 %agg.tmp) #4
    call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %agg.tmp) #4
    call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %w) #4
    ret void
  }

---

One thing I noticed while simply printing the expression at the point where we 
were about to add the lifetime cleanup: some examples would trigger on 
`CXXConstructExpr`, `CXXTemporaryObjectExpr`, and `CXXOperatorCallExpr`.  
Those feel very C++-specific; I do wonder if we could do something differently 
for C, but I think we should not, since, as @rjmccall said, "we'd have a 
consistent rule for all types and targets", which IMO is nicer and simpler.
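
For reference, this is the kind of source that produces those node kinds, 
using a made-up `Widget` class (the AST can be inspected with 
`clang -fsyntax-only -Xclang -ast-dump`; the exact shape varies a bit by clang 
version):

  struct Widget {
    Widget(int, int);
    Widget(const Widget &);
    Widget operator+(const Widget &) const;
  };

  void take(Widget w);    // by-value parameter of class type

  void demo(const Widget &a, const Widget &b) {
    take(Widget(1, 2));   // CXXTemporaryObjectExpr for `Widget(1, 2)`
    take(a);              // CXXConstructExpr copy-constructing the parameter object
    take(a + b);          // CXXOperatorCallExpr for the overloaded operator+
  }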

---

I tried to write a verifier check to detect when a use was dominated by a call 
to the lifetime end intrinsic.  I don't think that's what's going wrong here, 
but my initial hypothesis was that we were somehow getting this wrong in 
codegen, which would lead to bugs.

This failed on multiple examples in tree; 
llvm/test/CodeGen/AArch64/stack-tagging.ll has some cases that surprised me, 
such as values whose lifetimes restart after being ended:

  call void @llvm.lifetime.start.p0(i64 4, ptr nonnull %x)
  call void @use8(ptr %x) #3
  call void @llvm.lifetime.end.p0(i64 4, ptr nonnull %x)
  
  call void @llvm.lifetime.start.p0(i64 4, ptr nonnull %x)
  call void @use8(ptr %x) #3
  call void @llvm.lifetime.end.p0(i64 4, ptr nonnull %x)

which means we can't simply use a single dominance query to detect uses after 
the lifetime has ended (though I guess this is just a kind of bounds check; 
maybe if the use is additionally dominated by a lifetime start, and that start 
is itself dominated by the end, then we're "resurrected").  A rough sketch of 
the kind of check I mean is below.
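
(Simplified sketch only, not the exact code I tried; it uses just the single 
dominance query, so per the above it would false-positive on the resurrection 
pattern in stack-tagging.ll, and it only looks at direct uses of the alloca:)

  #include "llvm/IR/Dominators.h"
  #include "llvm/IR/Function.h"
  #include "llvm/IR/InstIterator.h"
  #include "llvm/IR/Instructions.h"
  #include "llvm/IR/IntrinsicInst.h"
  #include "llvm/Support/raw_ostream.h"

  using namespace llvm;

  // Report any non-lifetime use of an alloca that is dominated by a
  // lifetime.end for that alloca.  Uses reached through GEPs/casts are ignored
  // to keep the sketch short.
  static void reportUsesAfterLifetimeEnd(Function &F) {
    DominatorTree DT(F);
    for (Instruction &I : instructions(F)) {
      auto *End = dyn_cast<IntrinsicInst>(&I);
      if (!End || End->getIntrinsicID() != Intrinsic::lifetime_end)
        continue;
      // lifetime.end takes (i64 size, ptr); operand 1 is the pointer.
      auto *AI = dyn_cast<AllocaInst>(End->getArgOperand(1)->stripPointerCasts());
      if (!AI)
        continue;
      for (User *U : AI->users()) {
        auto *UseI = dyn_cast<Instruction>(U);
        if (!UseI)
          continue;
        if (auto *UII = dyn_cast<IntrinsicInst>(UseI))
          if (UII->isLifetimeStartOrEnd())
            continue;
        if (DT.dominates(End, UseI))
          errs() << "use after lifetime.end of " << *AI << ": " << *UseI << "\n";
      }
    }
  }
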
Perhaps a separate discussion orthogonal to this bug, but after reading

- https://llvm.org/docs/LangRef.html#llvm-lifetime-start-intrinsic
- https://llvm.org/docs/LangRef.html#llvm-lifetime-end-intrinsic

another question unresolved in my mind is: "Why do we even need intrinsics for 
lifetime?  Why can't we use the initial use as the implicit lifetime start, 
and the final use as the implicit lifetime end?"  I'm certain this is a naive 
question that probably depends on the semantics of a specific language 
frontend (or even the ABI; perhaps that's why the C++ spec says 
"implementation defined".  I can't recall where I saw this, but for some 
reason I suspect the ABI used for Microsoft platforms differs here?).  I also 
don't understand under what circumstances something that is marked dead can 
later be "brought back to life"/resurrected.  Why is that a thing?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D74094/new/

https://reviews.llvm.org/D74094
