On Sat, Apr 9, 2011 at 1:22 AM, Jonathan S. Shapiro <[email protected]> wrote:
> Ben:
>
> I've changed subject. This discussion is important, but I suspect it isn't helping Jeremie much.
>
> On Fri, Apr 8, 2011 at 2:39 PM, Ben Karel <[email protected]> wrote:
>
>> On Fri, Apr 8, 2011 at 4:31 AM, Jonathan S. Shapiro <[email protected]> wrote:
>>
>>> Could you give a concrete example of an LLVM optimization pass that destroys type information in a way that undermines GC? I'd like to experiment with my LLVM-based GC and see if it breaks!
>
> Ben:
>
> I think I let a rant take over inappropriately. I did, at least, acknowledge that I haven't looked at LLVM internals in a while, and that things may well have improved. Back when I last looked, the GCREAD/GCWRITE intrinsics did not exist yet.
>
> What I *can* do is give you an example of a classic optimization that (a) is necessary, but (b) in typical form makes a mess for GC. Consider optimizing something that uses the DOACROSS pattern. In a typical (GC-oblivious) optimizer, you apply a combination of unrolling, strength reduction, and partial evaluation to reduce register pressure. A very common result of this is that array/vector base-pointer registers get merged with the indexing-offset registers into a single pointer that is used to index into the middle of the array/vector. In the worst case, the resulting "indexing" pointer can end up past the end of the array/vector, and a negative offset can end up getting used.
>
> You're probably aware that Boehm has a big, ugly mechanism to deal with such "interior" pointers, and that their solution is basically a hack. They rely (tacitly) on the fact that the same optimization is not, in practice, applied to structure offsets - which is rare in practice, but does in fact happen sometimes.
>
> Now, we don't want to eliminate this optimization.
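To make the transformation concrete, here is a hand-written sketch in C source terms (not actual LLVM output; both function names are invented) of what strength reduction does to an indexed loop:

```c
#include <assert.h>
#include <stddef.h>

/* Before: the base pointer 'a' stays live across the loop.  A
 * relocating collector can find it via the stack map and fix it up
 * if the array moves. */
long sum_indexed(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];              /* address recomputed as a + i each time */
    return s;
}

/* After: base and index are folded into a single derived pointer.
 * 'end' is a one-past-the-end interior pointer, and once the loop is
 * running no register need hold the original base at all - so a
 * relocating collector cannot update 'p' or 'end' unless the compiler
 * recorded that both derive from 'a'. */
long sum_reduced(const long *a, size_t n) {
    long s = 0;
    const long *end = a + n;    /* interior (past-the-end) pointer */
    for (const long *p = a; p != end; p++)
        s += *p;
    return s;
}
```

Both versions compute the same sum; the difference is only in which pointers are live at a GC point.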
> What we want is for the optimization phase to emit information that either (a) preserves the base pointer somewhere and identifies these combined values as offsets relative to that base - and then makes that info part of the register map - or (b) provides an algorithm by which the collector can re-derive the base pointer at runtime, or (c) at least clearly identifies the register as "likely interior", and restricts the use of interior pointers to cases that are known (by the optimizer) to be supported by the runtime.
>
> This particular example is so universal and so important that I would be surprised if LLVM does not contain it. If you happen to know what they do in this case, and how they avoid boogering a relocating collector, I'd be very interested to hear about it.
>
> In any case, it's an example of the kind of thing where an optimization pass written by someone who is thinking in terms of non-managed source languages will be written in such a way that you end up re-writing the pass when you have to accommodate the new case. We are then presented with the difficulty of designing test cases that will let us determine which passes are currently "relocating GC safe" and which passes are not.

I don't fully understand the example, so I don't know what LLVM winds up doing. But I don't see anything an optimization pass must do for GC correctness beyond preserving the same semantics that needed to be preserved without GC primitives. The LLVM IR's semantics aren't fundamentally changed by gcread/gcwrite. If the values that flow to those primitives are corrupted by an optimization pass, it's just as wrong as if base-and-derived-pointer parameters to printf were changed in a semantics-violating way.

>> It's also worth noting that there are neither required nor default optimization passes, at either the LLVM or MC levels. It's all opt-in...
>
> I agree, but ultimately that is a cop-out.
> The point of going to LLVM is that we get to take advantage of existing infrastructure. If we are forced to turn the existing infrastructure off, then what's the point of using LLVM? I do understand that it isn't that black and white, but you see my point.

More importantly, giving clients complete control over optimization passes helps diagnose whatever miscompilations might occur. So if you did find LLVM producing wrong code, you could zero in on the guilty pass without needing to recompile or read LLVM's source code. That, in turn, leads to faster fixes of broken passes.

>>> Absolutely. And in the absence of a compiler that preserves the necessary information across phases, there is absolutely no point trying to deploy any of the good - or even moderately good - GC strategies.
>>
>> Could you summarize what information LLVM currently loses in practice, in what phases?
>
> No - again because I'm not current on the infrastructure. But consider the example I gave above. The practical approach is probably (a), but you need to make sure that "derived register" relationships are preserved by subsequent passes as they do (e.g.) CSE, partial evaluation, and so forth.
>
>> That's one reason why it's useful to have types in your front-end -- the address space as a whole may be heterogeneous, but types can show that subsets of the address space are entirely under the control of the language runtime. That guarantee is modulo intentional or unintentional twiddling from external sources, but if you don't have that guarantee, you're screwed on correctness, not just performance.
>
> That's one reason. Another is that type preservation serves as a very strong sanity check on the passes of the compiler.
>
> But I think you're making an assumption here that I do not share.
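For what it's worth, the "derived register" bookkeeping of option (a) above might be recorded along these lines. This is a toy sketch, not any real compiler's format; every name here is invented:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One register-map entry per live pointer at a GC point. */
struct map_entry {
    int slot;       /* register/stack slot holding the pointer      */
    int base_slot;  /* slot of the base it derives from, or -1 if   */
                    /* this entry is itself a base pointer          */
};

/* Fix up one frame's pointers after the collector decides to move an
 * object from old_base to new_base.  Derived pointers are rewritten
 * first, while their base slots still hold the old address. */
void relocate_frame(uintptr_t *slots, const struct map_entry *map,
                    size_t nentries,
                    uintptr_t old_base, uintptr_t new_base) {
    /* Pass 1: re-derive interior/derived pointers.  The offset may be
     * negative or past-the-end; modular uintptr_t arithmetic keeps it
     * consistent either way. */
    for (size_t i = 0; i < nentries; i++)
        if (map[i].base_slot >= 0 && slots[map[i].base_slot] == old_base)
            slots[map[i].slot] = new_base + (slots[map[i].slot] - old_base);

    /* Pass 2: rewrite the base pointers themselves. */
    for (size_t i = 0; i < nentries; i++)
        if (map[i].base_slot < 0 && slots[map[i].slot] == old_base)
            slots[map[i].slot] = new_base;
}
```

The point of the two passes is exactly the fragility being discussed: the base/derived relationship must survive every later pass, or the collector has nothing to re-derive from.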
> In the long term, I think we either need to be thinking in terms of moving C/C++/Fortran code into the GC world, or we need to be thinking in terms of automated conversion. The "arms length" approach doesn't scale well, and the state of the runtime heap is too fragile a thing to leave in the hands of programmers if you don't check what they did.

I don't know what the best long-term evolutionary approach is going to be. Automated conversion, or incremental language conversion, are both possibilities. But fault isolation might prove to be good enough in practice. The results from projects like SoftBound, in particular, are promising.

>> Custom GC effectively means control over custom allocators. Isn't one of the main points of BitC, and Cyclone, that fine-grained control over memory is important for performance?
>
> No. Fine-grained control over *layout*, yes. Fine-grained control over *allocation*, not in general.

Depends on what performance metric you have in mind. If you want to minimize GC pause times, having control over the overall structure of the heap is very important. If 90% of your heap is bulk binary data, it would be nice to guarantee that it won't be scanned or copied in a full GC cycle.

>>>> Compiler infrastructures take a long time to build, and are long-term investments. They involve core technical decisions that are *very* difficult to change (e.g. the IR design). For this reason, successful compiler infrastructures simply cannot afford to succumb to ... short-term thinking ...
>>
>> I think you're not giving the LLVM team enough credit for their capacity to evolve their infrastructure if they see the need.
>
> Actually, that's not it. Having run one commercial ground-up compiler group and worked closely with two others, I think that I have a pretty clear handle on the risks and costs here.
>
> Having said that, Chris is a very smart guy, so maybe he will find a way.
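On the pause-time point above: segregating pointer-free data so the collector never scans it is the idea behind, e.g., Boehm's GC_malloc_atomic. A toy sketch of the bookkeeping (all names invented; "scanning" here just counts bytes):

```c
#include <assert.h>
#include <stddef.h>

/* Blocks known to contain no pointers are registered as ATOMIC and
 * skipped entirely by the collector's scan loop. */
enum block_kind { BLOCK_TRACED, BLOCK_ATOMIC };

struct block {
    enum block_kind kind;
    size_t size;              /* payload size in bytes */
    struct block *next;
};

static struct block *heap_blocks = NULL;

/* Hypothetical registration hook a language runtime would call from
 * its allocator. */
void heap_register(struct block *b, enum block_kind kind, size_t size) {
    b->kind = kind;
    b->size = size;
    b->next = heap_blocks;
    heap_blocks = b;
}

/* A full "collection" here just reports how many bytes it had to
 * scan: bulk binary data contributes nothing, regardless of how much
 * of the heap it occupies. */
size_t heap_scan(void) {
    size_t scanned = 0;
    for (struct block *b = heap_blocks; b; b = b->next)
        if (b->kind == BLOCK_TRACED)
            scanned += b->size;
    return scanned;
}
```

If the type system can prove which allocations are pointer-free, this classification falls out for free; without it, the programmer has to assert it, which is exactly the fragility being argued about.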
Out of curiosity, would you have predicted that they could build a full C++ compiler, from scratch, in less than 4 years? I certainly wouldn't have...

>> Could you explicate what you perceive to be the fundamental design flaws, and how the LLVM IR needs to be changed?
>
> I can't even say if it's a matter for IR changes. But if they want to do GC'd languages well, they need a definition of what constitutes well-formed input/output for a pass, and what constraints must be maintained in the transformation process in order to preserve GC support.
>
>>> I find the LLVM infrastructure to be a disappointment from this perspective. Given how recently it was designed, and the clearly desperate need to deal with mixing managed and non-managed languages, the decision to omit core support for managed languages from the LLVM design strikes me as a bad design decision. Microsoft has definitely gotten this right; I think Apple has not. The decision to *implement* LLVM in a non-managed language also strikes me as ill chosen, though that decision makes more sense to me situationally.
>>
>> Let he who is without sin cast the first stone, eh? :-)
>
> Fair enough, but there is an important difference. The decision to implement LLVM in C++ was intended from the start to be permanent. The decision to implement BitC in C++ was intended from the start to be thrown away as fast as we possibly could.
>
> In fact, we only fell back to that approach after finding that we could not effectively implement in the more suitable languages that were only just emerging at the time. If there had been a viable C# implementation on Linux in 2004, I would have used it. This is why I noted that Chris's implementation choice makes sense to me situationally.
>
> And you'll note that we *never* attempted an implementation that relied on manual memory management.
>
> But the permanent vs. transitional distinction is quite real.
There's also the difference that you're building a specific compiler, while they're building a library for compiler writers. By building in C++ with C bindings, they maximize their potential client base. I've written small (1k - 10k) compilers, or fragments thereof, in C++, Java, OCaml, and Haskell. So far, Haskell has been the best experience. But asking the masses to switch For Their Own Good is, uh....

Even today, I don't see any language that the LLVM project could have used in lieu of C++ that would not have cost them potential clients. We would both like to see that situation changed, but it stands as-is.
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
