On Tue, Jul 16, 2013 at 2:32 PM, David Jeske <[email protected]> wrote:

> On Tue, Jul 16, 2013 at 12:50 PM, Jonathan S. Shapiro <[email protected]> wrote:
>
>> The app scenario that is problematic is applications that involve large
>> numbers of long-lived objects that (a) contain references and (b) those
>> references are mutated with a rate of change above some hard-to-specify
>> threshold [potentially creating garbage]. Actually, it's not the rate of
>> mutation per se. It's the rate at which mutations cause objects to become
>> orphaned, which is kind of hard to understand.
>>
>
> Long lived value-type objects do not need to contain pointers to be a
> problem, they merely need to be small and referenced by pointers.
>

That may be true in particular collectors; in fact, I'm quite certain it's
true for most current collectors. But it isn't inherent:

In a copying old-space collector, it would be utterly trivial to segregate
value-only data into a logically separate heap that was separate enough for
no cross-pollution of cache lines to occur.
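
To make that concrete, here is a minimal sketch of what I mean. All of the
names are invented for illustration; the point is only that the allocator
can route pointer-free objects into their own bump-allocated space, so the
tracing pass never looks at their cache lines:

    #include <cstddef>

    // One bump-pointer region of the old space. Region setup by the
    // runtime is elided here.
    struct Space {
        char* next;
        char* limit;
        void* alloc(std::size_t n) {
            if (next + n > limit) return nullptr;  // caller triggers GC
            void* p = next;
            next += n;
            return p;
        }
    };

    Space ref_space;   // objects containing object references; scanned
    Space leaf_space;  // pointer-free "value-only" objects; never scanned

    // The compiler knows statically whether a type contains references,
    // so it can route allocations with a per-type flag.
    void* gc_alloc(std::size_t n, bool contains_references) {
        return contains_references ? ref_space.alloc(n)
                                   : leaf_space.alloc(n);
    }

Leaf objects still get copied when they are reached, but they contribute
nothing to the pointer scan.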



> Also, keep in mind your earlier observation, that we're not strictly
> talking about pointer-to-value fraction, but the number of
> pointer-polluted-cachelines. i7 cacheline size is 64bytes. A single 8byte
> pointer in a mixed pointer/data struct will pollute another 56bytes of
> cacheline as far as pointer-tracing performance is concerned. While I'm
> sure there are some large-heap programs where <15% of cachelines are
> pointer polluted, I'm quite sure there are many others where >70% of
> cachelines are pointer-polluted.
>

Your intuition that cache lines matter is correct, but it doesn't
necessarily arise in the way that you believe. It depends a lot on how the
mark pass is designed, and the degree to which it is depth-first vs.
breadth-first. To make matters more confusing, the average object size
varies hugely between different language styles; languages that support
classes with inheritance tend toward larger objects.

Empirically, though, it's very rare to see objects that are data-mostly
with just one or two object references, and not at all hard to re-pack the
object reference fields at compile time to appease the collector.
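
To illustrate (the type is made up, but the arithmetic is your 64-byte-line
arithmetic), compare a 120-byte object whose two references straddle both
of its cache lines with the same object after the compiler re-packs the
references to the front:

    // As declared: each of the object's two cache lines contains a
    // reference, so both lines are "polluted" for tracing purposes.
    struct Record {
        double  a, b, c;              // line 0
        Record* next;                 // line 0: reference at offset 24
        double  d, e, f, g, h, i, j;  // spans into line 1
        Record* prev;                 // line 1: reference at offset 88
        double  k, l, m;
    };

    // Re-packed: both references land in line 0, and line 1 is pure
    // data that the tracer never needs to load.
    struct RecordRepacked {
        RecordRepacked* next;
        RecordRepacked* prev;
        double a, b, c, d, e, f, g, h, i, j, k, l, m;
    };

Same fields, same size; the polluted-line fraction drops from 2/2 to 1/2.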

The only part of your argument that this disputes is the practical
importance of cache lines with a small share of object references in them.
The rest of your analysis, I think, is pretty reasonable.



> Let's look at a simple application subsystem which I'm sure you will agree
> is *commonly used* in large programs, and which spans both of these
> conditions -- the simple LRU cache. An LRU cache with small value sizes
> will have a very large percentage of pointer-polluted-cachelines...
>

Why on earth would it have that? The LRU cache will have exactly the same
object reference density as the underlying object type. For small objects,
there will be one cache line containing an object reference that, with high
likelihood, will be the same cache line containing the object header (thus
a compulsory load regardless of object reference density). For larger
objects, we need to look at object layout rules for the runtime in greater
detail.
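
For concreteness, an illustrative small-object LRU node (not any particular
runtime's layout) might look like this:

    #include <cstdint>

    // 48 bytes: the header, all three references, and a small unboxed
    // payload share the one cache line the collector must load anyway
    // to read the header. The references add no extra line traffic.
    struct LruNode {
        void*         gc_header;  // object header: type info, mark bits
        LruNode*      newer;      // LRU recency links
        LruNode*      older;
        LruNode*      hash_next;  // hash-bucket chain
        std::uint64_t key;
        std::uint64_t value;      // small cached value, stored unboxed
    };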

And of course, none of this can even be an issue for languages that lack
explicit support for unboxed types containing references.

In practice, the more significant issue with LRU caches arises from
increased hardware cache collisions driven by the stride behavior that is
inherent in the LRU cache: same-sized nodes allocated at a regular stride
tend to land in the same cache sets.

But if your argument is (in part) that it's possible to build LRU
soft-caches badly, and that in practice people do, I'm sure I agree.


>> 3. Shift to hybrid management strategies, whether those are counted or
>> owned or linear or what have you.
>>
>
> We agree here -- AFAIK, non-world-stop typesafe collection for the
> simple LRU cache cannot be solved by regions, linear types, or
> static analysis. It requires either non-world-stop GC, or ARC (with or
> without GC-like cycle finding).
>

Or some form of hardware-layer concurrency and interlocking. But yes. If
we can do a non-world-stop GC for orphaned cycles in an ARC system, the
same technique could presumably be used for the heap generally, at which
point the ARC machinery looks redundant. That seems problematic.


> I'd personally like to see more production implementations of ARC +
> non-world-stop cycle finding -- because I think even given C4, there are
> certain applications/subsets where this approach is superior.
>

All of the data I've seen on ARC shows it to be dramatically slower than
any other technique, and especially so in concurrent scenarios. But I do
agree
that it's worth looking at harder.



> While I believe the LRU cache is a nice simple strawman to discuss the
> problem, many applications have very intermixed pointers and data -- and as
> a result exhibit very high polluted cacheline fraction. What data do you
> think would demonstrate to you that these programs exist? (not saying I
> have the time or interest to generate such data)
>

I think you're kicking a straw dog again. The problem is your assumption
that the object reference fields cannot be reordered by the compiler and/or
the runtime. Which they clearly can, absent an explicit layout constraint
(which is so rare that we can ignore it).


> One could probably do automated static analysis of open-source C/C++ type
> definitions across a few large open-source applications, which combined
> with a bit of type-frequency estimation, could provide us an estimated
> polluted cacheline percentage for those applications. We could probably
> even compute an estimated stop-the-world pause time given a specific heap
> size. (I'd sponsor such research)
>

Perhaps, and I agree that would be interesting to know. Further, given that
data we could conduct the thought experiment of re-ordering the fields
(which isn't legal in C/C++) and see whether, and by how much, that
actually helps.
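
The per-type kernel of such an estimate is simple; the hard part is the
C/C++ front end that extracts the layouts, which I will wave my hands at.
A sketch:

    #include <cstddef>
    #include <vector>

    // Fraction of a type's 64-byte cache lines containing at least one
    // pointer field, given the byte offsets of those fields.
    double polluted_fraction(std::size_t type_size,
                             const std::vector<std::size_t>& ptr_offsets) {
        const std::size_t kLine = 64;
        std::size_t lines = (type_size + kLine - 1) / kLine;
        std::vector<bool> polluted(lines, false);
        for (std::size_t off : ptr_offsets)
            polluted[off / kLine] = true;
        std::size_t hit = 0;
        for (bool p : polluted) hit += p;
        return static_cast<double>(hit) / static_cast<double>(lines);
    }

Weight each type's fraction by its estimated instance frequency and you get
the per-application number you're describing.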


> This makes me wonder if perhaps we have under-exploited schemes to
> automatically maximize the pointer-efficiency of
> pointer-polluted-cachelines. There might be un-obvious methods to try here.
> For example, a 64byte structure with only one 8 byte pointer could be
> duplicated into a "multi-bin" version of the same structure... basically
> 7 copies of the same structure merged, with the 7 pointers located in one
> cacheline...
>

This type of data structure reorganization was explored in some detail in
the IBM Research PL.8 compiler by Fran and Peter Allen and others. The
focus of attention, at the time, was cache access patterns more generally,
and specifically for floating point vector computation. But the techniques
they assembled could work equally well for what you propose. They would be
*much* simpler to implement in a safe language.

My guess is that the simpler expedient of moving object references to the
front of the object at compile time would get you most of the same benefit.
If you can manage to get an object's layout done in such a way that the
underlying cache lines get "colored" as containing or not containing object
references, you'd have done a pretty good job. That gets mildly tricky with
derived classes because of offset consistency constraints, but even so I
think it might be possible, and that the impact might be surprisingly
positive.
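
To show where the derived-class wrinkle comes from (types invented for
illustration):

    // For a single class, references-first is easy: line 0 carries the
    // references and the rest of the object can stay reference-free.
    struct Base {
        Base*  left;
        Base*  right;
        double payload[6];  // Base is exactly one 64-byte line
    };

    // A derived class can't move its own reference into the front pack
    // without shifting Base's fields, which would break every access
    // through a Base-typed reference. So the new reference lands
    // mid-object, in a cache line we wanted to keep "data colored".
    struct Derived : Base {
        Derived* extra;     // offset 64: pollutes line 1
        double   more[5];
    };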


> An interventionist, I see. I accept the validity of the strategy. I just
> don't trust anyone's ability to assign appropriate costs.
>

Fair enough, and these concerns make me nervous as well. Yet we know that
the outlay on bug bounties - completely discounting the actual costs of
penetration - is a measurable percentage of software product costs. We also
know that this cost is almost entirely incurred by C and C++ code. I see no
reason not to allocate the overall national cost of penetration in
proportion to the bug bounties paid, and tax software in the respective
source languages according to that proportion.

The nice part is that the tax doesn't have to be very big in order to have
a very large impact on the behavior of programmers and system development.


Jonathan
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
