On Fri, 2003-06-06 at 15:12, Dan Sugalski wrote:
> Our options, as I see them, are:
> 
> 1) Make the I registers 64 bits
> 2) Make some way to gang together I registers to make 64 bit things
> 3) Have I registers switchable between 32 and 64 bit somehow
> 4) Have separate 32 and 64 bit I registers
> 5) Do guaranteed 64 bit math in PMCs
> 
> The first is just out. It's an unreasonable slowdown on 32 bit (and 
> some 64 bit) machines, for no overall win. The majority of integers 
> will be smallish, and most of even the 32 bit range will be wasted.

I don't necessarily agree that this option is gone.  IREGs are basically
used for one of two things: doing non-PMC integer math, and passing
things to and from Parrot's guts.  (The latter is just a store and a
load; I think passing them throughout Parrot is where the problem is.)
That leaves non-PMC integer math, which just doesn't sound like a whole
lot.  (But then again, I'm assuming that most math will be PMC-based, in
order to handle int->num->str->big type conversions.  If we want to
minimize PMC math, then perhaps this is a bigger deal.)

You know, there was a day when we'd just write some code and benchmark
it to see *how* much slower it is....

No, no, no.  Don't get up.  I'll do it.  :-)

I glued together most of the IREG-based arithmetic pasm files, removed
the prints, and wrapped an iterator around the lot.

Athlon 1 GHz, Linux 2.4.20.  Identical Parrot configurations, save the
size of INTVALs.

long long INTVALs: 4.98u @ 54%
long INTVALs     : 4.31u @ 54%

Difference, .67u @ 54%, or about 15%.  (With the JIT, long long INTVALs
were *much* faster, but only because they cheated and dumped core.)

So what percentage of a program is using the IREGs for math?  10%?  5%?
2%?  That's a 1.5% to .3% overall slowdown.  Keep those numbers in
mind.

> 
> I don't like option 2, since it means that we speed-penalize 64 bit 
> systems, which seems foolish.

See below.

> 
> Option 3 wastes half the L1 cache space that I registers takes up. 
> Fluffy caches--ick. Plus validating the bytecode will be... 
> interesting, even at runtime.

See below.

> 
> 4 isn't that bad. Not great, as it's more registers, and something of 
> a waste on 64 bit systems, but...

See below.

> 
> #5 is something of a cop-out, but I'm not quite sure how much.

See below.

> 
>  From what I can think, we need guaranteed 64 bit integers for file 
> offsets, JVM & .NET support, and some fairly special-purpose math 
> stuff. I'd tend to discount the special-purpose math stuff--that's 
> not our target. JVM and .NET don't do much 64 bit stuff, but they do 
> some. The file offset parts are in some ways the least of it, though 
> we do need to have some internal support for 64 bits to get integer 
> values out of PMCs without loss.

See below.  Oh, wait.  This *is* below.  Okay, see here.

Let's back up a step.  When it comes to integers, there are two types -
no pun intended - of languages.  Those that care, and those that don't.

Sized integer math has two intertwined properties: dynamic range and
mathematical semantics.  (Dynamic range says that 8 bits can hold 8
bits' worth of stuff, whether interpreted as signed, unsigned, or
normalized (like the exponents in IEEE floating point representations),
and as either numbers or bits.  Mathematical semantics are what make

    (int32_t)(int8_t)((int8_t)0x66 + (int8_t)0x66) == (int32_t)0xffffffcc

rather than 0x000000cc - the sum has to be narrowed back to 8 bits,
where it wraps, before it's widened again.)

Although there will be cases where a typed language doesn't really care
how large the range or the nature of the mathematical semantics for a
given type, there will be times that it does.  So we've either got to
provide, somehow, all types, or provide one type that emulates the
semantics of all types.

Untyped languages simply don't care what they get underneath, as long as
they work.  Except, of course, when they're trying to tie into a typed
language.  (Pass a 16-bit int from Java to Perl, do some stuff, and pass
it back, for instance.)

Hardware handles this with different ops, of course, although compilers
cheat where they can (or have to).  For Parrot, however, that means
multiplying the number of ops by 4 or 5.  (Multiple IREG ops would still
be a common multiple and not an exponent, as you'd promote both integers
to the same size.)  I think we're op-heavy, already, and Parrot would
then have to track integer sizes.  (Although for untyped languages,
that'd be easy, as they'd all be one size.)  Plus, you'd have to map
those onto the common set of IREGs.  Or create 4 or 5 more.  (And then
decide how you handle things like integer promotion.)

Of course, you could continue to handle this with one op, albeit smart
enough to handle the semantics of whatever size math you're doing.  That
way, you'd only be doing the slow, 64-bit math when you absolutely
needed to.

The problem is, of course, those numbers up top I told you to remember. 
Writing that smart op is going to cost you far more than a mere 1.5%. 
You've slowed everything down to speed up one case, which, by the way,
didn't speed up because you're jumping through such hoops to avoid it.

Even the JIT may not handle this efficiently.  Certainly, at everything
less than native size, it's normally a trivial tweak or two:

.L2:
        movb    $4, -1(%ebp)
        movb    $10, -2(%ebp)
        movb    -2(%ebp), %al
        addb    -1(%ebp), %al
        movb    %al, -3(%ebp)
.L3:
        movl    $1434, -8(%ebp)
        movl    $345344, -12(%ebp)
        movl    -12(%ebp), %eax
        addl    -8(%ebp), %eax
        movl    %eax, -16(%ebp)

But once you have to loosen your belt, your code blows up to:

.L4:
        movl    $24234234, -24(%ebp)
        movl    $0, -20(%ebp)
        movl    $42342342, -32(%ebp)
        movl    $0, -28(%ebp)
        movl    -32(%ebp), %eax
        movl    -28(%ebp), %edx
        addl    -24(%ebp), %eax
        adcl    -20(%ebp), %edx
        movl    %eax, -40(%ebp)
        movl    %edx, -36(%ebp)

So that brings us back to one big, flat space.  Either the ideal system
width, which will run faster, or the largest width possible.  If we
choose the ideal system width, it may be too small to support typed
languages, or the occasional system metric which requires it.  (Like
64-bit file offsets, which Dan ever-so-kindly reminded me of.)  If we
choose the largest width possible, we slow things down, but we can
mostly support everything.  

I say mostly, because there's no telling how typed languages will feel
about being run atop a unitype system, regardless of the size of that
one type.  Of course, the languages should feel free to either create
their own PMCs that map to those types, or create the ops that their
compiler would generate to handle the mathematical semantics of those
types within Parrot's unitype:

    inline op add_8 (out INT, in INT, in INT) {
        $1 = (INTVAL)(int8_t)((int8_t)$2 + (int8_t)$3);
        goto NEXT();
    }

    inline op add_16 (out INT, in INT, in INT) {
        $1 = (INTVAL)(int16_t)((int16_t)$2 + (int16_t)$3);
        goto NEXT();
    }

That puts the impetus on each language to track its own types, but all
types map correctly in, out, and between languages.  (And at the cost of
only one more instruction.)  And it only affects those in need.

But then we're back to where we started, with these big INTVALs running
amok needlessly throughout Parrot.  We certainly want to minimize their
usage, and, practically speaking, their usage is language-level math.

After all, C is a typed language, and if we're going to interface with C
(or, more accurately, Parrot's internals and the underlying system,
which are written in C), then we can do it like above.

Luckily for us, Parrot's internals (at any given time) are pretty well
fixed, which means if an op - say, print - needs to pass a file number,
that number will always be an int.

     |---- LANGUAGE LAYER ----|----- INTERPRETER LAYER -----|

     program <-> registers   
             <->    ops      <-> parrot internals <-> system

     |---- ARBITRARY SIZES ---|------- SYSTEM SIZES --------|

The boundary between op code and parrot internals is also the boundary
between where arbitrary numbers are needed for language support, and
useless for the system.

So let's convert when we cross that boundary.

1) We gain a performance boost in Parrot's internals, in both faster and
smaller code.
2) We suffer a slight penalty in IREG math.  (But we don't suffer a
larger penalty trying to avoid it.)
3) We keep Parrot simple, and, well.... KISS.
4) We push the complexity and the decisions of integer types to the
specific languages to implement as they see fit - PMC, op, or don't
really care - while providing a common type to convert through, and
without tying them to one all-encompassing model.
5) The coding rules are simple: ops are built on INTVALs, Parrot
internals are not.[1]

How far away are we?

For Parrot internals, it's largely a substitution job.  Find the right
type for the job, and fix the code.  The ops need explicit casting
added.  The biggest problem is probably the JIT, because mandating
64-bit support means a long long on x86, which doesn't JIT right now.
But, overall, that's not that far off.

Thoughts?


[1] Of course, you know there *has* to be an exception.  Currently,
Parrot internally provides some direct support routines explicitly for
INTVALs, namely stringification as part of the various *printf routines.
I consider those type of routines more of an "op support library" than
Parrot internals.  (Functionally, although certainly not lexically, as
it currently stands.)  

-- 
Bryan C. Warnock
bwarnock@(gtemail.net|raba.com)
