https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85585
Bug ID: 85585
Summary: switch to select a string based on an enum can profitably optimize away the table of pointers/offsets into fixed-length char[] blocks. Or use byte offsets into a string table
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---

Bug 84011 shows some really silly code-gen for PIC code, and discussion there suggested using a table of offsets instead of a table of actual pointers, so only one base address is needed.

A further optimization is possible when the strings are all of similar length, and/or the longest one isn't much longer than a pointer: pad all strings to the same length with trailing 0 bytes, and calculate a pointer instead of loading it from an array.  This removes the possibility of multiple entries sharing the same suffix (a missed optimization gcc wasn't doing anyway), but avoids needing any space for storing pointers in memory at all.

In the case discussed in bug 84011 (Linux's phy.h, const char *phy_modes(phy_interface_t interface)), the longest strings are 11 bytes (including the \0), and there are 23 of them.  So it takes 253 bytes of char data to store everything (not counting the "unknown" for the default: special case) with all strings padded to 11 bytes.

----

The current strings + pointer-table implementation doesn't merge string literals where one string is a suffix of another; this is another missed optimization that would save many bytes here.  (e.g. instead of .string "mii" and .string "gmii", just have .LC4: .byte 's'; .LC3: .byte 'g'; .LC2: .string "mii".)

That optimization plus byte or 16-bit offsets into the table would be nice and compact, and most CPUs have efficient zero-extending narrow loads.  So for cases where the padding optimization I'm suggesting isn't a good fit, that would probably be best.
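A minimal sketch of the suffix-merge test (hand-rolled helper, not GCC internals): one literal can reuse storage inside another when it is a suffix of it, terminator included.

```c
/* Sketch of suffix-merging string literals: if `shorter` is a suffix of
 * `longer` (so their terminators coincide), return a pointer into the tail
 * of `longer` where `shorter` begins; else return NULL.
 * `suffix_merge` is a hypothetical helper name for illustration. */
#include <stddef.h>
#include <string.h>

static const char *suffix_merge(const char *longer, const char *shorter)
{
    size_t ll = strlen(longer), ls = strlen(shorter);
    if (ls > ll)
        return NULL;
    const char *tail = longer + (ll - ls);  /* candidate shared storage */
    return strcmp(tail, shorter) == 0 ? tail : NULL;
}
```

e.g. suffix_merge("gmii", "mii") points at the "mii" inside "gmii", so the shorter literal needs no storage of its own.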
----

The current packed string data takes 158 bytes, so with 4-byte offsets it takes 158 + 23*4 = 250 bytes.  With 8-byte pointers/offsets, it takes 158 + 23*8 = 342 bytes.  With 1-byte offsets, 158 + 23*1 = 181 bytes: load with movzbl.  (If you can't use the offset directly as an 8-byte memory source operand for ADD to a pointer, there's no point making it 32 bits instead of 8.)

The code for *using* such a table is quite simple.  This C source compiles to what I'm suggesting: https://godbolt.org/g/E8J3iS

struct foo { char str[11]; } const table[23] = {};

const char *lookup(unsigned long idx) {
    if (idx >= 23) {   // 23 entries, so valid indices are 0..22
        return "unknown";  // or clamp idx to a padded "unknown" entry
    }
    return table[idx].str;
}

Multiply by 11 only takes 2 LEA instructions on x86, so for PIC code with a RIP-relative LEA we end up with 4 ALU instructions total to get a string address, after checking the if condition:

# gcc7.3 -march=haswell -O3 -fPIE output: https://godbolt.org/g/qMzaY8
        leaq    .LC0(%rip), %rax      # "unknown"
        cmpq    $22, %rdi
        ja      .L4                   # branchless is also an option
        leaq    (%rdi,%rdi,4), %rax
        leaq    table(%rip), %rdx     # RIP-relative table base address
        leaq    (%rdi,%rax,2), %rax
        addq    %rdx, %rax            # table + 11*idx
.L4:
        ret

This is even better in no-PIE mode, where a static address is usable as a signed 32-bit immediate:

lookup(unsigned long):
        movl    $.LC0, %eax
        cmpq    $22, %rdi
        ja      .L4
        leaq    (%rdi,%rdi,4), %rax
        leaq    table(%rdi,%rax,2), %rax  # 3 cycle latency for 3-component LEA on SnB-family
.L4:
        ret

So this has extremely low code-size cost on x86-64, for the benefit of removing a table load from the dependency chain from enum to string data.  It does cost significant data size vs. a byte-offset table with suffix merging, but it's better than what gcc does now in non-PIE (a table of qword pointers), and *much* better in PIE (an insane jump table).
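A runnable version of the padded-table transform, using an illustrative three-entry subset rather than the real 23-entry phy_modes() table (entry count and strings here are assumptions for the example; the 11-byte pad length matches the report):

```c
/* Padded-table lookup: every string occupies a fixed-size char[] block,
 * so the string's address is computed (base + MAXLEN*idx) instead of
 * being loaded from a pointer table.  Data is an illustrative subset. */
#include <string.h>

#define NMODES 3
#define MAXLEN 11   /* longest string incl. '\0', as in the 23-entry case */

static const struct foo { char str[MAXLEN]; } table[NMODES] = {
    { "mii" }, { "gmii" }, { "sgmii" },  /* remaining bytes zero-padded */
};

const char *lookup(unsigned long idx)
{
    if (idx >= NMODES)
        return "unknown";
    return table[idx].str;  /* address arithmetic only, no pointer load */
}
```

The cost is the zero padding (NMODES*MAXLEN bytes of data) in exchange for dropping the pointer/offset table and one level of indirection.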
-----

The byte-index version is equivalent to transforming the C source like this:

const char packedstrings[158] = {};
const unsigned char offsets[23] = {};

const char *lookup_byteidx(unsigned long idx) {
    if (idx >= 23)
        return "unknown";
    return &packedstrings[offsets[idx]];
}

        leaq    .LC0(%rip), %rax   # "unknown"
        cmpq    $22, %rdi
        ja      .L9
        leaq    offsets(%rip), %rax
        leaq    packedstrings(%rip), %rdx
        movzbl  (%rax,%rdi), %eax
        addq    %rdx, %rax
.L9:
        ret

We can save an instruction here by making the relative position of packedstrings and offsets a compile-time constant, i.e. by effectively putting them in a struct.  So:

        ...
        ja      .L9
        leaq    packedstrings(%rip), %rdx
        movzbl  offsets-packedstrings(%rdx,%rdi), %eax
        add     %rdx, %rax

base+idx + disp8 for a load isn't slower than base+idx on modern x86, unlike for LEA.  This lets us still use add instead of a 3-component LEA.  Arrange them so offsets-packedstrings fits in a signed 8-bit integer if possible, so a short displacement can be used (i.e. put the shorter array first, which is almost always offsets unless a lot of the strings are duplicated).

We can represent this idea in C source like this:

struct {
    unsigned char offsets[23];
    char packedstrings[158];
} const stringtab = {};

const char *lookup_stringtab(unsigned long idx) {
    if (idx >= 23)
        return "unknown";
    unsigned off = stringtab.offsets[idx];
    // TODO: make off relative to stringtab so we avoid a separate +23
    return &stringtab.packedstrings[off];
}

but unfortunately we get this asm with -O3 -march=haswell:

        leaq    stringtab(%rip), %rax
        movzbl  (%rax,%rdi), %edx
        leaq    23(%rax,%rdx), %rax   # 3-component LEA

We could optimize away the +23 by baking it into the byte offsets, if we don't need the full 0..255 range.
        leaq    stringtab(%rip), %rax
        movzbl  (%rax,%rdi), %edx   # offset relative to the start of stringtab
        add     %rdx, %rax

It's very nice in no-PIE mode, whether we tell gcc to keep the arrays together or not:

        movzbl  stringtab(%rdi), %eax
        addq    $packedstrings, %rax  # missed opt: addl is safe; the address is inside a static object and thus fits in 32 bits.

These compile fairly nicely for ARM32 and AArch64, even with -fPIE.  ARM32 could maybe save an instruction or two by doing a +440 once and using a 2-register addressing mode, though.  Or by choosing a smarter PIC anchor closer to the data.

I guess this could be broken up into multiple missed-optimization bug reports, some of which are independent enough to file separately:

1. Look for one string literal being a suffix of another (in general, regardless of tables).

2. Pad strings to a fixed length and calculate the address, when it's worth it, for integer->string mapping functions.  (Needs a heuristic to trade off code size vs. data size vs. removing a level of indirection for latency.)

3. Use byte or 16-bit offsets to the base of consecutive strings (unless literal merging between separate tables means we want 32-bit offsets).  Merging with literals that aren't part of a different table is fine; they can just reference a string that's part of a table.

4. Address one static array relative to another when they're defined in the same compilation unit, to save on RIP-relative LEA instructions.  ARM already does this, but MIPS64 and x86 PIC/PIE miss it.  (At least the old gcc5.4 on Godbolt does; I didn't check newer MIPS gcc.)  Especially try to put them next to each other so it can be a disp8 instead of a disp32 on x86.

5. For struct members, put the disp8 on a load so we can use add instead of a 3-component LEA, with -mtune=intel.  (If generating this internally for a string lookup, bake the offset into the stored bytes so neither is needed, if that still fits in byte offsets.)

https://godbolt.org/g/E8J3iS demonstrates all of these (same link as above).
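The "bake in the +offsets-array-size" idea from above can be written out as runnable C (illustrative three-entry data again): each stored byte offset is relative to the start of the whole struct rather than to packedstrings, so the lookup needs no extra displacement.

```c
/* Baked-in offsets: store each byte offset relative to &stringtab itself,
 * so the final address is just base + offsets[idx] (one movzbl + one add),
 * with no separate +sizeof(offsets) displacement.  Illustrative data. */
#include <string.h>

static const struct {
    unsigned char offsets[3];
    char packedstrings[6];
} stringtab = {
    /* offsets[] is 3 bytes, so packedstrings starts at struct offset 3;
     * bake that 3 into each entry */
    { 3 + 2, 3 + 1, 3 + 0 },   /* mii, gmii, sgmii (suffix-merged) */
    "sgmii",
};

const char *lookup_stringtab(unsigned long idx)
{
    if (idx >= 3)
        return "unknown";
    return (const char *)&stringtab + stringtab.offsets[idx];
}
```

This only works while the largest baked offset still fits in the byte (or 16-bit) offset type; both members have alignment 1 here, so there is no padding between them to account for.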