https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85585
Bug ID: 85585
Summary: switch to select a string based on an enum can profitably optimize away the table of pointers/offsets into fixed-length char[] blocks. Or use byte offsets into a string table
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---

Bug 84011 shows some really silly code-gen for PIC code, and discussion there suggested using a table of offsets instead of a table of actual pointers, so only one base address is needed.

A further optimization is possible when the strings are all of similar length, and/or the longest one isn't much longer than a pointer: pad all strings to the same length with trailing 0 bytes, and calculate a pointer instead of loading it from an array.  This removes the possibility of multiple entries sharing the same suffix (a missed optimization gcc wasn't doing anyway), but avoids needing any space for storing pointers in memory at all.

In the case discussed in bug 84011 (Linux's phy.h, const char *phy_modes(phy_interface_t interface)), the longest strings are 11 bytes (including the \0), and there are 23 of them.  So it takes 253 bytes of char data to store everything (not counting the "unknown" for the default: special case) with all strings padded to 11 bytes.

----

The current strings + pointer-table implementation doesn't merge string literals where one string is a suffix of another; this is another missed optimization that would save many bytes here.  (e.g. instead of .string "mii" and .string "gmii", just have .LC4: .byte 's'; .LC3: .byte 'g'; .LC2: .string "mii".)

That optimization plus byte or 16-bit offsets into the table would be nice and compact, and most CPUs have efficient zero-extending narrow loads.  So for cases where the padding optimization I'm suggesting isn't a good fit, that would probably be best.
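A minimal sketch of the suffix-merge test (hand-rolled helper, not GCC internals): one literal can reuse storage inside another when it is a suffix of it, terminator included.

```c
/* Sketch of suffix-merging string literals: if `shorter` is a suffix of
 * `longer` (so their terminators coincide), return a pointer into the tail
 * of `longer` where `shorter` begins; else return NULL.
 * `suffix_merge` is a hypothetical helper name for illustration. */
#include <stddef.h>
#include <string.h>

static const char *suffix_merge(const char *longer, const char *shorter)
{
    size_t ll = strlen(longer), ls = strlen(shorter);
    if (ls > ll)
        return NULL;
    const char *tail = longer + (ll - ls);  /* candidate shared storage */
    return strcmp(tail, shorter) == 0 ? tail : NULL;
}
```

e.g. suffix_merge("gmii", "mii") points at the "mii" inside "gmii", so the shorter literal needs no storage of its own.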
----

The current packed string data takes 158 bytes, so with 4-byte offsets it takes 158 + 23*4 = 250 bytes.  With 8-byte pointers/offsets, it takes 158 + 23*8 = 342 bytes.  With 1-byte offsets, 158 + 23*1 = 181 bytes: load with movzbl.  (If you can't use the offset directly as an 8-byte memory source operand for ADD to a pointer, there's no point making it 32 bits instead of 8.)

The code for *using* such a table is quite simple.  This C source compiles to what I'm suggesting: https://godbolt.org/g/E8J3iS

struct foo { char str[11]; } const table[23] = {};

const char *lookup(unsigned long idx) {
    if (idx >= 23) {   // 23 entries, so valid indices are 0..22
        return "unknown";  // or clamp idx to a padded "unknown" entry
    }
    return table[idx].str;
}

Multiply by 11 only takes 2 LEA instructions on x86, so for PIC code with a RIP-relative LEA we end up with 4 ALU instructions total to get a string address, after checking the if condition:

# gcc7.3 -march=haswell -O3 -fPIE output: https://godbolt.org/g/qMzaY8
        leaq    .LC0(%rip), %rax      # "unknown"
        cmpq    $22, %rdi
        ja      .L4                   # branchless is also an option
        leaq    (%rdi,%rdi,4), %rax
        leaq    table(%rip), %rdx     # RIP-relative table base address
        leaq    (%rdi,%rax,2), %rax
        addq    %rdx, %rax            # table + 11*idx
.L4:
        ret

This is even better in no-PIE mode, where a static address is usable as a signed 32-bit immediate:

lookup(unsigned long):
        movl    $.LC0, %eax
        cmpq    $22, %rdi
        ja      .L4
        leaq    (%rdi,%rdi,4), %rax
        leaq    table(%rdi,%rax,2), %rax  # 3 cycle latency for 3-component LEA on SnB-family
.L4:
        ret

So this has extremely low code-size cost on x86-64, for the benefit of removing a table load from the dependency chain from enum to string data.  It does cost significant data size vs. a byte-offset table with suffix merging, but it's better than what gcc does now in non-PIE (a table of qword pointers), and *much* better in PIE (an insane jump table).
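A runnable version of the padded-table transform, using an illustrative three-entry subset rather than the real 23-entry phy_modes() table (entry count and strings here are assumptions for the example; the 11-byte pad length matches the report):

```c
/* Padded-table lookup: every string occupies a fixed-size char[] block,
 * so the string's address is computed (base + MAXLEN*idx) instead of
 * being loaded from a pointer table.  Data is an illustrative subset. */
#include <string.h>

#define NMODES 3
#define MAXLEN 11   /* longest string incl. '\0', as in the 23-entry case */

static const struct foo { char str[MAXLEN]; } table[NMODES] = {
    { "mii" }, { "gmii" }, { "sgmii" },  /* remaining bytes zero-padded */
};

const char *lookup(unsigned long idx)
{
    if (idx >= NMODES)
        return "unknown";
    return table[idx].str;  /* address arithmetic only, no pointer load */
}
```

The cost is the zero padding (NMODES*MAXLEN bytes of data) in exchange for dropping the pointer/offset table and one level of indirection.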
-----

The byte-index version is equivalent to transforming the C source like this:

const char packedstrings[158] = {};
const unsigned char offsets[23] = {};

const char *lookup_byteidx(unsigned long idx) {
    if (idx >= 23)
        return "unknown";
    return &packedstrings[offsets[idx]];
}

        leaq    .LC0(%rip), %rax   # "unknown"
        cmpq    $22, %rdi
        ja      .L9
        leaq    offsets(%rip), %rax
        leaq    packedstrings(%rip), %rdx
        movzbl  (%rax,%rdi), %eax
        addq    %rdx, %rax
.L9:
        ret

We can save an instruction here by making the relative position of packedstrings and offsets a compile-time constant, i.e. by effectively putting them in a struct.  So:

        ...
        ja      .L9
        leaq    packedstrings(%rip), %rdx
        movzbl  offsets-packedstrings(%rdx,%rdi), %eax
        add     %rdx, %rax

base+idx + disp8 for a load isn't slower than base+idx on modern x86, unlike for LEA.  This lets us still use add instead of a 3-component LEA.  Arrange them so offsets-packedstrings fits in a signed 8-bit integer if possible, so a short displacement can be used (i.e. put the shorter array first, which is almost always offsets unless a lot of the strings are duplicated).

We can represent this idea in C source like this:

struct {
    unsigned char offsets[23];
    char packedstrings[158];
} const stringtab = {};

const char *lookup_stringtab(unsigned long idx) {
    if (idx >= 23)
        return "unknown";
    unsigned off = stringtab.offsets[idx];
    // TODO: make off relative to stringtab so we avoid a separate +23
    return &stringtab.packedstrings[off];
}

but unfortunately we get this asm with -O3 -march=haswell:

        leaq    stringtab(%rip), %rax
        movzbl  (%rax,%rdi), %edx
        leaq    23(%rax,%rdx), %rax   # 3-component LEA

We could optimize away the +23 by baking it into the byte offsets, if we don't need the full 0..255 range.
        leaq    stringtab(%rip), %rax
        movzbl  (%rax,%rdi), %edx   # offset relative to the start of stringtab
        add     %rdx, %rax

It's very nice in no-PIE mode, whether we tell gcc to keep the arrays together or not:

        movzbl  stringtab(%rdi), %eax
        addq    $packedstrings, %rax  # missed opt: addl is safe; the address is inside a static object and thus fits in 32 bits.

These compile fairly nicely for ARM32 and AArch64, even with -fPIE.  ARM32 could maybe save an instruction or two by doing a +440 once and using a 2-register addressing mode, though.  Or by choosing a smarter PIC anchor closer to the data.

I guess this could be broken up into multiple missed-optimization bug reports, some of which are independent enough to file separately:

1. Look for one string literal being a suffix of another (in general, regardless of tables).

2. Pad strings to a fixed length and calculate the address, when it's worth it, for integer->string mapping functions.  (Needs a heuristic to trade off code size vs. data size vs. removing a level of indirection for latency.)

3. Use byte or 16-bit offsets to the base of consecutive strings (unless literal merging between separate tables means we want 32-bit offsets).  Merging with literals that aren't part of a different table is fine; they can just reference a string that's part of a table.

4. Address one static array relative to another when they're defined in the same compilation unit, to save on RIP-relative LEA instructions.  ARM already does this, but MIPS64 and x86 PIC/PIE miss it.  (At least the old gcc5.4 on Godbolt does; I didn't check newer MIPS gcc.)  Especially try to put them next to each other so it can be a disp8 instead of a disp32 on x86.

5. For struct members, put the disp8 on a load so we can use add instead of a 3-component LEA, with -mtune=intel.  (If generating this internally for a string lookup, bake the offset into the stored bytes so neither is needed, if that still fits in byte offsets.)

https://godbolt.org/g/E8J3iS demonstrates all of these (same link as above).
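The "bake in the +offsets-array-size" idea from above can be written out as runnable C (illustrative three-entry data again): each stored byte offset is relative to the start of the whole struct rather than to packedstrings, so the lookup needs no extra displacement.

```c
/* Baked-in offsets: store each byte offset relative to &stringtab itself,
 * so the final address is just base + offsets[idx] (one movzbl + one add),
 * with no separate +sizeof(offsets) displacement.  Illustrative data. */
#include <string.h>

static const struct {
    unsigned char offsets[3];
    char packedstrings[6];
} stringtab = {
    /* offsets[] is 3 bytes, so packedstrings starts at struct offset 3;
     * bake that 3 into each entry */
    { 3 + 2, 3 + 1, 3 + 0 },   /* mii, gmii, sgmii (suffix-merged) */
    "sgmii",
};

const char *lookup_stringtab(unsigned long idx)
{
    if (idx >= 3)
        return "unknown";
    return (const char *)&stringtab + stringtab.offsets[idx];
}
```

This only works while the largest baked offset still fits in the byte (or 16-bit) offset type; both members have alignment 1 here, so there is no padding between them to account for.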