http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36043
--- Comment #23 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Michael Matz from comment #8)
> FWIW, I think the error is in the caller of move_block_to_reg.
> move_block_to_reg can make use of a load_multiple instruction, which really
> loads full regs.  I.e. it would be unreasonable to require changes in
> move_block_to_reg to handle non-power-of-2 sizes.  Hence the caller
> (load_register_parameters) needs to handle this.  I'm not sure if the
> n_aligned_regs thingy could be misused for this, or if one simply should
> opencode the special case of the last register being partial.

That would be sth like

Index: gcc/calls.c
===================================================================
--- gcc/calls.c (revision 208124)
+++ gcc/calls.c (working copy)
@@ -1984,7 +1984,26 @@ load_register_parameters (struct arg_dat
 	      emit_move_insn (ri, x);
 	    }
 	  else
-	    move_block_to_reg (REGNO (reg), mem, nregs, args[i].mode);
+	    {
+	      if (size % UNITS_PER_WORD == 0
+		  || MEM_ALIGN (mem) % BITS_PER_WORD == 0)
+		move_block_to_reg (REGNO (reg), mem, nregs, args[i].mode);
+	      else
+		{
+		  if (nregs > 1)
+		    move_block_to_reg (REGNO (reg), mem,
+				       nregs - 1, args[i].mode);
+		  rtx dest = gen_rtx_REG (word_mode,
+					  REGNO (reg) + nregs - 1);
+		  rtx src = operand_subword_force (mem,
+						   nregs - 1, args[i].mode);
+		  rtx tem = extract_bit_field (src, size * BITS_PER_UNIT,
+					       0, 1, dest, word_mode,
+					       word_mode);
+		  if (tem != dest)
+		    convert_move (dest, tem, 1);
+		}
+	    }
 	}

   /* When a parameter is a block, and perhaps in other cases, it is

it's similar to what store_unaligned_arguments_into_pseudos would end up
doing but only for the last register (so it's probably easier to dispatch
to that and handle !STRICT_ALIGNMENT targets there).

Anyway, the generated code is of course "horrible".
foo:
.LFB0:
	.cfi_startproc
	movq	%rdi, %rcx
	movzwl	(%rdi), %edx
	movzwl	2(%rdi), %edi
	salq	$16, %rdi
	movq	%rdi, %rax
	movzwl	4(%rcx), %edi
	orq	%rdx, %rax
	salq	$32, %rdi
	orq	%rax, %rdi
	jmp	print_colour

For some reason extract_bit_field doesn't consider using a 4-byte load for
the first part.  With AVX one could also use a masked load (and thus
implement the extv/insv pattern family?  not sure if it is valid to reject
non-byte boundary variants).

But if we end up using extract_bit_field more and more it's worth
optimizing it further to avoid the above mess... (we end up using
extract_split_bit_field).