Re: optimize __gp location

Christian Hildner Mon, 24 Jan 2005 23:32:03 -0800

Keith Owens wrote:

On Mon, 24 Jan 2005 14:44:22 +0100, Christian Hildner <[EMAIL PROTECTED]> wrote:
Keith Owens wrote:
When jiffies is within 22 bit range of __gp, the linker writes the
sequence as
  addl r20=offset_of(jiffies,__gp),r1;;
  mov r16=r20;;
  ld8.acq r23=[r16]     // value of jiffies
Is there a restriction that prevents rewriting it to
  addl r16=offset_of(jiffies,__gp),r1;;
  ld8.acq r23=[r16]     // value of jiffies
  nop.i 0
because that would save at least one cycle and would make bundling easier (depending on the additional instructions, of course)?
The code snippet was a simplification of what gcc actually does. If you look at some object code, you will find that the 3 instructions are already spread over multiple bundles. Moving the final ld8 upwards cannot save any cycles, you still have to execute the same number of bundles.

But it is one instruction group fewer, and that corresponds to at least (here exactly) one cycle.
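
The instruction-group argument can be checked by counting stops (`;;`) in the two sequences quoted above. The following is only a minimal sketch: `count_instruction_groups` is a hypothetical helper, it looks at stops alone, and it is not a cycle-accurate model of any particular CPU.

```python
def count_instruction_groups(lines):
    """Count instruction groups in a list of IA-64 assembly lines.

    A line ending in ';;' carries a stop, which closes the current group.
    Instructions after the last stop form one more (still open) group.
    """
    groups = 0
    open_group = False
    for line in lines:
        text = line.split("//")[0].strip()  # drop trailing comments
        if not text:
            continue
        open_group = True
        if text.endswith(";;"):
            groups += 1
            open_group = False
    return groups + (1 if open_group else 0)


# The sequence the linker currently produces (as quoted in the thread).
three_insn = [
    "addl r20=offset_of(jiffies,__gp),r1;;",
    "mov r16=r20;;",
    "ld8.acq r23=[r16]     // value of jiffies",
]

# The proposed rewrite with the mov folded away.
two_insn = [
    "addl r16=offset_of(jiffies,__gp),r1;;",
    "ld8.acq r23=[r16]     // value of jiffies",
    "nop.i 0",
]

print(count_instruction_groups(three_insn))
print(count_instruction_groups(two_insn))
```

Under these assumptions the first sequence spans three instruction groups and the second spans two, which is the one-group difference referred to above.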

A real example from kernel/sched.o
  4830: 09 50 20 42 00 21  [MMI]  adds r10=8,r33
            4832: LTOFF22X jiffies
  4836: 20 81 84 00 42 c0         adds r18=16,r33
  483c: 01 08 00 90               addl r14=0,r1;;
  4840: 08 00 08 1e d8 19  [MMI]  stf.spill [r15]=f2
            4841: LDXMOV jiffies
            4842: LTOFF22X __per_cpu_offset
  4846: b0 00 38 30 20 40         ld8 r11=[r14]
  484c: 03 08 00 90               addl r26=0,r1
  4850: 08 a0 00 02 00 24  [MMI]  addl r20=0,r1
            4850: LTOFF22X .data.percpu+0x440
  4856: 90 00 01 20 40 e0         shladd r9=r32,1,r0
  485c: 02 00 59 00               sxt4 r23=r32
  4860: 08 40 00 14 18 10  [MMI]  ld8 r8=[r10]
  4866: 10 01 48 30 20 e0         ld8 r17=[r18]
  486c: 04 00 c4 00               mov r39=b0
  4870: 05 00 00 00 01 40  [MLX]  nop.m 0x0
  4876: 10 00 00 00 00 60         movl r27=0x10624dd3;;
  487c: 33 55 6c 62
  4880: 10 00 00 00 01 00  [MIB]  nop.m 0x0
  4886: f0 40 e0 f0 29 00         shl r15=r8,7
  488c: 00 00 00 20               nop.b 0x0
  4890: 09 c0 00 34 18 10  [MMI]  ld8 r24=[r26]
            4890: LDXMOV __per_cpu_offset
  4896: 30 00 2c 70 21 40         ld8.acq r3=[r11]
The LDXMOV relocation is designed to make it simple to convert the
instruction from ld8 r11=[r14] to mov r11=r14, which is easy to do in
place.
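
The in-place conversion described here can be sketched roughly as follows. This is an illustration only, not the linker's actual code: `fits_imm22` and `relax_pair` are hypothetical names, the instruction strings are plain text placeholders, and the sketch assumes the "22 bit range" means the signed 22-bit immediate of addl.

```python
IMM22_MIN = -(1 << 21)       # lowest value of a signed 22-bit immediate
IMM22_MAX = (1 << 21) - 1    # highest value of a signed 22-bit immediate


def fits_imm22(offset):
    """True if offset can be encoded as a signed 22-bit immediate."""
    return IMM22_MIN <= offset <= IMM22_MAX


def relax_pair(sym_addr, gp_addr, addr_reg="r14", dest_reg="r11"):
    """Return the (LTOFF22X insn, LDXMOV insn) pair as text.

    Hypothetical helper: if the symbol is within range of __gp, the addl
    gets the gp-relative offset and the ld8 becomes a mov in the same slot;
    otherwise both instructions stay as the compiler emitted them.
    """
    offset = sym_addr - gp_addr
    if fits_imm22(offset):
        return (f"addl {addr_reg}={offset},r1",
                f"mov {dest_reg}={addr_reg}")
    return (f"addl {addr_reg}=<GOT slot offset>,r1",
            f"ld8 {dest_reg}=[{addr_reg}]")


# Made-up addresses, for illustration only.
gp = 0x100000
print(relax_pair(gp + 0x1000, gp))      # within range: mov
print(relax_pair(gp + (1 << 21), gp))   # out of range: ld8 is kept
```

The point of the sketch is that each relocation touches exactly one instruction slot and no slot has to move, which is what makes the conversion simple.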

Ok, simplicity is an argument.

Moving an entire slot around is a lot messier, for no
performance gain.

You still have one memory unit wasted on the mov, which is logically a nop. So, depending on the CPU implementation, one cycle may be lost, especially in memory-intensive code fragments or instruction groups. In the example, the instruction group containing the LDXMOV uses seven memory units. What if the CPU implements only six of them? But I see the complexity of changing that: it would require another optimizer step. A linker optimizer?
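
The count of seven can be reproduced from the bundle templates in the listing. The sketch below assumes the instruction group runs from the bundle at 4840 up to the stop after movl at 4876, and that every M slot (including nop.m) occupies a memory unit. Both are simplifying assumptions, and `min_cycles` is a hypothetical lower bound that models M-unit pressure only.

```python
import math

# Bundle templates of the instruction group that holds the LDXMOV,
# read off the listing: bundles at 4840, 4850, 4860 and 4870.
GROUP_TEMPLATES = ["MMI", "MMI", "MMI", "MLX"]


def count_m_slots(templates):
    """Count M slots across the bundles of one instruction group."""
    return sum(t.count("M") for t in templates)


def min_cycles(m_slots, m_units):
    """Simplified lower bound on cycles, looking at M units only."""
    return math.ceil(m_slots / m_units)


used = count_m_slots(GROUP_TEMPLATES)
print(used)                       # M slots in the group as linked
print(min_cycles(used, 6))        # hypothetical CPU with six M units
print(min_cycles(used - 1, 6))    # same group if the mov slot were freed
```

With six memory units the group as linked needs at least two cycles in this model, and one cycle if the slot holding the mov could be dropped.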

Christian
