Keith Owens wrote:

On Mon, 24 Jan 2005 14:44:22 +0100, Christian Hildner <[EMAIL PROTECTED]> wrote:


Keith Owens wrote:


When jiffies is within 22 bit range of __gp, the linker writes the
sequence as

  addl r20=offset_of(jiffies,__gp),r1;;
  mov r16=r20;;
  ld8.acq r23=[r16]     // value of jiffies
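
The range condition above can be sketched in a few lines of Python. This is only an illustration, not the linker's actual code: the function names, the `@ltoff` spelling of the unrelaxed form, and the example addresses are assumptions. The one hard fact used is that addl takes a signed 22-bit immediate.

```python
# Illustrative sketch of the relaxation decision; all names are hypothetical.

IMM22_MIN = -(1 << 21)       # addl takes a signed 22-bit immediate
IMM22_MAX = (1 << 21) - 1


def fits_imm22(offset):
    """True if offset can be encoded directly in an addl immediate."""
    return IMM22_MIN <= offset <= IMM22_MAX


def emit_sequence(sym, sym_addr, gp):
    """Return the instruction sequence as text for a symbol at sym_addr."""
    if fits_imm22(sym_addr - gp):
        # Relaxed: the address is computed from gp, the ld8 becomes a mov.
        return [
            "addl r20=offset_of(%s,__gp),r1;;" % sym,
            "mov r16=r20;;",
            "ld8.acq r23=[r16]",
        ]
    # Unrelaxed: the address is loaded from its linkage table entry.
    return [
        "addl r20=@ltoff(%s),r1;;" % sym,
        "ld8 r16=[r20];;",
        "ld8.acq r23=[r16]",
    ]


if __name__ == "__main__":
    gp = 0x100000  # made-up value, for illustration only
    for line in emit_sequence("jiffies", gp + 0x1000, gp):
        print(line)
```

In both cases the sequence has three instructions; only the middle one changes between a load and a register move.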



Is there any restriction that prevents rewriting this to

  addl r16=offset_of(jiffies,__gp),r1;;
  ld8.acq r23=[r16]     // value of jiffies
  nop.i 0

That would save at least one cycle and would make bundling easier (depending on the surrounding instructions, of course).



The code snippet was a simplification of what gcc actually does. If
you look at some object code, you will find that the 3 instructions are
already spread over multiple bundles. Moving the final ld8 upwards
cannot save any cycles, you still have to execute the same number of
bundles.


But it is one instruction group fewer, and that corresponds to at least (here exactly) one cycle.

A real example from kernel/sched.o

4830: 09 50 20 42 00 21 [MMI] adds r10=8,r33
4832: LTOFF22X jiffies
4836: 20 81 84 00 42 c0 adds r18=16,r33
483c: 01 08 00 90 addl r14=0,r1;;
4840: 08 00 08 1e d8 19 [MMI] stf.spill [r15]=f2
4841: LDXMOV jiffies
4842: LTOFF22X __per_cpu_offset
4846: b0 00 38 30 20 40 ld8 r11=[r14]
484c: 03 08 00 90 addl r26=0,r1
4850: 08 a0 00 02 00 24 [MMI] addl r20=0,r1
4850: LTOFF22X .data.percpu+0x440
4856: 90 00 01 20 40 e0 shladd r9=r32,1,r0
485c: 02 00 59 00 sxt4 r23=r32
4860: 08 40 00 14 18 10 [MMI] ld8 r8=[r10]
4866: 10 01 48 30 20 e0 ld8 r17=[r18]
486c: 04 00 c4 00 mov r39=b0
4870: 05 00 00 00 01 40 [MLX] nop.m 0x0
4876: 10 00 00 00 00 60 movl r27=0x10624dd3;;
487c: 33 55 6c 62
4880: 10 00 00 00 01 00 [MIB] nop.m 0x0
4886: f0 40 e0 f0 29 00 shl r15=r8,7
488c: 00 00 00 20 nop.b 0x0
4890: 09 c0 00 34 18 10 [MMI] ld8 r24=[r26]
4890: LDXMOV __per_cpu_offset
4896: 30 00 2c 70 21 40 ld8.acq r3=[r11]


The LDXMOV relocation is designed to make it simple to convert the
instruction from ld8 r11=[r14] to mov r11=r14; that is easy to do in
place.

Ok, simplicity is an argument.

 Moving an entire slot around is a lot messier, for no
performance gain.
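
To illustrate why an in-place change is the simple case, here is a rough Python sketch that unpacks the 16-byte bundle at 4840 from the listing (5-bit template plus three 41-bit slots), replaces slot 1, and repacks it. The helper names are made up, and the replacement slot value is a placeholder, not the real encoding of mov r11=r14.

```python
# Sketch only: helper names are hypothetical, and PLACEHOLDER_SLOT is NOT
# a real instruction encoding.

SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1
SLOT_SHIFTS = (5, 46, 87)    # bit offsets of slots 0..2 after the template


def unpack(raw):
    """Split a 16-byte little-endian bundle into (template, [s0, s1, s2])."""
    word = int.from_bytes(raw, "little")
    template = word & 0x1F
    slots = [(word >> shift) & SLOT_MASK for shift in SLOT_SHIFTS]
    return template, slots


def pack(template, slots):
    """Inverse of unpack()."""
    word = template & 0x1F
    for shift, slot in zip(SLOT_SHIFTS, slots):
        word |= (slot & SLOT_MASK) << shift
    return word.to_bytes(16, "little")


def patch_slot(raw, index, new_slot):
    """Replace one slot; the template and the other two slots are untouched."""
    template, slots = unpack(raw)
    slots[index] = new_slot
    return pack(template, slots)


# Bytes of the bundle at 4840 as shown in the listing (4840, 4846, 484c).
BUNDLE_4840 = bytes.fromhex("0800081ed819" "b00038302040" "03080090")
PLACEHOLDER_SLOT = 0x1       # hypothetical stand-in for the mov encoding

if __name__ == "__main__":
    # The LDXMOV relocation at 4841 refers to slot 1 of this bundle.
    patched = patch_slot(BUNDLE_4840, 1, PLACEHOLDER_SLOT)
    print(unpack(BUNDLE_4840)[0] == unpack(patched)[0])
```

Only 41 bits of one bundle change. Dropping the instruction instead would mean re-bundling the neighbouring slots and possibly picking new templates.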

You still have one memory unit wasted on the mov, which is logically a nop. So, depending on the CPU implementation, there is a possible loss of one cycle, especially for memory-intensive code fragments or instruction groups. In the example, the instruction group containing the LDXMOV uses seven memory units. What if the CPU implements only six of them? But I see the complexity of changing that. It would require another optimizer step. A linker optimizer?
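
The count of seven can be reproduced from the listing with a small Python sketch. It reads the template from the first byte of each bundle from 4830 to 4870 and counts M slots up to each stop. The template table is deliberately partial (only the templates that occur in the listing), the function name is made up, and stops in the middle of a bundle are not handled.

```python
# Sketch only: partial template table, end-of-bundle stops only.

TEMPLATES = {
    0x04: "MLX", 0x05: "MLX",    # odd value = stop at end of bundle
    0x08: "MMI", 0x09: "MMI",
    0x10: "MIB", 0x11: "MIB",
}


def m_slots_per_group(first_bytes):
    """Return (M-slot counts of closed groups, count in the open group)."""
    groups = []
    count = 0
    for byte in first_bytes:
        template = byte & 0x1F
        count += TEMPLATES[template].count("M")
        if template & 1:         # stop bit: the instruction group ends here
            groups.append(count)
            count = 0
    return groups, count


# First byte of the bundles at 4830, 4840, 4850, 4860 and 4870.
FIRST_BYTES = [0x09, 0x08, 0x08, 0x08, 0x05]

if __name__ == "__main__":
    print(m_slots_per_group(FIRST_BYTES))
```

The first count covers only the part of the earlier group that is visible in the listing. The second group, from 4840 to the stop after movl, occupies seven M slots. That figure counts slots, not loads and stores: it includes the nop.m, plus addl and shladd, which are ALU instructions that happen to sit in M slots.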

Christian
