Re: [mpeg2-dev] [PATCH] motion_comp_arm_s.S is not PIC enough

2012-07-18 Thread Rafaël Carré
On 2012-07-17 23:08, John Reiser wrote:
 On 07/17/2012 10:19 AM, Rafaël Carré wrote:
 Hello,

 On 2012-07-16 00:09, John Reiser wrote:
 Functions MC_put_o_16_arm, MC_put_o_8_arm, MC_put_x_16_arm, MC_put_x_8_arm
 in libmpeg2/motion_comp_arm_s.S have addresses in .text, which is bad
 for shared libraries.  Some environments demand that .text actually be
 read-only all the time, yet MC_put_o_16_arm etc require that the addresses
 be modified by the dynamic linking mechanism (dlopen, LoadLibrary, etc.)
 Even in those environments which permit the dynamic linker to modify the
 .text segment, the runtime cost of doing the relocation can be noticeable.

 The attached patch rewrites the linkage, discarding the tables of addresses
 in favor of tables of offsets.  All transfers are local within each individual
 function, so there can be no interference by processing that occurs
 after assembly, such as link-time re-ordering (even of individual functions.)

 -- John Reiser


 libmpeg2.patch


 Index: libmpeg2/motion_comp_arm_s.S
 ===
 --- libmpeg2/motion_comp_arm_s.S (revision 1205)
 +++ libmpeg2/motion_comp_arm_s.S (working copy)
 @@ -29,9 +29,13 @@
 pld [r1]
  stmfd sp!, {r4-r11, lr} @ R14 is also called LR
 and r4, r1, #3
 -   adr r5, MC_put_o_16_arm_align_jt
 -   add r5, r5, r4, lsl #2
 -   ldr pc, [r5]
 +   ldrb r4, [pc, r4]
 +   add pc, pc, r4, lsl #2

 Is this instruction available on all ARM variants?
 
 The add pc, pc, r4, lsl #2 has the same _form_ as the replaced
 add r5, r5, r4, lsl #2.
 The patched code will assemble correctly for all variants where the
 unpatched code will assemble correctly.
 In particular, all ARM CPU back to at least armv4 have both instructions
 in ARM mode.  The code also executes correctly in ARM mode on armv4 and later.
 Using armv5tel I ran make check successfully against all the streams
 when the working directory was  libmpeg2/trunk/test .
 
 The unpatched file motion_comp_arm_s.S uses
 stmfd sp!, {r4-r11, lr} @ R14 is also called LR
 which because of the use of 'r11' and 'lr' is ARM-only, not Thumb1, not Thumb2.
 Thus we don't need to consider _any_ Thumb variants for the patch.
 
 However, *just* for the sake of complete analysis:
 - begin analysis of Thumb modes; *NOT NEEDED* by patch
 The unpatched add r5, r5, r4, lsl #2 does not exist in Thumb (Thumb1),
 but does exist in Thumb2.  It is not available on armv5t, but is available
 on all higher armv?t CPU because Thumb2 is very desirable and not
 too expensive (in any of chip area, power, licensing fees.)
 
 The remaining question is whether add pc, pc, r4, lsl #2 executes correctly
 in Thumb2.  What value is read from register r15, as input to the 'add'?
 My reference is:
 
   ARM Architecture Reference Manual, ARM DDI 0100E, July 2000
   Section 6.1 About the Thumb instruction set
 
   When R15 is read, bit[0] is zero and bits[31:1] contain the PC. When R15
   is written, bit[0] is IGNORED and bits[31:1] are written to the PC.
   Depending on how it is used, the value of the PC is either the address
   of the instruction plus 4 or is UNPREDICTABLE.
 
 Because the Thumb sequence
   L99:
  mov lr, pc
  b.n  foo
   L100:
 may be used to record a continuation address (it sets r14 to L100, which is
 (4 + L99)), then I believe that the value fetched from r15 is 4+(&opcode & ~1),
 and not UNPREDICTABLE.
 This also agrees with the exposed 3-stage pipelining of original ARM, where
 the value fetched from r15 is two words (2*2 in Thumb mode, 2*4 in ARM
 mode) ahead.
 
 So, in Thumb2 mode I believe that the value fetched from r15 by
 add pc, pc, r4, lsl #2 is the address of the byte which immediately follows
 the 'add' instruction, namely byte 0 of the table.  However, the address that
 the patch wants is the address of the 0: just _beyond_ the table.  Therefore
 *IF* we want the same code to be correct for both ARM and Thumb2 at the same time,
 then we must use another register to handle the different value fetched from r15
 by Thumb2 vs ARM:
 
 adr r5, 0f
 ldrb r4, [r5, r4]
 add pc, r5, r4, lsl #2
 0:
 .byte (MC_put_o_16_arm_align0 - 0b)>>2
 .byte (MC_put_o_16_arm_align1 - 0b)>>2
 .byte (MC_put_o_16_arm_align2 - 0b)>>2
 .byte (MC_put_o_16_arm_align3 - 0b)>>2
 
 In Thumb2 mode only (not Thumb1, not ARM), the sequence
   adr r5, 0f
   tbb [r5, r4]  # Table Branch Byte
 0:
   .byte ...
 is equivalent, and shorter by 4 bytes and faster by two [?] cycles.
 - end analysis of Thumb modes; *NOT NEEDED* by patch
 
 

 I have to ask because I found some restrictions on:
 http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0068b/BABDCBAB.html
 
 I see a background watermark Superseded on that page.  Also, the document is
 for Thumb1 and not Thumb2.  Therefore I believe that the document does not apply,
 because the unpatched add r5, r5, 

Re: [mpeg2-dev] [PATCH] motion_comp_arm_s.S is not PIC enough

2012-07-17 Thread Rafaël Carré
Hello,

On 2012-07-16 00:09, John Reiser wrote:
 Functions MC_put_o_16_arm, MC_put_o_8_arm, MC_put_x_16_arm, MC_put_x_8_arm
 in libmpeg2/motion_comp_arm_s.S have addresses in .text, which is bad
 for shared libraries.  Some environments demand that .text actually be
 read-only all the time, yet MC_put_o_16_arm etc require that the addresses
 be modified by the dynamic linking mechanism (dlopen, LoadLibrary, etc.)
 Even in those environments which permit the dynamic linker to modify the
 .text segment, the runtime cost of doing the relocation can be noticeable.
 
 The attached patch rewrites the linkage, discarding the tables of addresses
 in favor of tables of offsets.  All transfers are local within each individual
 function, so there can be no interference by processing that occurs
 after assembly, such as link-time re-ordering (even of individual functions.)
 
 -- John Reiser
 
 
 libmpeg2.patch
 
 
 Index: libmpeg2/motion_comp_arm_s.S
 ===
 --- libmpeg2/motion_comp_arm_s.S  (revision 1205)
 +++ libmpeg2/motion_comp_arm_s.S  (working copy)
 @@ -29,9 +29,13 @@
   pld [r1]
  stmfd sp!, {r4-r11, lr} @ R14 is also called LR
   and r4, r1, #3
 - adr r5, MC_put_o_16_arm_align_jt
 - add r5, r5, r4, lsl #2
 - ldr pc, [r5]
 + ldrb r4, [pc, r4]
 + add pc, pc, r4, lsl #2

Is this instruction available on all ARM variants?

I have to ask because I found some restrictions on:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0068b/BABDCBAB.html

Although here it should be the form ADD Rd, Rn, #imm8m which works
everywhere.

 + .byte (MC_put_o_16_arm_align0 - 0f)>>2
 + .byte (MC_put_o_16_arm_align1 - 0f)>>2
 + .byte (MC_put_o_16_arm_align2 - 0f)>>2
 + .byte (MC_put_o_16_arm_align3 - 0f)>>2
 +0:


___
Libmpeg2-devel mailing list
Libmpeg2-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/libmpeg2-devel


Re: [mpeg2-dev] [PATCH] motion_comp_arm_s.S is not PIC enough

2012-07-17 Thread John Reiser
On 07/17/2012 10:19 AM, Rafaël Carré wrote:
 Hello,
 
 On 2012-07-16 00:09, John Reiser wrote:
 Functions MC_put_o_16_arm, MC_put_o_8_arm, MC_put_x_16_arm, MC_put_x_8_arm
 in libmpeg2/motion_comp_arm_s.S have addresses in .text, which is bad
 for shared libraries.  Some environments demand that .text actually be
 read-only all the time, yet MC_put_o_16_arm etc require that the addresses
 be modified by the dynamic linking mechanism (dlopen, LoadLibrary, etc.)
 Even in those environments which permit the dynamic linker to modify the
 .text segment, the runtime cost of doing the relocation can be noticeable.

 The attached patch rewrites the linkage, discarding the tables of addresses
 in favor of tables of offsets.  All transfers are local within each individual
 function, so there can be no interference by processing that occurs
 after assembly, such as link-time re-ordering (even of individual functions.)

 -- John Reiser


 libmpeg2.patch


 Index: libmpeg2/motion_comp_arm_s.S
 ===
 --- libmpeg2/motion_comp_arm_s.S (revision 1205)
 +++ libmpeg2/motion_comp_arm_s.S (working copy)
 @@ -29,9 +29,13 @@
  pld [r1]
  stmfd sp!, {r4-r11, lr} @ R14 is also called LR
  and r4, r1, #3
 -adr r5, MC_put_o_16_arm_align_jt
 -add r5, r5, r4, lsl #2
 -ldr pc, [r5]
 +ldrb r4, [pc, r4]
 +add pc, pc, r4, lsl #2
 
 Is this instruction available on all ARM variants?

The add pc, pc, r4, lsl #2 has the same _form_ as the replaced
add r5, r5, r4, lsl #2.
The patched code will assemble correctly for all variants where the
unpatched code will assemble correctly.
In particular, all ARM CPU back to at least armv4 have both instructions
in ARM mode.  The code also executes correctly in ARM mode on armv4 and later.
Using armv5tel I ran make check successfully against all the streams
when the working directory was  libmpeg2/trunk/test .

The unpatched file motion_comp_arm_s.S uses
stmfd sp!, {r4-r11, lr} @ R14 is also called LR
which because of the use of 'r11' and 'lr' is ARM-only, not Thumb1, not Thumb2.
Thus we don't need to consider _any_ Thumb variants for the patch.

However, *just* for the sake of complete analysis:
- begin analysis of Thumb modes; *NOT NEEDED* by patch
The unpatched add r5, r5, r4, lsl #2 does not exist in Thumb (Thumb1),
but does exist in Thumb2.  It is not available on armv5t, but is available
on all higher armv?t CPU because Thumb2 is very desirable and not
too expensive (in any of chip area, power, licensing fees.)

The remaining question is whether add pc, pc, r4, lsl #2 executes correctly
in Thumb2.  What value is read from register r15, as input to the 'add'?
My reference is:

  ARM Architecture Reference Manual, ARM DDI 0100E, July 2000
  Section 6.1 About the Thumb instruction set

  When R15 is read, bit[0] is zero and bits[31:1] contain the PC. When R15
  is written, bit[0] is IGNORED and bits[31:1] are written to the PC.
  Depending on how it is used, the value of the PC is either the address
  of the instruction plus 4 or is UNPREDICTABLE.

Because the Thumb sequence
  L99:
 mov lr, pc
 b.n  foo
  L100:
may be used to record a continuation address (it sets r14 to L100, which is
(4 + L99)), then I believe that the value fetched from r15 is 4+(&opcode & ~1),
and not UNPREDICTABLE.
This also agrees with the exposed 3-stage pipelining of original ARM, where
the value fetched from r15 is two words (2*2 in Thumb mode, 2*4 in ARM
mode) ahead.
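
As a rough illustration of the rule quoted above (a sketch in Python rather than assembly, so the arithmetic can be checked on any machine), the value read from r15 can be modelled as the instruction address with bit 0 cleared, plus 4 in Thumb state or 8 in ARM state.  The helper name pc_read and the address 0x1000 are made up for this example; cases the manual calls UNPREDICTABLE are not modelled.

```python
def pc_read(insn_addr: int, thumb: bool) -> int:
    """Value seen when an instruction reads r15, under the rule quoted above.

    Assumption: bit 0 of the address is cleared, then 4 is added in Thumb
    state or 8 in ARM state (two instructions ahead in either state).
    """
    return (insn_addr & ~1) + (4 if thumb else 8)


# Hypothetical address for L99.  In Thumb state, "mov lr, pc" at L99 is
# followed by a 2-byte "b.n", so lr ends up pointing at L100 = L99 + 4.
L99 = 0x1000
print(hex(pc_read(L99, thumb=True)))   # 0x1004
print(hex(pc_read(L99, thumb=False)))  # 0x1008
```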

So, in Thumb2 mode I believe that the value fetched from r15 by
add pc, pc, r4, lsl #2 is the address of the byte which immediately follows
the 'add' instruction, namely byte 0 of the table.  However, the address that
the patch wants is the address of the 0: just _beyond_ the table.  Therefore
*IF* we want the same code to be correct for both ARM and Thumb2 at the same time,
then we must use another register to handle the different value fetched from r15
by Thumb2 vs ARM:

adr r5, 0f
ldrb r4, [r5, r4]
add pc, r5, r4, lsl #2
0:
.byte (MC_put_o_16_arm_align0 - 0b)>>2
.byte (MC_put_o_16_arm_align1 - 0b)>>2
.byte (MC_put_o_16_arm_align2 - 0b)>>2
.byte (MC_put_o_16_arm_align3 - 0b)>>2
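
The offset arithmetic for both table layouts can be checked with a small model.  This is only a sketch in Python that follows the reasoning above.  All addresses and target values are invented for the example, and the "add + 4" read in Thumb2 is the assumption argued for above, not something tested on hardware.

```python
def build_table(targets, base):
    """Byte entries as the assembler would compute them: (target - base) >> 2."""
    entries = []
    for target in targets:
        offset = target - base
        # Each entry must be word-aligned and fit in one unsigned byte.
        assert offset % 4 == 0 and 0 <= (offset >> 2) <= 0xFF
        entries.append(offset >> 2)
    return entries


def jump_targets(base_value, table):
    """Address written to pc by 'add pc, <base>, r4, lsl #2' for each entry."""
    return [base_value + (entry << 2) for entry in table]


# Hypothetical layout: 'add' at 0x1004, 4-byte table at 0x1008, code after it.
ADD_ADDR, TABLE_ADDR, AFTER_TABLE = 0x1004, 0x1008, 0x100C
TARGETS = [0x1010, 0x1040, 0x1080, 0x10C0]  # made-up align0..align3 addresses

# Patch as posted: entries are relative to the 0: label just beyond the table.
patch_table = build_table(TARGETS, AFTER_TABLE)
arm = jump_targets(ADD_ADDR + 8, patch_table)    # ARM state: pc reads as add + 8
thumb = jump_targets(ADD_ADDR + 4, patch_table)  # assumed Thumb2 read: add + 4

# r5 variant: entries are relative to the 0: label at the start of the table.
r5_table = build_table(TARGETS, TABLE_ADDR)
via_r5 = jump_targets(TABLE_ADDR, r5_table)

print(arm == TARGETS)                            # True
print([t - x for t, x in zip(TARGETS, thumb)])   # [4, 4, 4, 4]
print(via_r5 == TARGETS)                         # True
```

In this model the posted patch lands on every target in ARM state, falls 4 bytes short under the assumed Thumb2 read, and the r5 variant does not depend on the value read from pc at all.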

In Thumb2 mode only (not Thumb1, not ARM), the sequence
adr r5, 0f
tbb [r5, r4]  # Table Branch Byte
0:
.byte ...
is equivalent, and shorter by 4 bytes and faster by two [?] cycles.
- end analysis of Thumb modes; *NOT NEEDED* by patch


 
 I have to ask because I found some restrictions on:
 http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0068b/BABDCBAB.html

I see a background watermark Superseded on that page.  Also, the document is
for Thumb1 and not Thumb2.  Therefore I believe that the document does not apply,
because the unpatched add r5, r5, r4, lsl #2 would require Thumb2.  [Remember
also that the unpatched code won't assemble for Thumb2 anyway