Re: [mpeg2-dev] [PATCH] motion_comp_arm_s.S is not PIC enough
Le 2012-07-17 23:08, John Reiser a écrit : On 07/17/2012 10:19 AM, Rafaël Carré wrote: Hello, Le 2012-07-16 00:09, John Reiser a écrit : Functions MC_put_o_16_arm, MC_put_o_8_arm, MC_put_x_16_arm, MC_put_x_8_arm in libmpeg2/motion_comp_arm_s.S have addresses in .text, which is bad for shared libraries. Some environments demand that .text actually be read-only all the time, yet MC_put_o_16_arm etc require that the addresses be modified by the dynamic linking mechanism (dlopen, LoadLibrary, etc.) Even in those environments which permit the dynamic linker to modify the .text segment, the runtime cost of doing the relocation can be noticeable. The attached patch rewrites the linkage, discarding the tables of addresses in favor of tables of offsets. All transfers are local within each individual function, so there can be no interference by processing that occurs after assembly, such as link-time re-ordering (even of individual functions.) -- John Reiser libmpeg2.patch Index: libmpeg2/motion_comp_arm_s.S === --- libmpeg2/motion_comp_arm_s.S(revision 1205) +++ libmpeg2/motion_comp_arm_s.S(working copy) @@ -29,9 +29,13 @@ pld [r1] stmfd sp!, {r4-r11, lr} @ R14 is also called LR and r4, r1, #3 - adr r5, MC_put_o_16_arm_align_jt - add r5, r5, r4, lsl #2 - ldr pc, [r5] + ldrb r4, [pc, r4] + add pc, pc, r4, lsl #2 Is this instruction available on all ARM variants? The add pc, pc, r4, lsl #2 has the same _form_ as the replaced add r5, r5, r4, lsl #2. The patched code will assemble correctly for all variants where the unpatched code will assemble correctly. In particular, all ARM CPU back to at least armv4 have both instructions in ARM mode. The code also executes correctly in ARM mode on armv4 and later. Using armv5tel I ran make check successfully against all the streams when the working directory was libmpeg2/trunk/test . The unpatched file motion_comp_arm_s.S uses stmfd sp!, {r4-r11, lr} @ R14 is also called LR which because of the use of 'r11' and 'lr' is ARM-only, not Thumb1, not Thumb2. Thus we don't need to consider _any_ Thumb variants for the patch. However, *just* for the sake of complete analysis: - begin analysis of Thumb modes; *NOT NEEDED* by patch The unpatched add r5, r5, r4, lsl #2 does not exist in Thumb (Thumb1), but does exist in Thumb2. It is not available on armv5t, but is available on all higher armv?t CPU because Thumb2 is very desirable and not too expensive (in any of chip area, power, licensing fees.) The remaining question is whether add pc, pc, r4, lsl #2 executes correctly in Thumb2. What value is read from register r15, as input to the 'add'? My reference is: ARM Architecture Reference Manual, ARM DDI 0100E, July 2000 Section 6.1 About the Thumb instruction set When R15 is read, bit[0] is zero and bits[31:1] contain the PC. When R15 is written, bit[0] is IGNORED andbits[31:1] are written to the PC. Depending on how it is used, the value of the PC is either the address of the instruction plus 4 or is UNPREDICTABLE. Because the Thumb sequence L99: mov lr, pc b.n foo L100: may be used to record a continuation address (it sets r14 to L100, which is (4 + L99)), then I believe that the value fetched from r15 is 4+(opcode ~1), and not UNPREDICTABLE. This also agrees with the exposed 3-stage pipelining of original ARM, where the value fetched from r15 is is two words (2*2 in Thumb mode, 2*4 in ARM mode) ahead. So, in Thumb2 mode I believe that the value fetched from r15 by add pc, pc, r4, lsl #2 is the address of the byte which immediately follows the 'add' instruction, namely byte 0 of the table. However, the address that the patch wants is the address of the 0: just _beyond_ the table. Therefore *IF* we want the same code to be correct for both ARM and Thumb2 at the same time, then we must use another register to handle the different value fetched from r15 by Thumb2 vs ARM: adr r5, 0f ldrb r4, [r5, r4] add pc, r5, r4, lsl #2 0: .byte (MC_put_o_16_arm_align0 - 0b)2 .byte (MC_put_o_16_arm_align1 - 0b)2 .byte (MC_put_o_16_arm_align2 - 0b)2 .byte (MC_put_o_16_arm_align3 - 0b)2 In Thumb2 mode only (not Thumb1, not ARM), the sequence adr r5, 0f tbb [r5, r4] # Table Branch Byte 0: .byte ... is equivalent, and shorter by 4 bytes and faster by two [?] cycles. - end analysis of Thumb modes; *NOT NEEDED* by patch I have to ask because I found some restrictions on: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0068b/BABDCBAB.html I see a background watermark Superseded on that page. Also, the document is for Thumb1 and not Thumb2. Therefore I believe that the document does not apply, because the unpatched add r5, r5,
Re: [mpeg2-dev] [PATCH] motion_comp_arm_s.S is not PIC enough
Hello, Le 2012-07-16 00:09, John Reiser a écrit : Functions MC_put_o_16_arm, MC_put_o_8_arm, MC_put_x_16_arm, MC_put_x_8_arm in libmpeg2/motion_comp_arm_s.S have addresses in .text, which is bad for shared libraries. Some environments demand that .text actually be read-only all the time, yet MC_put_o_16_arm etc require that the addresses be modified by the dynamic linking mechanism (dlopen, LoadLibrary, etc.) Even in those environments which permit the dynamic linker to modify the .text segment, the runtime cost of doing the relocation can be noticeable. The attached patch rewrites the linkage, discarding the tables of addresses in favor of tables of offsets. All transfers are local within each individual function, so there can be no interference by processing that occurs after assembly, such as link-time re-ordering (even of individual functions.) -- John Reiser libmpeg2.patch Index: libmpeg2/motion_comp_arm_s.S === --- libmpeg2/motion_comp_arm_s.S (revision 1205) +++ libmpeg2/motion_comp_arm_s.S (working copy) @@ -29,9 +29,13 @@ pld [r1] stmfd sp!, {r4-r11, lr} @ R14 is also called LR and r4, r1, #3 - adr r5, MC_put_o_16_arm_align_jt - add r5, r5, r4, lsl #2 - ldr pc, [r5] + ldrb r4, [pc, r4] + add pc, pc, r4, lsl #2 Is this instruction available on all ARM variants? I have to ask because I found some restrictions on: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0068b/BABDCBAB.html Although here it should be the form ADD Rd, Rn, #imm8m which works everywhere. + .byte (MC_put_o_16_arm_align0 - 0f)2 + .byte (MC_put_o_16_arm_align1 - 0f)2 + .byte (MC_put_o_16_arm_align2 - 0f)2 + .byte (MC_put_o_16_arm_align3 - 0f)2 +0: -- Live Security Virtual Conference Exclusive live event will cover all the ways today's security and threat landscape has changed and how IT managers can respond. Discussions will include endpoint security, mobile security and the latest in malware threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ ___ Libmpeg2-devel mailing list Libmpeg2-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libmpeg2-devel
Re: [mpeg2-dev] [PATCH] motion_comp_arm_s.S is not PIC enough
On 07/17/2012 10:19 AM, Rafaël Carré wrote: Hello, Le 2012-07-16 00:09, John Reiser a écrit : Functions MC_put_o_16_arm, MC_put_o_8_arm, MC_put_x_16_arm, MC_put_x_8_arm in libmpeg2/motion_comp_arm_s.S have addresses in .text, which is bad for shared libraries. Some environments demand that .text actually be read-only all the time, yet MC_put_o_16_arm etc require that the addresses be modified by the dynamic linking mechanism (dlopen, LoadLibrary, etc.) Even in those environments which permit the dynamic linker to modify the .text segment, the runtime cost of doing the relocation can be noticeable. The attached patch rewrites the linkage, discarding the tables of addresses in favor of tables of offsets. All transfers are local within each individual function, so there can be no interference by processing that occurs after assembly, such as link-time re-ordering (even of individual functions.) -- John Reiser libmpeg2.patch Index: libmpeg2/motion_comp_arm_s.S === --- libmpeg2/motion_comp_arm_s.S (revision 1205) +++ libmpeg2/motion_comp_arm_s.S (working copy) @@ -29,9 +29,13 @@ pld [r1] stmfd sp!, {r4-r11, lr} @ R14 is also called LR and r4, r1, #3 -adr r5, MC_put_o_16_arm_align_jt -add r5, r5, r4, lsl #2 -ldr pc, [r5] +ldrb r4, [pc, r4] +add pc, pc, r4, lsl #2 Is this instruction available on all ARM variants? The add pc, pc, r4, lsl #2 has the same _form_ as the replaced add r5, r5, r4, lsl #2. The patched code will assemble correctly for all variants where the unpatched code will assemble correctly. In particular, all ARM CPU back to at least armv4 have both instructions in ARM mode. The code also executes correctly in ARM mode on armv4 and later. Using armv5tel I ran make check successfully against all the streams when the working directory was libmpeg2/trunk/test . The unpatched file motion_comp_arm_s.S uses stmfd sp!, {r4-r11, lr} @ R14 is also called LR which because of the use of 'r11' and 'lr' is ARM-only, not Thumb1, not Thumb2. Thus we don't need to consider _any_ Thumb variants for the patch. However, *just* for the sake of complete analysis: - begin analysis of Thumb modes; *NOT NEEDED* by patch The unpatched add r5, r5, r4, lsl #2 does not exist in Thumb (Thumb1), but does exist in Thumb2. It is not available on armv5t, but is available on all higher armv?t CPU because Thumb2 is very desirable and not too expensive (in any of chip area, power, licensing fees.) The remaining question is whether add pc, pc, r4, lsl #2 executes correctly in Thumb2. What value is read from register r15, as input to the 'add'? My reference is: ARM Architecture Reference Manual, ARM DDI 0100E, July 2000 Section 6.1 About the Thumb instruction set When R15 is read, bit[0] is zero and bits[31:1] contain the PC. When R15 is written, bit[0] is IGNORED andbits[31:1] are written to the PC. Depending on how it is used, the value of the PC is either the address of the instruction plus 4 or is UNPREDICTABLE. Because the Thumb sequence L99: mov lr, pc b.n foo L100: may be used to record a continuation address (it sets r14 to L100, which is (4 + L99)), then I believe that the value fetched from r15 is 4+(opcode ~1), and not UNPREDICTABLE. This also agrees with the exposed 3-stage pipelining of original ARM, where the value fetched from r15 is is two words (2*2 in Thumb mode, 2*4 in ARM mode) ahead. So, in Thumb2 mode I believe that the value fetched from r15 by add pc, pc, r4, lsl #2 is the address of the byte which immediately follows the 'add' instruction, namely byte 0 of the table. However, the address that the patch wants is the address of the 0: just _beyond_ the table. Therefore *IF* we want the same code to be correct for both ARM and Thumb2 at the same time, then we must use another register to handle the different value fetched from r15 by Thumb2 vs ARM: adr r5, 0f ldrb r4, [r5, r4] add pc, r5, r4, lsl #2 0: .byte (MC_put_o_16_arm_align0 - 0b)2 .byte (MC_put_o_16_arm_align1 - 0b)2 .byte (MC_put_o_16_arm_align2 - 0b)2 .byte (MC_put_o_16_arm_align3 - 0b)2 In Thumb2 mode only (not Thumb1, not ARM), the sequence adr r5, 0f tbb [r5, r4] # Table Branch Byte 0: .byte ... is equivalent, and shorter by 4 bytes and faster by two [?] cycles. - end analysis of Thumb modes; *NOT NEEDED* by patch I have to ask because I found some restrictions on: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0068b/BABDCBAB.html I see a background watermark Superseded on that page. Also, the document is for Thumb1 and not Thumb2. Therefore I believe that the document does not apply, because the unpatched add r5, r5, r4, lsl #2 would require Thumb2. [Remember also that the unpatched code won't assemble for Thumb2 anyway