Re: [FFmpeg-devel] [PATCH 1/4] lavc/aarch64: new optimization for 8-bit hevc_epel_uni_v

2023-09-26 Thread Martin Storsjö

Hi,

Thanks, this looks mostly ok now.

There were a few minor issues left that I can fix up before pushing. There 
were a number of cases with register restoring like this:


ldr x30, [sp]
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
add sp, sp, #48

Here we should fold the sp update ino the load as well, like this:

ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48

In a few cases, this wasn't possible, due to the location of the register 
that is being restored, like this:


ldr x30, [sp, #56]
add sp, sp, #64

For the most idiomatic aarch64 assembly, I think it would be good to 
restructure it to keep x30 at the bottom of this area, to allow using the 
same pattern for restores here too.


I pushed it with these changes. A later patch to restructure the register 
saves to avoid the separate "add sp" would be appreciated though. I quite 
certainly doesn't matter from a real-world performance perspective, but it 
would make the code more idiomatic.


// Martin




On Sat, 23 Sep 2023, Logan.Lyu wrote:


Hi, Martin,

Thanks for your review.

Thanks for the patches. Functionally, they seem to work, and the issues i 
saw in the code are relatively minor. Unfortunately, some of the issues are 
issues that we've been through in many earlier patches, so I would hope 
that you would pay attention to them in the future before posting more 
patches.
Okay, I have noticed the previous issues and made some modifications 
according to the issues, And I have completed the modifications based on your 
comments.


If there are any missing issues that have not been corrected, please let me 
know.




在 2023/9/17 5:46, Martin Storsjö 写道:

On Thu, 14 Sep 2023, Logan.Lyu wrote:


Hi Martin,

You can try the attached patchset. If that doesn't work, My code branch 
address is https://github.com/myais2023/FFmpeg/tree/hevc-aarch64


Thanks for the patches. Functionally, they seem to work, and the issues i 
saw in the code are relatively minor. Unfortunately, some of the issues are 
issues that we've been through in many earlier patches, so I would hope 
that you would pay attention to them in the future before posting more 
patches.



In patch 1, you've got a bunch of sxtw instructions for src/dst stride 
parameters that have the type ptrdiff_t - that shouldn't be necessary?


In patch 2, you're moving the macros calc_epelh, calc_epelh2, 
load_epel_filterh - can you split out the move into a separate commit? 
(This isn't strictly necessary but would make things even clearer.)


In patch 2, you're storing below the stack, then decrementing it afterwards 
- e.g. like this:



+    stp x0, x30, [sp, #-16]
+    stp x1, x2, [sp, #-32]
+    stp x3, x4, [sp, #-48]
+    stp x5, x6, [sp, #-64]!


Please change that so that you're first predecrementing the whole area, 
then storing the other elements above that stack pointer, e.g. like this:


stp x0, x30, [sp, #-64]!
stp x1, x2, [sp, #16]
stp x3, x4, [sp, #32]

etc.

The same issue also appears in variouos places within functions like this:


+    stp x0, x1, [sp, #-16]
+    stp x4, x6, [sp, #-32]
+    stp xzr, x30, [sp, #-48]!


Please fix all of these cases - you can search through your patches for 
anything related to storing on the stack. Also, storing xzr here seems 
superfluous - if you've got an odd number of registers to store, just make 
one instruction str instead of stp (but keep the stack aligned).


Then in patch 4, you've got yet another pattern for doing these stores, 
where you have superfluous consecutive stack decrements like this:



+    stp x6, x30, [sp, #-16]!
+    mov x7, #16
+    stp x0, x1, [sp, #-16]!
+    stp x2, x3, [sp, #-16]!
+    stp x4, x5, [sp, #-16]!


Please just do one stack decrement covering all the stack space you need.

I believe these issues have been raised in earlier reviews as well.

// Martin



___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".


Re: [FFmpeg-devel] [PATCH 1/4] lavc/aarch64: new optimization for 8-bit hevc_epel_uni_v

2023-09-22 Thread Logan.Lyu

Hi, Martin,

Thanks for your review.

Thanks for the patches. Functionally, they seem to work, and the 
issues i saw in the code are relatively minor. Unfortunately, some of 
the issues are issues that we've been through in many earlier patches, 
so I would hope that you would pay attention to them in the future 
before posting more patches.
Okay, I have noticed the previous issues and made some modifications 
according to the issues, And I have completed the modifications based on 
your comments.


If there are any missing issues that have not been corrected, please let 
me know.




在 2023/9/17 5:46, Martin Storsjö 写道:

On Thu, 14 Sep 2023, Logan.Lyu wrote:


Hi Martin,

You can try the attached patchset. If that doesn't work, My code 
branch address is https://github.com/myais2023/FFmpeg/tree/hevc-aarch64


Thanks for the patches. Functionally, they seem to work, and the 
issues i saw in the code are relatively minor. Unfortunately, some of 
the issues are issues that we've been through in many earlier patches, 
so I would hope that you would pay attention to them in the future 
before posting more patches.



In patch 1, you've got a bunch of sxtw instructions for src/dst stride 
parameters that have the type ptrdiff_t - that shouldn't be necessary?


In patch 2, you're moving the macros calc_epelh, calc_epelh2, 
load_epel_filterh - can you split out the move into a separate commit? 
(This isn't strictly necessary but would make things even clearer.)


In patch 2, you're storing below the stack, then decrementing it 
afterwards - e.g. like this:



+    stp x0, x30, [sp, #-16]
+    stp x1, x2, [sp, #-32]
+    stp x3, x4, [sp, #-48]
+    stp x5, x6, [sp, #-64]!


Please change that so that you're first predecrementing the whole 
area, then storing the other elements above that stack pointer, e.g. 
like this:


stp x0, x30, [sp, #-64]!
stp x1, x2, [sp, #16]
stp x3, x4, [sp, #32]

etc.

The same issue also appears in variouos places within functions like 
this:



+    stp x0, x1, [sp, #-16]
+    stp x4, x6, [sp, #-32]
+    stp xzr, x30, [sp, #-48]!


Please fix all of these cases - you can search through your patches 
for anything related to storing on the stack. Also, storing xzr here 
seems superfluous - if you've got an odd number of registers to store, 
just make one instruction str instead of stp (but keep the stack 
aligned).


Then in patch 4, you've got yet another pattern for doing these 
stores, where you have superfluous consecutive stack decrements like 
this:



+    stp x6, x30, [sp, #-16]!
+    mov x7, #16
+    stp x0, x1, [sp, #-16]!
+    stp x2, x3, [sp, #-16]!
+    stp x4, x5, [sp, #-16]!


Please just do one stack decrement covering all the stack space you need.

I believe these issues have been raised in earlier reviews as well.

// Martin
From 62a59aa1fb7bc684ca0c216fd039dd0f231ad0c0 Mon Sep 17 00:00:00 2001
From: Logan Lyu 
Date: Tue, 15 Aug 2023 16:42:25 +0800
Subject: [PATCH 04/10] lavc/aarch64: new optimization for 8-bit
 hevc_qpel_uni_v

checkasm bench:
put_hevc_qpel_uni_v4_8_c: 146.2
put_hevc_qpel_uni_v4_8_neon: 43.2
put_hevc_qpel_uni_v6_8_c: 303.9
put_hevc_qpel_uni_v6_8_neon: 69.7
put_hevc_qpel_uni_v8_8_c: 495.2
put_hevc_qpel_uni_v8_8_neon: 74.7
put_hevc_qpel_uni_v12_8_c: 1100.9
put_hevc_qpel_uni_v12_8_neon: 222.4
put_hevc_qpel_uni_v16_8_c: 1955.2
put_hevc_qpel_uni_v16_8_neon: 269.2
put_hevc_qpel_uni_v24_8_c: 4571.9
put_hevc_qpel_uni_v24_8_neon: 832.4
put_hevc_qpel_uni_v32_8_c: 8226.4
put_hevc_qpel_uni_v32_8_neon: 1035.7
put_hevc_qpel_uni_v48_8_c: 18324.2
put_hevc_qpel_uni_v48_8_neon: 2321.2
put_hevc_qpel_uni_v64_8_c: 37659.4
put_hevc_qpel_uni_v64_8_neon: 4122.2

Co-Authored-By: J. Dekker 
---
 libavcodec/aarch64/hevcdsp_init_aarch64.c |   5 +
 libavcodec/aarch64/hevcdsp_qpel_neon.S| 221 ++
 2 files changed, 226 insertions(+)

diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c 
b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index d78954f440..51d212ff72 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -192,6 +192,10 @@ NEON8_FNPROTO(qpel_h, (int16_t *dst,
 const uint8_t *_src, ptrdiff_t _srcstride,
 int height, intptr_t mx, intptr_t my, int width), _i8mm);
 
+NEON8_FNPROTO(qpel_uni_v, (uint8_t *dst,  ptrdiff_t dststride,
+const uint8_t *src, ptrdiff_t srcstride,
+int height, intptr_t mx, intptr_t my, int width),);
+
 NEON8_FNPROTO(qpel_uni_w_h, (uint8_t *_dst,  ptrdiff_t _dststride,
 const uint8_t *_src, ptrdiff_t _srcstride,
 int height, int denom, int wx, int ox,
@@ -295,6 +299,7 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, 
const int bit_depth)
 NEON8_FNASSIGN(c->put_hevc_epel_uni, 0, 0, pel_uni_pixels,);
 

Re: [FFmpeg-devel] [PATCH 1/4] lavc/aarch64: new optimization for 8-bit hevc_epel_uni_v

2023-09-16 Thread Martin Storsjö

On Thu, 14 Sep 2023, Logan.Lyu wrote:


Hi Martin,

You can try the attached patchset. If that doesn't work, My code branch 
address is https://github.com/myais2023/FFmpeg/tree/hevc-aarch64


Thanks for the patches. Functionally, they seem to work, and the issues i 
saw in the code are relatively minor. Unfortunately, some of the issues 
are issues that we've been through in many earlier patches, so I would 
hope that you would pay attention to them in the future before posting 
more patches.



In patch 1, you've got a bunch of sxtw instructions for src/dst stride 
parameters that have the type ptrdiff_t - that shouldn't be necessary?


In patch 2, you're moving the macros calc_epelh, calc_epelh2, 
load_epel_filterh - can you split out the move into a separate commit? 
(This isn't strictly necessary but would make things even clearer.)


In patch 2, you're storing below the stack, then decrementing it 
afterwards - e.g. like this:



+stp x0, x30, [sp, #-16]
+stp x1, x2, [sp, #-32]
+stp x3, x4, [sp, #-48]
+stp x5, x6, [sp, #-64]!


Please change that so that you're first predecrementing the whole area, 
then storing the other elements above that stack pointer, e.g. like this:


stp x0, x30, [sp, #-64]!
stp x1, x2, [sp, #16]
stp x3, x4, [sp, #32]

etc.

The same issue also appears in variouos places within functions like this:


+stp x0, x1, [sp, #-16]
+stp x4, x6, [sp, #-32]
+stp xzr, x30, [sp, #-48]!


Please fix all of these cases - you can search through your patches for 
anything related to storing on the stack. Also, storing xzr here seems 
superfluous - if you've got an odd number of registers to store, just make 
one instruction str instead of stp (but keep the stack aligned).


Then in patch 4, you've got yet another pattern for doing these stores, 
where you have superfluous consecutive stack decrements like this:



+stp x6, x30, [sp, #-16]!
+mov x7, #16
+stp x0, x1, [sp, #-16]!
+stp x2, x3, [sp, #-16]!
+stp x4, x5, [sp, #-16]!


Please just do one stack decrement covering all the stack space you need.

I believe these issues have been raised in earlier reviews as well.

// Martin

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".


Re: [FFmpeg-devel] [PATCH 1/4] lavc/aarch64: new optimization for 8-bit hevc_epel_uni_v

2023-09-13 Thread Logan.Lyu

Hi Martin,

You can try the attached patchset. If that doesn't work, My code branch 
address is https://github.com/myais2023/FFmpeg/tree/hevc-aarch64


Please try it again.

Thanks


在 2023/9/12 19:48, Martin Storsjö 写道:

Hi,

Sorry for not tending to your patches sooner.

Unfortunately, this patchset is impossible to apply - there seems to 
be garbled whitespace in the patch which would require me to manually 
apply all the changes.


Can you try sending the patches again in a way that doesn't corrupt 
whitespace? If not, can you push the branch somewhere where I can 
fetch it?


// Martin

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".From 022535be4fc50e807870e5e8d1f6449f466d061d Mon Sep 17 00:00:00 2001
From: Logan Lyu 
Date: Tue, 15 Aug 2023 15:24:32 +0800
Subject: [PATCH 1/9] lavc/aarch64: new optimization for 8-bit hevc_epel_uni_v

checkasm bench:
put_hevc_epel_uni_hv64_8_i8mm: 6568.7
put_hevc_epel_uni_v4_8_c: 88.7
put_hevc_epel_uni_v4_8_neon: 32.7
put_hevc_epel_uni_v6_8_c: 185.4
put_hevc_epel_uni_v6_8_neon: 44.9
put_hevc_epel_uni_v8_8_c: 333.9
put_hevc_epel_uni_v8_8_neon: 44.4
put_hevc_epel_uni_v12_8_c: 728.7
put_hevc_epel_uni_v12_8_neon: 119.7
put_hevc_epel_uni_v16_8_c: 1224.2
put_hevc_epel_uni_v16_8_neon: 139.7
put_hevc_epel_uni_v24_8_c: 2531.2
put_hevc_epel_uni_v24_8_neon: 329.9
put_hevc_epel_uni_v32_8_c: 4739.9
put_hevc_epel_uni_v32_8_neon: 562.7
put_hevc_epel_uni_v48_8_c: 10618.7
put_hevc_epel_uni_v48_8_neon: 1256.2
put_hevc_epel_uni_v64_8_c: 19169.9
put_hevc_epel_uni_v64_8_neon: 2179.2

Co-Authored-By: J. Dekker 
---
 libavcodec/aarch64/hevcdsp_epel_neon.S| 320 ++
 libavcodec/aarch64/hevcdsp_init_aarch64.c |   5 +
 2 files changed, 325 insertions(+)

diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S 
b/libavcodec/aarch64/hevcdsp_epel_neon.S
index a8d694639b..7ce7eec829 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -32,6 +32,326 @@ const epel_filters, align=4
 .byte -2, 10, 58, -2
 endconst
 
+.macro load_epel_filterb freg, xreg
+movrel  \xreg, epel_filters
+add \xreg, \xreg, \freg, lsl #2
+ld4r{v0.16b, v1.16b, v2.16b, v3.16b}, [\xreg] // filter
+neg v0.16b, v0.16b
+neg v3.16b, v3.16b
+.endm
+
+.macro calc_epelb dst, src0, src1, src2, src3
+umlsl   \dst\().8h, \src0\().8b, v0.8b
+umlal   \dst\().8h, \src1\().8b, v1.8b
+umlal   \dst\().8h, \src2\().8b, v2.8b
+umlsl   \dst\().8h, \src3\().8b, v3.8b
+.endm
+
+.macro calc_epelb2 dst, src0, src1, src2, src3
+umlsl2  \dst\().8h, \src0\().16b, v0.16b
+umlal2  \dst\().8h, \src1\().16b, v1.16b
+umlal2  \dst\().8h, \src2\().16b, v2.16b
+umlsl2  \dst\().8h, \src3\().16b, v3.16b
+.endm
+
+.macro calc_all4
+calcv16, v17, v18, v19
+b.eq2f
+calcv17, v18, v19, v16
+b.eq2f
+calcv18, v19, v16, v17
+b.eq2f
+calcv19, v16, v17, v18
+b.ne1b
+.endm
+
+.macro calc_all8
+calcv16, v17, v18, v19, v20, v21, v22, v23
+b.eq2f
+calcv18, v19, v20, v21, v22, v23, v16, v17
+b.eq2f
+calcv20, v21, v22, v23, v16, v17, v18, v19
+b.eq2f
+calcv22, v23, v16, v17, v18, v19, v20, v21
+b.ne1b
+.endm
+
+.macro calc_all12
+calcv16, v17, v18, v19, v20, v21, v22, v23, v24, v25, v26, 
v27
+b.eq2f
+calcv19, v20, v21, v22, v23, v24, v25, v26, v27, v16, v17, 
v18
+b.eq2f
+calcv22, v23, v24, v25, v26, v27, v16, v17, v18, v19, v20, 
v21
+b.eq2f
+calcv25, v26, v27, v16, v17, v18, v19, v20, v21, v22, v23, 
v24
+b.ne1b
+.endm
+
+.macro calc_all16
+calcv16, v17, v18, v19, v20, v21, v22, v23, v24, v25, v26, 
v27, v28, v29, v30, v31
+b.eq2f
+calcv20, v21, v22, v23, v24, v25, v26, v27, v28, v29, v30, 
v31, v16, v17, v18, v19
+b.eq2f
+calcv24, v25, v26, v27, v28, v29, v30, v31, v16, v17, v18, 
v19, v20, v21, v22, v23
+b.eq2f
+calcv28, v29, v30, v31, v16, v17, v18, v19, v20, v21, v22, 
v23, v24, v25, v26, v27
+b.ne1b
+.endm
+
+function ff_hevc_put_hevc_epel_uni_v4_8_neon, export=1
+load_epel_filterb x6, x5
+sxtwx3, w3
+sxtw

Re: [FFmpeg-devel] [PATCH 1/4] lavc/aarch64: new optimization for 8-bit hevc_epel_uni_v

2023-09-12 Thread Martin Storsjö

Hi,

Sorry for not tending to your patches sooner.

Unfortunately, this patchset is impossible to apply - there seems to be 
garbled whitespace in the patch which would require me to manually apply 
all the changes.


Can you try sending the patches again in a way that doesn't corrupt 
whitespace? If not, can you push the branch somewhere where I can fetch 
it?


// Martin

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".