On Sat, 02 Apr 2016 13:30:58 +0100, Mizuki Asakura <[email protected]> wrote:
> This patch only contains STD_FAST_PATH codes, not scaling (nearest, bilinear) codes.

Hi Mizuki,

It looks like you have used an automated process to convert the AArch32 NEON code to AArch64. Will you be able to repeat that process for other code, or at least assist others to repeat your steps?

The reason I ask is that I have a large number of outstanding patches to the ARM NEON support. The process of getting them merged into the FreeDesktop git repository has been very slow because there aren't many people on this list with the time and ability to review them. However, my versions are in many cases up to twice the speed of the FreeDesktop versions, and it would be a shame if AArch64 couldn't benefit from them. If your AArch64 conversion is a one-time thing, it will make it extremely difficult to merge my changes in.

> After completing optimization this patch, scaling related codes should be done.

One of my aims was to implement missing "iter" routines so as to accelerate scaled plots for a much wider combination of pixel formats and Porter-Duff combiner rules than the existing limited selection of fast paths could cover. If you look towards the end of my patch series here:

https://github.com/bavison/pixman/commits/arm-neon-release1

you'll see that I discovered that I was actually outperforming Pixman's existing bilinear plotters so consistently that I'm advocating removing them entirely, with the additional advantage that it simplifies the code base a lot. So you might want to consider whether it's worth bothering converting those to AArch64 in the first place.

I would maybe go so far as to suggest that you try converting all the iters first and only add fast paths if you find they do better than the iters. One of the drawbacks of using iters is that the prefetch code can't be as sophisticated: it can't easily be prefetching the start of the next row while it is still working on the end of the current one. But since hardware prefetchers are better now and conditional execution is hard in AArch64, this will be less of a drawback with AArch64 CPUs.

I'll also repeat what has been said, that it's very neat the way the existing prefetch code sneaks calculations into pipeline stalls, but it was only ever really ideal for Cortex-A8. With Cortex-A7 (despite the number, actually a much more recent 32-bit core) I noted that it was impossible to schedule such complex prefetch code without adding to the cycle count, at least when the images were already in the cache.

Ben
