I originally wrote some AltiVec assembly, but it seems C AltiVec code can be
nearly optimal given carefully constructed loops and the occasional gcc
extension (labels as values). Considering the various ABI issues, VRsave,
and gratuitous GNU/Apple differences, I have since reimplemented
everything in C.
For comparison, I'm appending a 16-bit C restore function. The setup and unaligned-data logic is not pretty, but the core algorithms are (I hope) reasonably clear and compact. I have not compared the C and asm functions in detail, but I believe the C function is very nearly optimal in most cases.
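For orientation, the computation that the appended function vectorizes can be sketched in plain scalar C. This is only a minimal sketch, assuming the usual FLAC convention that data[] is preceded by at least `order` samples of history and that right-shifting a negative int is arithmetic; the name restore_signal_ref is hypothetical and not part of FLAC.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar reference: each output sample is the residual plus
 * the quantized linear prediction from the previous `order` samples.
 * data[-order] .. data[-1] must already hold valid history. */
void restore_signal_ref(const int32_t residual[], unsigned data_len,
                        const int32_t qlp_coeff[], unsigned order,
                        int lp_quantization, int32_t data[])
{
    assert(order > 0);
    for (unsigned i = 0; i < data_len; i++) {
        const int32_t *hist = data + i;   /* hist[-1] is the newest sample */
        int32_t sum = 0;
        for (unsigned j = 0; j < order; j++)
            sum += qlp_coeff[j] * hist[-(int)j - 1];
        /* Assumes an arithmetic right shift, like vec_sra in the vector code. */
        data[i] = residual[i] + (sum >> lp_quantization);
    }
}
```

Each sample depends on the one just computed, which is why the vector version can only parallelize the multiply-accumulate and not the outer loop.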
Chris
On 2005/01/29, at 8:40, Brady Patterson wrote:
On Thu, 27 Jan 2005, John Steele Scott wrote:
That looks fine to me as well. However, the best solution is something which
Luca suggested a few months ago, which is to use the functions defined in
altivec.h. These are C functions which map directly to Altivec machine
instructions. I am willing to help out, but I don't find the current lpc_asm.s
very easy to follow, and my time is quite limited (my last patch to a free
software project took almost three months to get into decent shape!).
Is this still my code? IIRC I commented it extensively, but the structure is
certainly non-intuitive.
I'll take a look at it. At the time, I thought I wanted control logic that was
impossible in C, but that may not be the case. It didn't occur to me that Linux
and Apple would use different assemblers; elsewhere Apple uses the GNU tools.
I'm also a bit surprised that people are using flac on an Altivecful Linux/PPC
system (but I did attempt for such a system to fall back to the non-altivec C
code). End digression.
Can you point me to a good reference on altivec.h?
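As a small illustration of the altivec.h style described above, here is a hedged sketch of a four-element integer add. The helper name add4_s32 is hypothetical; vec_ld, vec_add and vec_st are standard altivec.h operations, and the scalar branch is an assumed fallback so the sketch also builds on non-AltiVec targets.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ALTIVEC__)
#include <altivec.h>
#endif

/* Hypothetical helper: out[0..3] = a[0..3] + b[0..3] (modular add).
 * All three pointers must be 16-byte aligned, because vec_ld and vec_st
 * ignore the low four bits of the address. */
void add4_s32(const int32_t *a, const int32_t *b, int32_t *out)
{
    assert(((uintptr_t)a & 15) == 0);
    assert(((uintptr_t)b & 15) == 0);
    assert(((uintptr_t)out & 15) == 0);
#if defined(__ALTIVEC__)
    /* Each call below corresponds to a single AltiVec instruction. */
    __vector signed int va = vec_ld(0, (const int *)a);
    __vector signed int vb = vec_ld(0, (const int *)b);
    vec_st(vec_add(va, vb), 0, (int *)out);
#else
    /* Scalar fallback; unsigned arithmetic gives the same wraparound. */
    for (int i = 0; i < 4; i++)
        out[i] = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
#endif
}
```

The compiler still handles register allocation and scheduling, which is much of the appeal over hand-written assembly.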
-- Brady Patterson ([EMAIL PROTECTED]) RLRR LRLL RLLR LRRL RRLR LLRL
_______________________________________________ Flac-dev mailing list Flac-dev@xiph.org http://lists.xiph.org/mailman/listinfo/flac-dev
void FLAC__lpc_restore_signal_16bit_altivec(const FLAC__int32 residual[], unsigned data_len, const FLAC__int32 qlp_coeff[], unsigned order, int lp_quantization, FLAC__int32 data[])
{
int i, j, *r, *end = (int *)residual + data_len, FLAC__align16 qc[16];
intptr_t do0;
vu8 p;
vs16 qF8, q70, hF8, h70, t;
vs32 r03, s, zero = vec_splat_s32(0);
vu32 lpq;
FLAC__ASSERT(order > 0);
FLAC__ASSERT(VecRelAligned(data, residual));
if (order < 2 || order > 16) {
    FLAC__lpc_restore_signal(residual, data_len, qlp_coeff, order, lp_quantization, data);
    return;
}
/* Load lp_quantization into all elements of lpq */
VecLoad4(lpq, (unsigned int *)&lp_quantization);
/* qc[] = qlp_coeff[] reversed, aligned, and padded with enough
 * zeros to complete the vector. */
j = order;
i = 16;
r = (int *)qlp_coeff;
do {
    qc[--i] = *(r++);
} while (--j);
while (i & 3)
    qc[--i] = 0;
/* This switch loads the necessary qlp coefficients and data history
* into the q* and h* vectors. They are arranged like so:
* qF8 = qlp[15] - qlp[8], q70 = qlp[7] - qlp[0]
* hF8 = data[-16] - data[-9], h70 = data[-8] - data[-1]
* Loading the data is complicated by the fact that it may not be vector
* aligned. First, the loads are implicitly rounded down one vector. Then,
* the packed vectors need to be shifted so that the actual data is
* aligned at the right. That is the purpose of p here.
*/
p = vec_lvsr(0, (short *)((-(intptr_t)data & 15) >> 1));
r03 = s = zero;
switch ((order + 3) & ~3) {
case 16:
r03 = vec_ld(0, qc);
s = vec_ld(-49, data);
case 12:
qF8 = vec_pack(r03, vec_ld(16, qc));
t = vec_pack( s, vec_ld(-33, data));
hF8 = vec_perm( t, t, p);
case 8:
r03 = vec_ld(32, qc);
s = vec_ld(-17, data);
case 4:
q70 = vec_pack(r03, vec_ld(48, qc));
h70 = vec_pack( s, vec_ld(-1, data));
h70 = vec_perm( t, h70, p);
}
/* p is used to shift the history vector to the left one element, and
* to insert the recently calculated data element s. Keep in mind,
* restore*() only computes one data element at a time: the vec_sums()
* leaves the sum in the high word, and the remaining calculation of s
* is entirely serial.
*/
p = (vu8)AVV( 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,30,31);
do0 = (intptr_t)data - (intptr_t)residual - 16; /* -16 for preincrement */
r = (int *)residual;
r03 = vec_ld(0, residual);
if (order > 8) {
#define restore16(r) \
s = vec_sums(vec_msum(q70, h70, vec_msum(qF8, hF8, zero)), zero); \
s = vec_add(r, vec_sra(s, lpq)); \
hF8 = vec_sld(hF8, h70, 2); h70 = vec_perm(h70, (vs16)s, p);
do {
restore16(vec_perm(r03, r03, vec_lvsl(0, ++r)));
} while (!VecAligned(r));
vec_st(vec_unpackl(h70), 0, data);
while (r < end) {
r03 = vec_ld(0, r);
r += 4;
restore16(vec_splat(r03, 0));
restore16(vec_splat(r03, 1));
restore16(vec_splat(r03, 2));
restore16(vec_splat(r03, 3));
vec_st(vec_unpackl(h70), do0, r);
}
#undef restore16
} else {
#define restore8(r) \
s = vec_sums(vec_msum(q70, h70, zero), zero); \
s = vec_add(r, vec_sra(s, lpq)); \
h70 = vec_perm(h70, (vs16)s, p);
do {
restore8(vec_perm(r03, r03, vec_lvsl(0, ++r)));
} while (!VecAligned(r));
vec_st(vec_unpackl(h70), 0, data);
while (r < end) {
r03 = vec_ld(0, r);
r += 4;
restore8(vec_splat(r03, 0));
restore8(vec_splat(r03, 1));
restore8(vec_splat(r03, 2));
restore8(vec_splat(r03, 3));
vec_st(vec_unpackl(h70), do0, r);
}
#undef restore8
}
}
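The history update in restore16/restore8 may be easier to follow with a byte-level model. The sketch below is a portable approximation, assuming AltiVec's big-endian byte numbering (byte 0 is the leftmost); perm_model, sld_model and shift_in_pattern are hypothetical names that mimic vec_perm, vec_sld and the permute vector p from the function above.

```c
#include <assert.h>
#include <stdint.h>

/* Model of vec_perm(a, b, p): each output byte is picked from the
 * 32-byte concatenation a:b by the low five bits of p[i].
 * out must not alias a or b. */
void perm_model(const uint8_t a[16], const uint8_t b[16],
                const uint8_t p[16], uint8_t out[16])
{
    for (int i = 0; i < 16; i++) {
        unsigned idx = p[i] & 31u;
        out[i] = idx < 16 ? a[idx] : b[idx - 16];
    }
}

/* Model of vec_sld(a, b, n): bytes n .. n+15 of the concatenation a:b. */
void sld_model(const uint8_t a[16], const uint8_t b[16],
               unsigned n, uint8_t out[16])
{
    assert(n < 16);
    for (unsigned i = 0; i < 16; i++)
        out[i] = (i + n < 16) ? a[i + n] : b[i + n - 16];
}

/* Same index pattern as p in the loop: drop the oldest 16-bit element
 * (bytes 0-1), keep bytes 2-15, then append bytes 14-15 of the second
 * operand, i.e. the low halfword of its last 32-bit word. */
static const uint8_t shift_in_pattern[16] = {
    2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 30, 31
};
```

In these terms, each restore16 step slides hF8 left by one 16-bit element while pulling in the oldest element of h70, then slides h70 left while pulling in the low 16 bits of the newly computed sample.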