I originally did some AltiVec assembly, but it seems C AltiVec can be
nearly optimal using carefully constructed loops and the occasional gcc
extension (labels as values). Considering the various ABI issues, VRsave
handling, and the gratuitous GNU/Apple assembler differences, I have since
re-implemented everything in C.
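
(As a rough illustration of what I mean by C AltiVec: with the altivec.h
intrinsics you write ordinary C against the vector types and the compiler
handles register allocation and scheduling. A minimal sketch, not taken
from the FLAC sources; it assumes both arrays are 16-byte aligned and n is
a multiple of 4:

#include <altivec.h>

/* Add two int arrays four elements at a time. */
static void add4(const int *a, const int *b, int *out, int n)
{
    int i;
    for (i = 0; i < n; i += 4) {
        vector signed int va = vec_ld(0, a + i);   /* lvx */
        vector signed int vb = vec_ld(0, b + i);   /* lvx */
        vec_st(vec_add(va, vb), 0, out + i);       /* vector add + stvx */
    }
}

Each iteration compiles down to two vector loads, one vector add, and one
vector store, which is why hand-written assembly buys very little here.)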


For comparison, I'm appending a 16-bit C restore function; though the
setup and unaligned-access logic is, as usual, not pretty, the core
algorithms are (I hope) reasonably clear and small.  I have not done much
comparison between the C and asm functions, but I believe the C function
is very nearly optimal in most cases.

Chris


On 2005/01/29, at 8:40, Brady Patterson wrote:


On Thu, 27 Jan 2005, John Steele Scott wrote:
That looks fine to me as well. However, the best solution is the one Luca
suggested a few months ago: use the functions defined in altivec.h. These are
C functions which map directly to AltiVec machine instructions. I am willing
to help out, but I don't find the current lpc_asm.s very easy to follow, and
my time is quite limited (my last patch to a free software project took almost
three months to get into decent shape!).

Is this still my code? IIRC I commented it extensively, but the structure is
certainly non-intuitive.


I'll take a look at it. At the time, I thought I wanted control logic that was
impossible in C, but that may not be the case. It didn't occur to me that Linux
and Apple would use different assemblers; elsewhere Apple uses the GNU tools.
I'm also a bit surprised that people are using flac on an AltiVec-capable Linux/PPC
system (but I did attempt to have such a system fall back to the non-AltiVec C
code). End digression.


Can you point me to a good reference on altivec.h?

--
Brady Patterson ([EMAIL PROTECTED])
RLRR LRLL RLLR LRRL RRLR LLRL

void FLAC__lpc_restore_signal_16bit_altivec(const FLAC__int32 residual[], unsigned data_len, const FLAC__int32 qlp_coeff[], unsigned order, int lp_quantization, FLAC__int32 data[])
{
    int i, j, *r, *end = (int *)residual + data_len, FLAC__align16 qc[16];
    intptr_t do0;
    vu8 p;
    vs16 qF8, q70, hF8, h70, t;
    vs32 r03, s, zero = vec_splat_s32(0);
    vu32 lpq;


    FLAC__ASSERT(order > 0);
    FLAC__ASSERT(VecRelAligned(data, residual));

    if (order < 2 || order > 16) {
        FLAC__lpc_restore_signal(residual, data_len, qlp_coeff, order,
                lp_quantization, data);
        return;
    }

    /* Load lp_quantization into all elements of lpq
     */
    VecLoad4(lpq, (unsigned int *)&lp_quantization);

    /* qc[] = qlp_coeff[] reversed, aligned, and padded with enough
     * zeros to complete the vector.
     */
    j = order; i = 16; r = (int *)qlp_coeff;
    do {
        qc[--i] = *(r++);
    } while (--j);
    while (i & 3)
        qc[--i] = 0;
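    /* e.g. with order == 5: qc[15..11] = qlp_coeff[0..4], qc[10..8] = 0,
     * and qc[7..0] are never read in that case.
     */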

    /* This switch loads the necessary qlp coefficients and data history
     * into the q* and h* vectors. They are arranged like so:
     * qF8 = qlp[15] - qlp[8], q70 = qlp[7] - qlp[0]
     * hF8 = data[-16] - data[-9], h70 = data[-8] - data[-1]
     * Loading the data is complicated by the fact that it may not be vector
     * aligned. First, the loads are implicitly rounded down one vector. Then,
     * the packed vectors need to be shifted so that the actual data is
     * aligned at the right. That is the purpose of p here.
     */
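    /* (vec_ld masks the low four bits of the effective address, so offsets
     * such as -1 and -17 fetch the 16-byte block containing that byte
     * whether or not data itself is vector aligned.)
     */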
    p = vec_lvsr(0, (short *)((-(intptr_t)data & 15) >> 1));
    r03 = s = zero;
    switch ((order + 3) & ~3) {
    case 16:
        r03 = vec_ld(0, qc);
        s = vec_ld(-49, data);
        /* fall through */
    case 12:
        qF8 = vec_pack(r03, vec_ld(16, qc));
        t = vec_pack( s, vec_ld(-33, data));
        hF8 = vec_perm( t, t, p);
        /* fall through */
    case 8:
        r03 = vec_ld(32, qc);
        s = vec_ld(-17, data);
        /* fall through */
    case 4:
        q70 = vec_pack(r03, vec_ld(48, qc));
        h70 = vec_pack( s, vec_ld(-1, data));
        h70 = vec_perm( t, h70, p);
    }


    /* p is used to shift the history vector to the left one element, and
     * to insert the recently calculated data element s. Keep in mind,
     * restore*() only computes one data element at a time: the vec_sums()
     * leaves the sum in the high word, and the remaining calculation of s
     * is entirely serial.
     */
    p = (vu8)AVV( 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,30,31);
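    /* Concretely: bytes 2..15 shift h70 left by one 16-bit element, and
     * bytes 30,31 pick up the low halfword of word element 3 of s, which is
     * where the newly computed sample ends up (vec_sums() places the sum
     * there, and vec_add() adds the residual to it).
     */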


    do0 = (intptr_t)data - (intptr_t)residual - 16; /* -16 for preincrement */
    r = (int *)residual;
    r03 = vec_ld(0, residual);


    if (order > 8) {
#define restore16(r) \
        s = vec_sums(vec_msum(q70, h70, vec_msum(qF8, hF8, zero)), zero); \
        s = vec_add(r, vec_sra(s, lpq)); \
        hF8 = vec_sld(hF8, h70, 2); h70 = vec_perm(h70, (vs16)s, p);
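        /* Each restore16() step computes one sample of the scalar recurrence
         * data[i] = residual[i] + (sum(qlp_coeff[j] * data[i-1-j]) >> lp_quantization),
         * with the 16 (zero-padded) taps split across the two vec_msum()s.
         */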
        do {
            restore16(vec_perm(r03, r03, vec_lvsl(0, ++r)));
        } while (!VecAligned(r));
        vec_st(vec_unpackl(h70), 0, data);
        while (r < end) {
            r03 = vec_ld(0, r);
            r += 4;
            restore16(vec_splat(r03, 0));
            restore16(vec_splat(r03, 1));
            restore16(vec_splat(r03, 2));
            restore16(vec_splat(r03, 3));
            vec_st(vec_unpackl(h70), do0, r);
        }
#undef restore16
    } else {
#define restore8(r) \
        s = vec_sums(vec_msum(q70, h70, zero), zero); \
        s = vec_add(r, vec_sra(s, lpq)); \
        h70 = vec_perm(h70, (vs16)s, p);
        do {
            restore8(vec_perm(r03, r03, vec_lvsl(0, ++r)));
        } while (!VecAligned(r));
        vec_st(vec_unpackl(h70), 0, data);
        while (r < end) {
            r03 = vec_ld(0, r);
            r += 4;
            restore8(vec_splat(r03, 0));
            restore8(vec_splat(r03, 1));
            restore8(vec_splat(r03, 2));
            restore8(vec_splat(r03, 3));
            vec_st(vec_unpackl(h70), do0, r);
        }
#undef restore8
    }
}
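
For reference, here is the plain scalar loop the vector code above is
reproducing (essentially what the existing FLAC__lpc_restore_signal() does;
the name restore_ref and the exact declaration below are mine, shown only to
make the recurrence explicit, not as a drop-in replacement):

static void restore_ref(const FLAC__int32 residual[], unsigned data_len, const FLAC__int32 qlp_coeff[], unsigned order, int lp_quantization, FLAC__int32 data[])
{
    int i, j;
    FLAC__int32 sum;

    /* data[] must be preceded by at least `order' valid history samples. */
    for (i = 0; i < (int)data_len; i++) {
        sum = 0;
        for (j = 0; j < (int)order; j++)
            sum += qlp_coeff[j] * data[i - j - 1];
        data[i] = residual[i] + (sum >> lp_quantization);
    }
}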


_______________________________________________
Flac-dev mailing list
Flac-dev@xiph.org
http://lists.xiph.org/mailman/listinfo/flac-dev
