I originally did some AltiVec assembly, but it seems C AltiVec can be
nearly optimal using carefully constructed loops and the occasional gcc
extension (labels as values). Considering the various ABI issues, VRsave
handling, and the gratuitous GNU/Apple assembler differences, I have since
re-implemented everything in C.
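
(As a rough illustration of what I mean by C AltiVec: with the altivec.h
intrinsics you write ordinary C against the vector types and the compiler
handles register allocation and scheduling. A minimal sketch, not taken
from the FLAC sources; it assumes both arrays are 16-byte aligned and n is
a multiple of 4:

#include <altivec.h>

/* Add two int arrays four elements at a time. */
static void add4(const int *a, const int *b, int *out, int n)
{
    int i;
    for (i = 0; i < n; i += 4) {
        vector signed int va = vec_ld(0, a + i);   /* lvx */
        vector signed int vb = vec_ld(0, b + i);   /* lvx */
        vec_st(vec_add(va, vb), 0, out + i);       /* vector add + stvx */
    }
}

Each iteration compiles down to two vector loads, one vector add, and one
vector store, which is why hand-written assembly buys very little here.)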


For comparison, I'm appending a 16-bit C restore function; though the
setup and unaligned-access logic is, as usual, not pretty, the core
algorithms are (I hope) reasonably clear and small.  I have not done much
comparison between the C and asm functions, but I believe the C function
is very nearly optimal in most cases.

Chris


On 2005/01/29, at 8:40, Brady Patterson wrote:


On Thu, 27 Jan 2005, John Steele Scott wrote:
That looks fine to me as well. However, the best solution is the one Luca
suggested a few months ago: use the functions defined in altivec.h. These are
C functions which map directly to AltiVec machine instructions. I am willing
to help out, but I don't find the current lpc_asm.s very easy to follow, and
my time is quite limited (my last patch to a free software project took almost
three months to get into decent shape!).

Is this still my code? IIRC I commented it extensively, but the structure is
certainly non-intuitive.


I'll take a look at it. At the time, I thought I wanted control logic that was
impossible in C, but that may not be the case. It didn't occur to me that Linux
and Apple would use different assemblers; elsewhere Apple uses the GNU tools.
I'm also a bit surprised that people are using flac on an AltiVec-capable Linux/PPC
system (but I did attempt to have such a system fall back to the non-AltiVec C
code). End digression.


Can you point me to a good reference on altivec.h?

--
Brady Patterson ([EMAIL PROTECTED])
RLRR LRLL RLLR LRRL RRLR LLRL

void FLAC__lpc_restore_signal_16bit_altivec(const FLAC__int32 residual[], unsigned data_len, const FLAC__int32 qlp_coeff[], unsigned order, int lp_quantization, FLAC__int32 data[])
{
    int i, j, *r, *end = (int *)residual + data_len, FLAC__align16 qc[16];
    intptr_t do0;
    vu8 p;
    vs16 qF8, q70, hF8, h70, t;
    vs32 r03, s, zero = vec_splat_s32(0);
    vu32 lpq;


    FLAC__ASSERT(order > 0);
    FLAC__ASSERT(VecRelAligned(data, residual));

    if (order < 2 || order > 16) {
        FLAC__lpc_restore_signal(residual, data_len, qlp_coeff, order,
                lp_quantization, data);
        return;
    }

    /* Load lp_quantization into all elements of lpq
     */
    VecLoad4(lpq, (unsigned int *)&lp_quantization);

    /* qc[] = qlp_coeff[] reversed, aligned, and padded with enough
     * zeros to complete the vector.
     */
    j = order; i = 16; r = (int *)qlp_coeff;
    do {
        qc[--i] = *(r++);
    } while (--j);
    while (i & 3)
        qc[--i] = 0;
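    /* e.g. with order == 5: qc[15..11] = qlp_coeff[0..4], qc[10..8] = 0,
     * and qc[7..0] are never read in that case.
     */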

    /* This switch loads the necessary qlp coefficients and data history
     * into the q* and h* vectors. They are arranged like so:
     * qF8 = qlp[15] - qlp[8], q70 = qlp[7] - qlp[0]
     * hF8 = data[-16] - data[-9], h70 = data[-8] - data[-1]
     * Loading the data is complicated by the fact that it may not be vector
     * aligned. First, the loads are implicitly rounded down one vector. Then,
     * the packed vectors need to be shifted so that the actual data is
     * aligned at the right. That is the purpose of p here.
     */
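    /* (vec_ld masks the low four bits of the effective address, so offsets
     * such as -1 and -17 fetch the 16-byte block containing that byte
     * whether or not data itself is vector aligned.)
     */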
    p = vec_lvsr(0, (short *)((-(intptr_t)data & 15) >> 1));
    r03 = s = zero;
    switch ((order + 3) & ~3) {
    case 16:
        r03 = vec_ld(0, qc);
        s = vec_ld(-49, data);
        /* fall through */
    case 12:
        qF8 = vec_pack(r03, vec_ld(16, qc));
        t = vec_pack( s, vec_ld(-33, data));
        hF8 = vec_perm( t, t, p);
        /* fall through */
    case 8:
        r03 = vec_ld(32, qc);
        s = vec_ld(-17, data);
        /* fall through */
    case 4:
        q70 = vec_pack(r03, vec_ld(48, qc));
        h70 = vec_pack( s, vec_ld(-1, data));
        h70 = vec_perm( t, h70, p);
    }


    /* p is used to shift the history vector to the left one element, and
     * to insert the recently calculated data element s. Keep in mind,
     * restore*() only computes one data element at a time: the vec_sums()
     * leaves the sum in the high word, and the remaining calculation of s
     * is entirely serial.
     */
    p = (vu8)AVV( 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,30,31);
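    /* Concretely: bytes 2..15 shift h70 left by one 16-bit element, and
     * bytes 30,31 pick up the low halfword of word element 3 of s, which is
     * where the newly computed sample ends up (vec_sums() places the sum
     * there, and vec_add() adds the residual to it).
     */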


    do0 = (intptr_t)data - (intptr_t)residual - 16; /* -16 for preincrement */
    r = (int *)residual;
    r03 = vec_ld(0, residual);


    if (order > 8) {
#define restore16(r) \
        s = vec_sums(vec_msum(q70, h70, vec_msum(qF8, hF8, zero)), zero); \
        s = vec_add(r, vec_sra(s, lpq)); \
        hF8 = vec_sld(hF8, h70, 2); h70 = vec_perm(h70, (vs16)s, p);
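        /* Each restore16() step computes one sample of the scalar recurrence
         * data[i] = residual[i] + (sum(qlp_coeff[j] * data[i-1-j]) >> lp_quantization),
         * with the 16 (zero-padded) taps split across the two vec_msum()s.
         */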
        do {
            restore16(vec_perm(r03, r03, vec_lvsl(0, ++r)));
        } while (!VecAligned(r));
        vec_st(vec_unpackl(h70), 0, data);
        while (r < end) {
            r03 = vec_ld(0, r);
            r += 4;
            restore16(vec_splat(r03, 0));
            restore16(vec_splat(r03, 1));
            restore16(vec_splat(r03, 2));
            restore16(vec_splat(r03, 3));
            vec_st(vec_unpackl(h70), do0, r);
        }
#undef restore16
    } else {
#define restore8(r) \
        s = vec_sums(vec_msum(q70, h70, zero), zero); \
        s = vec_add(r, vec_sra(s, lpq)); \
        h70 = vec_perm(h70, (vs16)s, p);
        do {
            restore8(vec_perm(r03, r03, vec_lvsl(0, ++r)));
        } while (!VecAligned(r));
        vec_st(vec_unpackl(h70), 0, data);
        while (r < end) {
            r03 = vec_ld(0, r);
            r += 4;
            restore8(vec_splat(r03, 0));
            restore8(vec_splat(r03, 1));
            restore8(vec_splat(r03, 2));
            restore8(vec_splat(r03, 3));
            vec_st(vec_unpackl(h70), do0, r);
        }
#undef restore8
    }
}
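
For reference, here is the plain scalar loop the vector code above is
reproducing (essentially what the existing FLAC__lpc_restore_signal() does;
the name restore_ref and the exact declaration below are mine, shown only to
make the recurrence explicit, not as a drop-in replacement):

static void restore_ref(const FLAC__int32 residual[], unsigned data_len, const FLAC__int32 qlp_coeff[], unsigned order, int lp_quantization, FLAC__int32 data[])
{
    int i, j;
    FLAC__int32 sum;

    /* data[] must be preceded by at least `order' valid history samples. */
    for (i = 0; i < (int)data_len; i++) {
        sum = 0;
        for (j = 0; j < (int)order; j++)
            sum += qlp_coeff[j] * data[i - j - 1];
        data[i] = residual[i] + (sum >> lp_quantization);
    }
}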


_______________________________________________
Flac-dev mailing list
Flac-dev@xiph.org
http://lists.xiph.org/mailman/listinfo/flac-dev
