Gitweb links:

...log 
http://git.netsurf-browser.org/libnsgif.git/shortlog/c8c36737b57bd5ed7a1872e74efc3226e6031aaf
...commit 
http://git.netsurf-browser.org/libnsgif.git/commit/c8c36737b57bd5ed7a1872e74efc3226e6031aaf
...tree 
http://git.netsurf-browser.org/libnsgif.git/tree/c8c36737b57bd5ed7a1872e74efc3226e6031aaf

The branch, tlsa/rewrite has been updated
       via  c8c36737b57bd5ed7a1872e74efc3226e6031aaf (commit)
      from  1b0c8759f5d96e96ded4cb50bb2772103780e2eb (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
commitdiff 
http://git.netsurf-browser.org/libnsgif.git/commit/?id=c8c36737b57bd5ed7a1872e74efc3226e6031aaf
commit c8c36737b57bd5ed7a1872e74efc3226e6031aaf
Author: Michael Drake <t...@netsurf-browser.org>
Commit: Michael Drake <t...@netsurf-browser.org>

    GIF: Optimise deinterlacing of frame rows.
    
    The original interlacing handling looked like this:
    
    ```c
    static inline uint32_t f1(int height, int y)
    {
            if ((y << 3) < height) {
                    return (y << 3);         /* Pass 1 */
            }
            y -= ((height + 7) >> 3);
            if ((y << 3) < (height - 4)) {
                    return (y << 3) + 4;     /* Pass 2 */
            }
            y -= ((height + 3) >> 3);
            if ((y << 2) < (height - 2)) {
                    return (y << 2) + 2;     /* Pass 3 */
            }
            y -= ((height + 1) >> 2);
            return (y << 1) + 1;             /* Pass 4 */
    }
    ```
    
    Annotated with comments to show the passes.  Thought it should be possible
    to do the job without all the calculations and subtractions from `y`, and
    I also didn't like that it was not super obvious how it aligned with the
    spec or how it dealt with pass 4 last, when it is the most common case,
    happening for 50% of image rows.
    
    See the following excerpt from from the GIF89a Specification.
    
    > The rows of an Interlaced images are arranged in the following order:
    >
    >      Group 1 : Every 8th. row, starting with row 0.              (Pass 1)
    >      Group 2 : Every 8th. row, starting with row 4.              (Pass 2)
    >      Group 3 : Every 4th. row, starting with row 2.              (Pass 3)
    >      Group 4 : Every 2nd. row, starting with row 1.              (Pass 4)
    >
    > The Following example illustrates how the rows of an interlaced image are
    > ordered.
    >
    >      Row Number                                        Interlace Pass
    >
    > 0    -----------------------------------------       1
    > 1    -----------------------------------------                         4
    > 2    -----------------------------------------                   3
    > 3    -----------------------------------------                         4
    > 4    -----------------------------------------             2
    > 5    -----------------------------------------                         4
    > 6    -----------------------------------------                   3
    > 7    -----------------------------------------                         4
    > 8    -----------------------------------------       1
    > 9    -----------------------------------------                         4
    > 10   -----------------------------------------                   3
    > 11   -----------------------------------------                         4
    > 12   -----------------------------------------             2
    > 13   -----------------------------------------                         4
    > 14   -----------------------------------------                   3
    > 15   -----------------------------------------                         4
    > 16   -----------------------------------------       1
    > 17   -----------------------------------------                         4
    > 18   -----------------------------------------                   3
    > 19   -----------------------------------------                         4
    
    -- https://www.w3.org/Graphics/GIF/spec-gif89a.txt
    
    To test performance I ran it for all possible GIF heights:
    
    ```c
            for (unsigned height = 1; height < UINT16_MAX; height++) {
                    for (uint32_t y = 0; y < height; y++) {
                            use_val(f1(height, y));
                    }
            }
    ```
    
    For the original implementation (`f1`), I got the following performance:
    
    | Compiler | Time   |
    | -------- | ------ |
    | Clang    | 3.830s |
    | GCC      | 4.737s |
    
    All tests used GCC 11.2 and Clang 13.0, with `-O2` optimisation.
    
    Next I rewrote it to more closely resemble my reading of the spec.
    The compiler can convert the multiplies to shifts and I found this
    more readable than `f1`, and more closely matching the spec.
    
    The `height / N * N` bits all use powers of two for N so they boil
    down the a simple bitwise AND with an appropriate mask to round down
    to the nearest multiple of N.
    
    ```c
    static inline uint32_t f2(uint32_t height, uint32_t y)
    {
            height--;
            if (y * 8 <= height) {
                    return y * 8;                       /* Pass 1 */
            }
            if (y * 4 <= height) {
                    return y * 8 - height / 8 * 8 - 4;  /* Pass 2 */
            }
            if (y * 2 <= height) {
                    return y * 4 - height / 4 * 4 - 2;  /* Pass 3 */
            }
            return y * 2 - height / 2 * 2 - 1;          /* Pass 4 */
    }
    ```
    
    For `f2`, I got the following performance:
    
    | Compiler | Time   |
    | -------- | ------ |
    | Clang    | 2.758s |
    | GCC      | 3.909s |
    
    So it's faster than `f1`, and Clang is still doing better than GCC.
    
    After that I reversed the order, so that pass 4 would be handled first.
    
    ```c
    static inline uint32_t f3(uint32_t height, uint32_t y)
    {
            height--;
            if (y * 2 > height) {
                    return y * 2 - height / 2 * 2 - 1;
            }
            if (y * 4 > height) {
                    return y * 4 - height / 4 * 4 - 2;
            }
            if (y * 8 > height) {
                    return y * 8 - height / 8 * 8 - 4;
            }
            return y * 8;
    }
    ```
    
    For `f3`, I got the following performance:
    
    | Compiler | Time   |
    | -------- | ------ |
    | Clang    | 3.329s |
    | GCC      | 3.630s |
    
    Interestingly this was consistently faster than `f2` with GCC, but
    slower than `f2` when built with Clang. (It's still faster than the
    original `f1` with both compilers.)
    
    I thought it might be because the branches in `f2` are more
    predictable than in `f3`.  For example the first branch in
    `f2` is only true one eighth of the time, while all the branches
    in `f3` are 50:50.
    
    However, running under `perf` showed this was not the case.
    Actually the faster `f2` for Clang had more branch misses than
    the Clang `f3`.
    
    | Compiler | Implementation | Branches       | Branch misses |
    | -------- | -------------- | -------------- | ------------- |
    | Clang    | `f2`           | 13,690,378,148 |       656,517 |
    | Clang    | `f3`           | 10,738,688,711 |       275,147 |
    | GCC      | `f2`           | 10,738,659,995 |       336,992 |
    | GCC      | `f3`           | 10,201,842,339 |     1,314,445 |
    
    After this I tried changing the implementation completely.  All of
    the previous implementations were based on the idea of mapping from
    an uninterlaced row number to an interlaced one.
    
    ```c
    static inline bool f4(uint32_t height, uint32_t *y, uint8_t *pass)
    {
            *y += *pass == 0 ? 8 : 16 >> *pass;
    
            if (*y < height) return true;
    
            switch (*pass) {
            case 0: *pass += 1; *y = 4; if (*y < height) return true;
            case 1: *pass += 1; *y = 2; if (*y < height) return true;
            case 2: *pass += 1; *y = 1; return true;
            }
    
            return false;
    }
    ```
    
    With `f4` we simply step to the next line in interlaced order.
    The `16 >> *pass` generates the correct step size for all but the
    first pass, so the first pass is special cased.
    
    The switch statement if for handling advancing from one pass to
    the next, so it is reached a maximum of four times per frame.
    The rest of the time, we return on the `if (*y < height) return true;`
    above.  The switch falls through to handle the case where the frame
    height is less than the start row for the given pass.
    
    For `f4`, I got the following performance:
    
    | Compiler | Time   |
    | -------- | ------ |
    | Clang    | 2.315s |
    | GCC      | 3.479s |
    
    Again, this was tested for all possible GIF heights, but the `for`
    loop for rows changes to a do-while loop:
    
    ```c
            for (unsigned height = 1; height < UINT16_MAX; height++) {
                    uint8_t pass = 0;
                    uint32_t y = 0;
                    do {
                            use_val(y);
                    } while (f4(height, &y, &pass));
            }
    ```
    
    This is the fastest yet for both compilers, with Clang still performing
    best.
    
    One thing I didn't like about `f4` is that it calculates the y step
    size every call, although the step only actually changes three times,
    when we change to the next pass.
    
    So for `f5` the switch was updated to set the step size.
    
    ```c
    static inline bool f5(uint32_t height, uint32_t *y, uint8_t *step)
    {
            *y += *step & 0xf;
    
            if (*y < height) return true;
    
            switch (*step) {
            case 24: *y = 4; *step = 8; if (*y < height) return true;
            case  8: *y = 2; *step = 4; if (*y < height) return true;
            case  4: *y = 1; *step = 2; return true;
            }
    
            return false;
    }
    ```
    
    To avoid the caller having to maintain extra state, the pass number
    is no longer used, and we use the step size to keep track of which
    pass we are on.  However, the step size for the first two passes is
    the same, so that led to the strange use of 24 for the initial step
    size, and the masking of step to ensure only the bottom four bits are
    used.
    
    For `f5`, I got the following performance:
    
    | Compiler | Time   |
    | -------- | ------ |
    | Clang    | 2.364s |
    | GCC      | 2.891s |
    
    This is the fastest yet for both compilers, with Clang still winning
    easily.
    
    Note that when the functions are not allowed to inline, `f2` and `f3`
    are the fastest.

diff --git a/src/libnsgif.c b/src/libnsgif.c
index 7d48b7e..6046769 100644
--- a/src/libnsgif.c
+++ b/src/libnsgif.c
@@ -247,21 +247,56 @@ static gif_result gif__recover_frame(
        return GIF_OK;
 }
 
-static uint32_t gif_interlaced_line(int height, int y)
+/**
+ * Get the next line for GIF decode.
+ *
+ * Note that the step size must be initialised to 24 at the start of the frame
+ * (when y == 0).  This is because of the first two passes of the frame have
+ * the same step size of 8, and the step size is used to determine the current
+ * pass.
+ *
+ * \param[in]     height     Frame height in pixels.
+ * \param[in,out] y          Current row, starting from 0, updated on exit.
+ * \param[in,out] step       Current step starting with 24, updated on exit.
+ * \return true if there is a row to process, false at the end of the frame.
+ */
+static inline bool gif__deinterlace(uint32_t height, uint32_t *y, uint8_t 
*step)
 {
-       if ((y << 3) < height) {
-               return (y << 3);
+       *y += *step & 0xf;
+
+       if (*y < height) return true;
+
+       switch (*step) {
+       case 24: *y = 4; *step = 8; if (*y < height) return true;
+                /* Fall through. */
+       case  8: *y = 2; *step = 4; if (*y < height) return true;
+                /* Fall through. */
+       case  4: *y = 1; *step = 2; if (*y < height) return true;
+                /* Fall through. */
+       default:
+               break;
        }
-       y -= ((height + 7) >> 3);
-       if ((y << 3) < (height - 4)) {
-               return (y << 3) + 4;
-       }
-       y -= ((height + 3) >> 3);
-       if ((y << 2) < (height - 2)) {
-               return (y << 2) + 2;
+
+       return false;
+}
+
+/**
+ * Get the next line for GIF decode.
+ *
+ * \param[in]     interlace  Non-zero if the frame is not interlaced.
+ * \param[in]     height     Frame height in pixels.
+ * \param[in,out] y          Current row, starting from 0, updated on exit.
+ * \param[in,out] step       Current step starting with 24, updated on exit.
+ * \return true if there is a row to process, false at the end of the frame.
+ */
+static inline bool gif__next_row(uint32_t interlace,
+               uint32_t height, uint32_t *y, uint8_t *step)
+{
+       if (!interlace) {
+               return (++*y != height);
+       } else {
+               return gif__deinterlace(height, y, step);
        }
-       y -= ((height + 1) >> 2);
-       return (y << 1) + 1;
 }
 
 static gif_result gif__decode_complex(
@@ -276,9 +311,15 @@ static gif_result gif__decode_complex(
                uint32_t *restrict frame_data,
                uint32_t *restrict colour_table)
 {
-       uint32_t available = 0;
-       gif_result ret = GIF_OK;
        lzw_result res;
+       gif_result ret = GIF_OK;
+       uint32_t available = 0;
+       uint8_t step = 24;
+       uint32_t y = 0;
+
+       if (height == 0) {
+               return GIF_OK;
+       }
 
        /* Initialise the LZW decoding */
        res = lzw_decode_init(gif->lzw_ctx, data[0],
@@ -288,17 +329,12 @@ static gif_result gif__decode_complex(
                return gif_error_from_lzw(res);
        }
 
-       for (uint32_t y = 0; y < height; y++) {
+       do {
                uint32_t x;
-               uint32_t decode_y;
                uint32_t *frame_scanline;
 
-               if (interlace) {
-                       decode_y = gif_interlaced_line(height, y) + offset_y;
-               } else {
-                       decode_y = y + offset_y;
-               }
-               frame_scanline = frame_data + offset_x + (decode_y * 
gif->width);
+               frame_scanline = frame_data + offset_x +
+                               (y + offset_y) * gif->width;
 
                x = width;
                while (x > 0) {
@@ -338,7 +374,8 @@ static gif_result gif__decode_complex(
                                }
                        }
                }
-       }
+       } while (gif__next_row(interlace, height, &y, &step));
+
        return ret;
 }
 


-----------------------------------------------------------------------

Summary of changes:
 src/libnsgif.c |   83 ++++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 60 insertions(+), 23 deletions(-)

diff --git a/src/libnsgif.c b/src/libnsgif.c
index 7d48b7e..6046769 100644
--- a/src/libnsgif.c
+++ b/src/libnsgif.c
@@ -247,21 +247,56 @@ static gif_result gif__recover_frame(
        return GIF_OK;
 }
 
-static uint32_t gif_interlaced_line(int height, int y)
+/**
+ * Get the next line for GIF decode.
+ *
+ * Note that the step size must be initialised to 24 at the start of the frame
+ * (when y == 0).  This is because of the first two passes of the frame have
+ * the same step size of 8, and the step size is used to determine the current
+ * pass.
+ *
+ * \param[in]     height     Frame height in pixels.
+ * \param[in,out] y          Current row, starting from 0, updated on exit.
+ * \param[in,out] step       Current step starting with 24, updated on exit.
+ * \return true if there is a row to process, false at the end of the frame.
+ */
+static inline bool gif__deinterlace(uint32_t height, uint32_t *y, uint8_t 
*step)
 {
-       if ((y << 3) < height) {
-               return (y << 3);
+       *y += *step & 0xf;
+
+       if (*y < height) return true;
+
+       switch (*step) {
+       case 24: *y = 4; *step = 8; if (*y < height) return true;
+                /* Fall through. */
+       case  8: *y = 2; *step = 4; if (*y < height) return true;
+                /* Fall through. */
+       case  4: *y = 1; *step = 2; if (*y < height) return true;
+                /* Fall through. */
+       default:
+               break;
        }
-       y -= ((height + 7) >> 3);
-       if ((y << 3) < (height - 4)) {
-               return (y << 3) + 4;
-       }
-       y -= ((height + 3) >> 3);
-       if ((y << 2) < (height - 2)) {
-               return (y << 2) + 2;
+
+       return false;
+}
+
+/**
+ * Get the next line for GIF decode.
+ *
+ * \param[in]     interlace  Non-zero if the frame is not interlaced.
+ * \param[in]     height     Frame height in pixels.
+ * \param[in,out] y          Current row, starting from 0, updated on exit.
+ * \param[in,out] step       Current step starting with 24, updated on exit.
+ * \return true if there is a row to process, false at the end of the frame.
+ */
+static inline bool gif__next_row(uint32_t interlace,
+               uint32_t height, uint32_t *y, uint8_t *step)
+{
+       if (!interlace) {
+               return (++*y != height);
+       } else {
+               return gif__deinterlace(height, y, step);
        }
-       y -= ((height + 1) >> 2);
-       return (y << 1) + 1;
 }
 
 static gif_result gif__decode_complex(
@@ -276,9 +311,15 @@ static gif_result gif__decode_complex(
                uint32_t *restrict frame_data,
                uint32_t *restrict colour_table)
 {
-       uint32_t available = 0;
-       gif_result ret = GIF_OK;
        lzw_result res;
+       gif_result ret = GIF_OK;
+       uint32_t available = 0;
+       uint8_t step = 24;
+       uint32_t y = 0;
+
+       if (height == 0) {
+               return GIF_OK;
+       }
 
        /* Initialise the LZW decoding */
        res = lzw_decode_init(gif->lzw_ctx, data[0],
@@ -288,17 +329,12 @@ static gif_result gif__decode_complex(
                return gif_error_from_lzw(res);
        }
 
-       for (uint32_t y = 0; y < height; y++) {
+       do {
                uint32_t x;
-               uint32_t decode_y;
                uint32_t *frame_scanline;
 
-               if (interlace) {
-                       decode_y = gif_interlaced_line(height, y) + offset_y;
-               } else {
-                       decode_y = y + offset_y;
-               }
-               frame_scanline = frame_data + offset_x + (decode_y * 
gif->width);
+               frame_scanline = frame_data + offset_x +
+                               (y + offset_y) * gif->width;
 
                x = width;
                while (x > 0) {
@@ -338,7 +374,8 @@ static gif_result gif__decode_complex(
                                }
                        }
                }
-       }
+       } while (gif__next_row(interlace, height, &y, &step));
+
        return ret;
 }
 


-- 
NetSurf GIF Decoder
_______________________________________________
netsurf-commits mailing list -- netsurf-commits@netsurf-browser.org
To unsubscribe send an email to netsurf-commits-le...@netsurf-browser.org

Reply via email to