I recall seeing C/C++/D code that speeds up the comment- and whitespace-skipping parts of lexers by operating on 2-, 4- or 8-byte chunks instead of single bytes. This applies when the token terminators are expressed as a set of alternative ASCII characters.

For instance, when searching for the end of a line comment, I would like to speed up the while-loop in

    size_t offset;
    string input = "// \n"; // a line-comment string
    import std.algorithm : among;
    // until end-of-line or file terminator
    while (!input[offset].among!('\0', '\n', '\r'))
    {
        ++offset;
    }

by advancing `offset` in steps larger than one.
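
Something along these lines is roughly what I have in mind: a SWAR-style sketch that scans 8 bytes per iteration and falls back to the byte loop near a hit. Untested; the function name `skipLineComment` and the "full 8-byte load stays in bounds" guard are mine:

    size_t skipLineComment(const(char)[] input, size_t offset)
    {
        import std.algorithm : among;

        enum ulong ones  = 0x0101010101010101UL;
        enum ulong highs = 0x8080808080808080UL;

        // true iff some byte of `v` equals `c` (classic SWAR zero-byte test)
        static bool hasByte(ulong v, char c)
        {
            immutable ulong x = v ^ (ones * c); // matching bytes become zero
            return ((x - ones) & ~x & highs) != 0;
        }

        // chunk loop: only while a full 8-byte load stays inside the buffer
        while (offset + 8 <= input.length)
        {
            // unaligned load; fine on x86
            immutable ulong chunk = *cast(const(ulong)*)(input.ptr + offset);
            if (hasByte(chunk, '\0') || hasByte(chunk, '\n') || hasByte(chunk, '\r'))
                break; // a terminator is somewhere in this chunk
            offset += 8;
        }

        // byte-wise tail, identical to the original sentinel-based loop
        while (!input[offset].among!('\0', '\n', '\r'))
            ++offset;
        return offset;
    }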

Note that my file-reading function, which creates the real `input`, appends a '\0' at the end to enable the sentinel-based search shown in the call to `among` above.
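
That is, the reader is essentially just this (helper name made up):

    import std.file : readText;

    // append the sentinel so the byte-wise scan needs no explicit bounds check
    string readSourceWithSentinel(string path)
    {
        return readText(path) ~ '\0';
    }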

I also recall that there are x86_64 intrinsics that can be used here for further speedups.
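
I imagine an SSE2 version would look something like the sketch below. This assumes the third-party `intel-intrinsics` dub package (module `inteli.emmintrin`), which mirrors the C intrinsic names; the function name is again made up and the code is untested:

    import inteli.emmintrin; // dub package "intel-intrinsics" (assumption)
    import core.bitop : bsf;

    size_t skipLineCommentSSE2(const(char)[] input, size_t offset)
    {
        __m128i nl  = _mm_set1_epi8('\n');
        __m128i cr  = _mm_set1_epi8('\r');
        __m128i nul = _mm_set1_epi8('\0');

        // 16 bytes per iteration while a full load stays inside the buffer
        while (offset + 16 <= input.length)
        {
            __m128i chunk = _mm_loadu_si128(cast(const(__m128i)*)(input.ptr + offset));
            __m128i hits  = _mm_or_si128(_mm_cmpeq_epi8(chunk, nl),
                            _mm_or_si128(_mm_cmpeq_epi8(chunk, cr),
                                         _mm_cmpeq_epi8(chunk, nul)));
            immutable int mask = _mm_movemask_epi8(hits);
            if (mask != 0)
                return offset + bsf(mask); // first terminator in this chunk
            offset += 16;
        }

        // sentinel-based byte-wise tail
        import std.algorithm : among;
        while (!input[offset].among!('\0', '\n', '\r'))
            ++offset;
        return offset;
    }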

Refs, anyone?
