A crazy idea for accurately tracking source position

Alix Pexton via Digitalmars-d Wed, 16 Apr 2014 11:46:48 -0700

TL;DR

Here is some under documented, incomplete and untested code.
CAVEAT IMPLEMENTOR: some details have been omitted to keep things brief!


struct someRange
{
        ulong seq;
        bool fresh = true;
        long line;
        dchar front;
        // and lets just pretend that there is
        // somewhere for more characters to come from!

        void popFront()
        {
                // advance by whatever means to update front.
                if (front.isNewline)
                {
                        ++line;
                        fresh = true;
                        return;
                }
                if (fresh)
                {
                        if (front.isTab)
                        {
                                seq = 0xffff,ffff,ffff,fffeL;
                        }
                        else
                        {
                                seq = 0x1L;
                        }
                        fresh = false;
                }
                else
                {
                        seq <<= 1;
                        if (!front.isTab)
                        {
                                seq |= 0x1L;
                        }
                }
        }

        // and the rest...
}

A long time ago I wrote a very rudimentary XML lexer/parser in pascal.At the time I thought it was a good idea to point to the exact characterwhere a error was detected. Knowing that tabs could be involved and thatthey can have different widths I stored the line position as atabs/spaces tuple, because no one would ever put a space anywhere but atthe beginning of the line, right!

Jump forward a decade or so and I know better. I.e. just knowing thenumber of tabs and spaces isn't enough, because when tabs can beanywhere, sometimes the spaces are swallowed up. What is needed is astring of tabs and spaces that matches the sequence of tabs and non-tabsin the source. Such could be built while lexing for immediate use if aninvalid character is encountered and then thrown away at each newline,but it would not be practical to store that much information in everytoken. The sequence could be split between tokens from a single line,with each token having just the pattern since the last token, butreversing the reading of the tokens in order to reconstruct the sequenceor building it while parsing just in case it is needed are at bestimpractical.

What would help would be a way of fitting that sequence of tabs andspaces into a smaller format.


Here is the crazy part...

Using a ulong (or longer if possible) to store our tab sequence...

Upon starting to lex a fresh line we check the first character, if its atab we set all but the lsb to 1 . If that first char is anything otherthan a tab, we set all bits but the lsb to 0 .

On each subsequent character in the line we shift the sequence left 1bit and set the new lsb as 0 if its a tab and 1 if it is anything else.

If the line is longer than the number of bits in our ulong[er] we throwour toys out of the pram and go home in tears.

Any time a token is emitted the current (or most relevant) value of theulong can be stored in it.

To decode the sequence if it is needed, we check the msb, if it is 1then the first character is a tab and we shift the whole value until thefirst 0 reaches the msb (keeping track of how many shifts we do so asnot to reach apple headquarters) and then one more shift to account forthe first character. If the msb is 0 then the first character is a spaceand we shift left until one past the first 1 . For each remaining bit weadd a tab when the msb is 0 and a space when it is 1 .

Thus we have reconstructed a string that when displayed above or belowthe line that generated it, will end at the correct character,regardless of the number of tabs or spaces used to represent them. Hurrah!

For any type of source where lexed lines are regularly contain morecharacters than there are bits in our longest integer, this techniquewill fail. However, I reason that in most cases the lines that are allnon tabs and full width are often not parsed (i.e. they are commentsetc). Lines that start hard to the left are often short and lines thatreach the right are often the ones with many tabs in them. In otherwords, many lines that are too wide, are not too long.


Am I on to something or should I be on something?

A...

A crazy idea for accurately tracking source position

Reply via email to