How to select the regex that matches the first token of a string?

vnr via Digitalmars-d-learn Sat, 03 Jul 2021 02:11:59 -0700

Hello,

I am trying to write a small generic lexer that bases its token analysis on regular expressions. The principle I have in mind is to define a table of token types, each with its corresponding regular expression. Here is the code I currently have:


```d
import std.regex;

/// A token: its type, the regex that recognizes it, and the matched text
struct Token
{
    /// The token type
    string type;
    /// The regex to match the token
    Regex!char re;
    /// The matched string
    string matched = null;
}

/// Function to find the right token in the given table
Token find(Token[] table, const(Captures!string delegate(Token) pure @safe) fn)
{
    foreach (token; table)
        if (fn(token)) return token;
    return Token("", regex(r""));
}

/// The lexer class
class Lexer
{
    private Token[] tokens;

    /// Creates a lexer holding the given tokens
    this(Token[] tkns = [])
    {
        this.tokens = tkns;
    }

    override string toString() const
    {
        import std.algorithm : map;
        import std.conv : to;
        import std.format : format;

        return to!string(this.tokens.map!(tok =>
            format("(%s, %s)", tok.type, tok.matched)));
    }

    // Other useful methods ...
}

/// My token table
static Token[] table =
    [ Token("NUMBER", regex(r"(?:\d+(?:\.\d*)?|\.\d+)"))
    , Token("MINUS", regex(r"\-"))
    , Token("PLUS", regex(r"\+")) ];

/// Build a new lexer
Lexer lex(string text)
{
    Token[] result = [];

    while (text.length > 0)
    {
        Token token = table.find((Token t) => matchFirst(text, t.re));
        const string tmatch = matchFirst(text, token.re)[0];

        result ~= Token(token.type, token.re, tmatch);
        text = text[tmatch.length .. $];
    }
    return new Lexer(result);
}

void main()
{
    import std.stdio : writeln;

    const auto l = lex("3+2");
    writeln(l);
}
```

When I run this program, it gives the following sequence:

```
["(NUMBER, 3)", "(NUMBER, 2)", "(NUMBER, 2)"]
```

What I want instead is this:

```
["(NUMBER, 3)", "(PLUS, +)", "(NUMBER, 2)"]
```

The problem seems to come from the `find` function, which returns the first regex in the table that matches anywhere in the text, and not the regex that matches the first substring of the text (I hope I am clear enough 😅).
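
To make that diagnosis concrete, here is a minimal sketch that is separate from the program above. The helper `leadingMatch` is hypothetical (it is not part of my lexer), and the sketch assumes that the `pre` property of `Captures` holds the part of the input that precedes the match, so an empty `pre` means the match starts at the first character. Anchoring each pattern with `^` might be another way to get the same effect.

```d
import std.regex : Regex, matchFirst, regex;

/// Hypothetical helper (not in the original code): returns the text matched
/// by `re` only when the match begins at the very start of `text`,
/// otherwise returns null.
string leadingMatch(string text, Regex!char re)
{
    auto m = matchFirst(text, re);
    // `pre` is the slice of input before the match; it is empty only
    // when the match starts at index 0.
    if (m.empty || m.pre.length != 0)
        return null;
    return m.hit;
}

void main()
{
    import std.stdio : writeln;

    // "+2" is what remains of "3+2" once "3" has been consumed.
    auto number = regex(r"\d+");
    auto plus = regex(r"\+");

    // An unanchored search skips the "+" and still finds "2".
    writeln(matchFirst("+2", number).hit);       // 2

    // Requiring the match to start at index 0 rejects NUMBER, accepts PLUS.
    writeln(leadingMatch("+2", number) is null); // true
    writeln(leadingMatch("+2", plus));           // +
}
```

If this reading is right, a check of this kind in place of the plain `matchFirst` test passed to `find` would select the token that starts the remaining text, though I have not verified it inside the full lexer.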

I'm not used to working with regexes, especially in D, so I'm not sure how to approach a solution to this problem.


I thank you in advance for your help.
