Re: [PATCH v2 05/28] docs: kdoc_re: add a C tokenizer

Randy Dunlap Mon, 16 Mar 2026 16:29:51 -0700

Uh, I find this review confusing.
Do your (Jon) comments refer to the code above them?
(more below)



On 3/16/26 4:03 PM, Jonathan Corbet wrote:
> On Thu, 12 Mar 2026 15:54:25 +0100, Mauro Carvalho Chehab 
> <[email protected]> wrote:
>> Handling C code purely using regular expressions doesn't work well.
>>
>> Add a C tokenizer to help doing it the right way.
>>
>> The tokenizer was written using as basis the Python re documentation
>> tokenizer example from:
>>      https://docs.python.org/3/library/re.html#writing-a-tokenizer
>>
>> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
>> Message-ID: 
>> <c63ad36c81fe043e9e33ca55630414893f127413.1773074166.git.mchehab+hua...@kernel.org>
>> Message-ID: 
>> <8541ffa469647db1a7154f274fb2d55b4c127dcb.1773326442.git.mchehab+hua...@kernel.org>
> 
> This is a combined effort to review this patch and to try out "b4 review",
> we'll see how it goes :).
> 
>> diff --git a/tools/lib/python/kdoc/kdoc_re.py 
>> b/tools/lib/python/kdoc/kdoc_re.py
>> index 085b89a4547c0..7bed4e9a88108 100644
>> --- a/tools/lib/python/kdoc/kdoc_re.py
>> +++ b/tools/lib/python/kdoc/kdoc_re.py
>> @@ -141,6 +141,240 @@ class KernRe:
>> [ ... skip 4 lines ... ]
>> +
>> +    @staticmethod
>> +    def __str__(val):
>> +        """Return the name of an enum value"""
>> +        return TokType._name_by_val.get(val, f"UNKNOWN({val})")
>> +
> 
> What is this class supposed to do?
> 
>> [ ... skip 27 lines ... ]
>> +    _name_by_val = {v: k for k, v in dict(vars()).items() if isinstance(v, 
>> int)}
>> +
>> +    # Dict to convert from string to an enum-like integer value.
>> +    _name_to_val = {k: v for v, k in _name_by_val.items()}
>> +
>> +    @staticmethod
> 
> This stuff strikes me as a bit overdone; _name_to_val is really just the
> variable list for the class, right?
> 
>> [ ... skip 30 lines ... ]
>> +               f"{self.brace_level}, {self.paren_level}, 
>> {self.bracket_level})"
>> +
>> +#: Tokens to parse C code.
>> +TOKEN_LIST = [
>> +    (CToken.COMMENT, r"//[^\n]*|/\*[\s\S]*?\*/"),
>> +
> 
> So these aren't "tokens", this is a list of regexes; how is it intended
> to be used?
> 
>> +    (CToken.STRING,  r'"(?:\\.|[^"\\])*"'),
>> +    (CToken.CHAR,    r"'(?:\\.|[^'\\])'"),
>> +
>> +    (CToken.NUMBER,  r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
> 
> How does "[\s\S]*" differ from plain old "*" ?
> 
>> [ ... skip 15 lines ... ]
>> +    (CToken.STRUCT,  r"\bstruct\b"),
>> +    (CToken.UNION,   r"\bunion\b"),
>> +    (CToken.ENUM,    r"\benum\b"),
>> +    (CToken.TYPEDEF, r"\bkinddef\b"),
>> +
>> +    (CToken.NAME,      r"[A-Za-z_][A-Za-z0-9_]*"),
> 
> "-" and "!" never need to be escaped.
> 
>> +
>> +    (CToken.SPACE,   r"[\s]+"),
>> +
>> +    (CToken.MISMATCH,r"."),
>> +]
>> +
> 
> "kinddef" ?

What does that refer to?

> 
>> +#: Handle C continuation lines.
>> +RE_CONT = KernRe(r"\\\n")
>> +
>> +RE_COMMENT_START = KernRe(r'/\*\s*')
>> +
> 
> Don't need the [brackets] here

what brackets?

> 
>> [ ... skip 6 lines ... ]
>> +
>> +    When converted to string, it drops comments and handle public/private
>> +    values, respecting depth.
>> +    """
>> +
>> +    # This class is inspired and follows the basic concepts of:
> 
> That seems weird, why don't you just initialize it here?

I can't tell what that comments refers to.

>> [ ... skip 14 lines ... ]
>> +        source = RE_CONT.sub("", source)
>> +
>> +        brace_level = 0
>> +        paren_level = 0
>> +        bracket_level = 0
>> +
> 
> Do you mean "iterator" here?

Ditto.

>> [ ... skip 33 lines ... ]
>> +        in this particular case, it makes sense, as we can pick the name
>> +        when matching a code via re_scanner().
>> +        """
>> +        global re_scanner
>> +
>> +        if not re_scanner:
> 
> Putting __init__() first is fairly standard, methinks.
> 
>> [ ... skip 15 lines ... ]
>> +
>> +        for tok in self.tokens:
>> +            if tok.kind == CToken.BEGIN:
>> +                show_stack.append(show_stack[-1])
>> +
>> +            elif tok.kind == CToken.END:
> 
> I still don't understand why you do this here - this is all constant, right?
> 
>> +                prev = show_stack[-1]
>> +                if len(show_stack) > 1:
>> +                    show_stack.pop()
>> +
>> +                if not prev and show_stack[-1]:
> 
> So you create a nice iterator structure, then just put it all together into a
> list anyway?
> 

-- 
~Randy

Re: [PATCH v2 05/28] docs: kdoc_re: add a C tokenizer

Reply via email to