> -----Original Message-----
> From: Mauro Carvalho Chehab <[email protected]>
> Sent: Tuesday, March 3, 2026 3:53 PM
> To: Jani Nikula <[email protected]>
> Cc: Lobakin, Aleksander <[email protected]>; Jonathan
> Corbet <[email protected]>; Kees Cook <[email protected]>; Mauro Carvalho
> Chehab <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]; Gustavo A. R. Silva
> <[email protected]>; Loktionov, Aleksandr
> <[email protected]>; Randy Dunlap <[email protected]>;
> Shuah Khan <[email protected]>
> Subject: Re: [PATCH 00/38] docs: several improvements to kernel-doc
> 
> On Mon, 23 Feb 2026 15:47:00 +0200
> Jani Nikula <[email protected]> wrote:
> 
> > On Wed, 18 Feb 2026, Mauro Carvalho Chehab
> <[email protected]> wrote:
> > > As anyone who has worked with kernel-doc before is aware, using
> > > regexes to handle C input is not great. Instead, we need something
> > > closer to how C statements and declarations are handled.
> > >
> > > Yet, to avoid breaking docs, I avoided touching the regex-based
> > > algorithms inside it, with one exception: the struct_group logic
> > > was using very complex regexes that are incompatible with Python's
> > > built-in "re" module.
> > >
> > > So, I came up with a different approach: NestedMatch. The logic
> > > inside it is meant to properly handle braces, square brackets and
> > > parentheses, which is closer to what a C lexer does. At the time,
> > > I added a TODO about the need to extend that.
> >
> > There's always the question, if you're putting a lot of effort into
> > making kernel-doc closer to an actual C parser, why not put all that
> > effort into using and adapting to, you know, an actual C parser?
> 
> Playing with this idea, it is not that hard to write an actual C
> parser - or at least a tokenizer. There is already an example of one
> at:
> 
>       https://docs.python.org/3/library/re.html
> 
> I did a quick implementation, and it seems to be able to do its job:
> 
>     $ ./tokenizer.py ./include/net/netlink.h
>       1:  0  COMMENT       '/* SPDX-License-Identifier: GPL-2.0 */'
>       2:  0  CPP           '#ifndef'
>       2:  8  ID            '__NET_NETLINK_H'
>       3:  0  CPP           '#define'
>       3:  8  ID            '__NET_NETLINK_H'
>       5:  0  CPP           '#include'
>       5:  9  OP            '<'
>       5: 10  ID            'linux'
>       5: 15  OP            '/'
>       5: 16  ID            'types'
>       5: 21  PUNC          '.'
>       5: 22  ID            'h'
>       5: 23  OP            '>'
>       6:  0  CPP           '#include'
>       6:  9  OP            '<'
>       6: 10  ID            'linux'
>       6: 15  OP            '/'
>       6: 16  ID            'netlink'
>       6: 23  PUNC          '.'
>       6: 24  ID            'h'
>       6: 25  OP            '>'
>       7:  0  CPP           '#include'
>       7:  9  OP            '<'
>       7: 10  ID            'linux'
>       7: 15  OP            '/'
>       7: 16  ID            'jiffies'
>       7: 23  PUNC          '.'
>       7: 24  ID            'h'
>       7: 25  OP            '>'
>       8:  0  CPP           '#include'
>       8:  9  OP            '<'
>       8: 10  ID            'linux'
>       8: 15  OP            '/'
>       8: 16  ID            'in6'
> ...
>      12:  1  COMMENT       '/**\n  * Standard attribute types to
> specify validation policy\n  */'
>      13:  0  ENUM          'enum'
>      13:  5  PUNC          '{'
>      14:  1  ID            'NLA_UNSPEC'
>      14: 11  PUNC          ','
>      15:  1  ID            'NLA_U8'
>      15:  7  PUNC          ','
>      16:  1  ID            'NLA_U16'
>      16:  8  PUNC          ','
>      17:  1  ID            'NLA_U32'
>      17:  8  PUNC          ','
>      18:  1  ID            'NLA_U64'
>      18:  8  PUNC          ','
>      19:  1  ID            'NLA_STRING'
>      19: 11  PUNC          ','
>      20:  1  ID            'NLA_FLAG'
> ...
>      41:  0  STRUCT        'struct'
>      41:  7  ID            'netlink_range_validation'
>      41: 32  PUNC          '{'
>      42:  1  ID            'u64'
>      42:  5  ID            'min'
>      42:  8  PUNC          ','
>      42: 10  ID            'max'
>      42: 13  PUNC          ';'
>      43:  0  PUNC          '}'
>      43:  1  PUNC          ';'
>      45:  0  STRUCT        'struct'
>      45:  7  ID            'netlink_range_validation_signed'
>      45: 39  PUNC          '{'
>      46:  1  ID            's64'
>      46:  5  ID            'min'
>      46:  8  PUNC          ','
>      46: 10  ID            'max'
>      46: 13  PUNC          ';'
>      47:  0  PUNC          '}'
>      47:  1  PUNC          ';'
>      49:  0  ENUM          'enum'
>      49:  5  ID            'nla_policy_validation'
>      49: 27  PUNC          '{'
>      50:  1  ID            'NLA_VALIDATE_NONE'
>      50: 18  PUNC          ','
>      51:  1  ID            'NLA_VALIDATE_RANGE'
>      51: 19  PUNC          ','
>      52:  1  ID            'NLA_VALIDATE_RANGE_WARN_TOO_LONG'
>      52: 33  PUNC          ','
>      53:  1  ID            'NLA_VALIDATE_MIN'
>      53: 17  PUNC          ','
>      54:  1  ID            'NLA_VALIDATE_MAX'
>      54: 17  PUNC          ','
>      55:  1  ID            'NLA_VALIDATE_MASK'
>      55: 18  PUNC          ','
>      56:  1  ID            'NLA_VALIDATE_RANGE_PTR'
>      56: 23  PUNC          ','
>      57:  1  ID            'NLA_VALIDATE_FUNCTION'
>      57: 22  PUNC          ','
>      58:  0  PUNC          '}'
>      58:  1  PUNC          ';'
> 
> It sounds doable to use it and, at least in this example, it properly
> picked out the IDs.
> 
> On the other hand, using it would require lots of changes to
> kernel-doc. So, I guess I'll add a tokenizer to kernel-doc, but we
> should likely start using it gradually.
> 
> Maybe starting with NestedMatch and with public/private comment
> handling (which is currently half-broken).
> 
> As a reference, the above was generated with the code below, which was
> based on the Python re documentation.
> 
> Comments?
> 
> ---
> 
> One side note: right now, we're not using typing in kernel-doc, nor
> really following a proper coding style.
> 
> I wanted to use it during the conversion, and place consts in
> uppercase, as this is currently the best practice, but doing that
> while converting from Perl was very annoying. So, I opted to keep
> things simpler. Now that we have it coded, perhaps it is time to
> define a coding style and apply it to kernel-doc.
> 
> --
> Thanks,
> Mauro
> 
> #!/usr/bin/env python3
> 
> import sys
> import re
> 
> class Token():
>     def __init__(self, type, value, line, column):
>         self.type = type
>         self.value = value
>         self.line = line
>         self.column = column
> 
> class CTokenizer():
>     C_KEYWORDS = {
>         "struct", "union", "enum",
>     }
> 
>     TOKEN_LIST = [
>         ("COMMENT", r"//[^\n]*|/\*[\s\S]*?\*/"),
> 
>         ("STRING",  r'"(?:\\.|[^"\\])*"'),
>         ("CHAR",    r"'(?:\\.|[^'\\])'"),
> 
>         ("NUMBER",  r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
>                     r"[0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?[fFlL]*"),
> 
>         ("ID",      r"[A-Za-z_][A-Za-z0-9_]*"),
> 
>         ("OP",      r"\+\+|\-\-|\->|==|\!=|<=|>=|&&|\|\||<<|>>"
>                     r"|\+=|\-=|\*=|/=|%=|&=|\|=|\^=|="
>                     r"|\+|\-|\*|/|%|<|>|&|\||\^|~|!|\?|\:"),
> 
>         ("PUNC",    r"[;,\.\[\]\(\)\{\}]"),
> 
>         ("CPP",     r"#\s*(define|include|ifdef|ifndef|if|else"
>                     r"|elif|endif|undef|pragma)"),
> 
>         ("HASH",    r"#"),
> 
>         ("NEWLINE", r"\n"),
> 
>         ("SKIP",    r"[\s]+"),
> 
>         ("MISMATCH",r"."),
>     ]
> 
>     def __init__(self):
>         re_tokens = []
> 
>         for name, pattern in self.TOKEN_LIST:
>             re_tokens.append(f"(?P<{name}>{pattern})")
> 
>         self.re_scanner = re.compile("|".join(re_tokens),
>                                      re.MULTILINE | re.DOTALL)
> 
>     def tokenize(self, code):
>         # Handle continuation lines
>         code = re.sub(r"\\\n", "", code)
> 
>         line_num = 1
>         line_start = 0
> 
>         for match in self.re_scanner.finditer(code):
>             kind   = match.lastgroup
>             value  = match.group()
>             column = match.start() - line_start
> 
>             if kind == "NEWLINE":
>                 line_start = match.end()
>                 line_num += 1
>                 continue
> 
>             if kind == "SKIP":
>                 continue
> 
>             if kind == "MISMATCH":
>                 raise RuntimeError(f"Unexpected character {value!r} "
>                                    f"on line {line_num}")
> 
>             if kind == "ID" and value in self.C_KEYWORDS:
>                 kind = value.upper()
> 
>             # For all other tokens we keep the raw string value
>             yield Token(kind, value, line_num, column)
> 
> if __name__ == "__main__":
>     if len(sys.argv) != 2:
>         print(f"Usage: python {sys.argv[0]} <fname>")
>         sys.exit(1)
> 
>     fname = sys.argv[1]
> 
>     try:
>         with open(fname, 'r', encoding='utf-8') as file:
>             sample = file.read()
>     except FileNotFoundError:
>         print(f"Error: The file '{fname}' was not found.")
>         sys.exit(1)
>     except Exception as e:
>         print(f"An error occurred while reading the file: {str(e)}")
>         sys.exit(1)
> 
>     print(f"Tokens from {fname}:")
> 
>     for tok in CTokenizer().tokenize(sample):
>         print(f"{tok.line:3d}:{tok.column:3d}  {tok.type:12}
> {tok.value!r}")

As a hobby C compiler writer, I must say that you need to implement a
C preprocessor first, because the C preprocessor influences/changes
the syntax. In your tokenizer I can see right away that any line which
begins with '#' must be treated as a single C preprocessor directive,
without tokenizing it further.
But the real pain is C preprocessor macro substitution, IMHO.
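To illustrate the '#' point: a minimal sketch (not Mauro's actual code;
the names CPP_LINE and cpp_aware_tokenize are made up here) of a scanner
rule that swallows an entire preprocessor line, backslash continuations
included, as one CPP token, so an #include no longer decays into
'<', 'linux', '/', ... tokens:

```python
import re

# Matches a whole preprocessor line, including "\<newline>" continuations.
# Must come first in the alternation so it wins over the other rules.
CPP_LINE = r"^[ \t]*#(?:[^\n\\]|\\.|\\\n)*"

TOKEN_LIST = [
    ("CPP",     CPP_LINE),
    ("COMMENT", r"//[^\n]*|/\*[\s\S]*?\*/"),
    ("ID",      r"[A-Za-z_][A-Za-z0-9_]*"),
    ("SKIP",    r"\s+"),
    ("OTHER",   r"."),
]

SCANNER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_LIST),
                     re.MULTILINE)

def cpp_aware_tokenize(code):
    """Yield (kind, value) pairs; '#' lines come back as single tokens."""
    for m in SCANNER.finditer(code):
        if m.lastgroup != "SKIP":
            yield m.lastgroup, m.group()

if __name__ == "__main__":
    src = "#include <linux/types.h>\nint x;\n"
    for kind, value in cpp_aware_tokenize(src):
        print(kind, repr(value))
```

This only defers the problem, of course: the directive body still has
to be handed to a (future) preprocessor stage, and macro substitution
remains untouched.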


