> -----Original Message-----
> From: Mauro Carvalho Chehab <[email protected]>
> Sent: Tuesday, March 3, 2026 3:53 PM
> To: Jani Nikula <[email protected]>
> Cc: Lobakin, Aleksander <[email protected]>; Jonathan Corbet <[email protected]>; Kees Cook <[email protected]>; Mauro Carvalho Chehab <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; Gustavo A. R. Silva <[email protected]>; Loktionov, Aleksandr <[email protected]>; Randy Dunlap <[email protected]>; Shuah Khan <[email protected]>
> Subject: Re: [PATCH 00/38] docs: several improvements to kernel-doc
>
> On Mon, 23 Feb 2026 15:47:00 +0200
> Jani Nikula <[email protected]> wrote:
>
> > On Wed, 18 Feb 2026, Mauro Carvalho Chehab <[email protected]> wrote:
> > > As anyone who has worked with kernel-doc is aware, using regexes
> > > to handle C input is not great. Instead, we need something closer
> > > to how C statements and declarations are handled.
> > >
> > > Yet, to avoid breaking docs, I avoided touching the regex-based
> > > algorithms inside it, with one exception: the struct_group logic
> > > was using very complex regexes that are incompatible with Python's
> > > internal "re" module.
> > >
> > > So, I came up with a different approach: NestedMatch. The logic
> > > inside it is meant to properly handle braces, square brackets and
> > > parentheses, which is closer to what a C lexical parser does. At
> > > the time, I added a TODO about the need to extend that.
> >
> > There's always the question, if you're putting a lot of effort into
> > making kernel-doc closer to an actual C parser, why not put all that
> > effort into using and adapting to, you know, an actual C parser?
>
> Playing with this idea, I found that it is not that hard to write an
> actual C parser - or at least a tokenizer. There is already an example
> of one at:
>
> https://docs.python.org/3/library/re.html
>
> I did a quick implementation, and it seems to be able to do its job:
>
> $ ./tokenizer.py ./include/net/netlink.h
> 1: 0 COMMENT '/* SPDX-License-Identifier: GPL-2.0 */'
> 2: 0 CPP '#ifndef'
> 2: 8 ID '__NET_NETLINK_H'
> 3: 0 CPP '#define'
> 3: 8 ID '__NET_NETLINK_H'
> 5: 0 CPP '#include'
> 5: 9 OP '<'
> 5: 10 ID 'linux'
> 5: 15 OP '/'
> 5: 16 ID 'types'
> 5: 21 PUNC '.'
> 5: 22 ID 'h'
> 5: 23 OP '>'
> 6: 0 CPP '#include'
> 6: 9 OP '<'
> 6: 10 ID 'linux'
> 6: 15 OP '/'
> 6: 16 ID 'netlink'
> 6: 23 PUNC '.'
> 6: 24 ID 'h'
> 6: 25 OP '>'
> 7: 0 CPP '#include'
> 7: 9 OP '<'
> 7: 10 ID 'linux'
> 7: 15 OP '/'
> 7: 16 ID 'jiffies'
> 7: 23 PUNC '.'
> 7: 24 ID 'h'
> 7: 25 OP '>'
> 8: 0 CPP '#include'
> 8: 9 OP '<'
> 8: 10 ID 'linux'
> 8: 15 OP '/'
> 8: 16 ID 'in6'
> ...
> 12: 1 COMMENT '/**\n * Standard attribute types to specify validation policy\n */'
> 13: 0 ENUM 'enum'
> 13: 5 PUNC '{'
> 14: 1 ID 'NLA_UNSPEC'
> 14: 11 PUNC ','
> 15: 1 ID 'NLA_U8'
> 15: 7 PUNC ','
> 16: 1 ID 'NLA_U16'
> 16: 8 PUNC ','
> 17: 1 ID 'NLA_U32'
> 17: 8 PUNC ','
> 18: 1 ID 'NLA_U64'
> 18: 8 PUNC ','
> 19: 1 ID 'NLA_STRING'
> 19: 11 PUNC ','
> 20: 1 ID 'NLA_FLAG'
> ...
> 41: 0 STRUCT 'struct'
> 41: 7 ID 'netlink_range_validation'
> 41: 32 PUNC '{'
> 42: 1 ID 'u64'
> 42: 5 ID 'min'
> 42: 8 PUNC ','
> 42: 10 ID 'max'
> 42: 13 PUNC ';'
> 43: 0 PUNC '}'
> 43: 1 PUNC ';'
> 45: 0 STRUCT 'struct'
> 45: 7 ID 'netlink_range_validation_signed'
> 45: 39 PUNC '{'
> 46: 1 ID 's64'
> 46: 5 ID 'min'
> 46: 8 PUNC ','
> 46: 10 ID 'max'
> 46: 13 PUNC ';'
> 47: 0 PUNC '}'
> 47: 1 PUNC ';'
> 49: 0 ENUM 'enum'
> 49: 5 ID 'nla_policy_validation'
> 49: 27 PUNC '{'
> 50: 1 ID 'NLA_VALIDATE_NONE'
> 50: 18 PUNC ','
> 51: 1 ID 'NLA_VALIDATE_RANGE'
> 51: 19 PUNC ','
> 52: 1 ID 'NLA_VALIDATE_RANGE_WARN_TOO_LONG'
> 52: 33 PUNC ','
> 53: 1 ID 'NLA_VALIDATE_MIN'
> 53: 17 PUNC ','
> 54: 1 ID 'NLA_VALIDATE_MAX'
> 54: 17 PUNC ','
> 55: 1 ID 'NLA_VALIDATE_MASK'
> 55: 18 PUNC ','
> 56: 1 ID 'NLA_VALIDATE_RANGE_PTR'
> 56: 23 PUNC ','
> 57: 1 ID 'NLA_VALIDATE_FUNCTION'
> 57: 22 PUNC ','
> 58: 0 PUNC '}'
> 58: 1 PUNC ';'
>
> It sounds doable to use it and, at least in this example, it properly
> picked up the IDs.
>
> On the other hand, using it would require lots of changes in
> kernel-doc. So, I guess I'll add a tokenizer to kernel-doc, but we
> should likely start using it gradually.
>
> Maybe starting with NestedMatch and with the public/private comment
> handling (which is currently half-broken).
>
> As a reference, the above was generated with the code below, which was
> based on the Python re documentation.
>
> Comments?
>
> ---
>
> One side note: right now, we're not using typing in kernel-doc, nor
> really following a proper coding style.
>
> I wanted to use typing during the conversion, and to place constants
> in uppercase, as those are currently the best practices, but doing
> that while converting from Perl was very annoying. So, I opted to
> keep things simpler. Now that we have it coded, perhaps it is time to
> define a coding style and apply it to kernel-doc.
>
> --
> Thanks,
> Mauro
>
> #!/usr/bin/env python3
>
> import sys
> import re
>
> class Token():
>     def __init__(self, type, value, line, column):
>         self.type = type
>         self.value = value
>         self.line = line
>         self.column = column
>
> class CTokenizer():
>     C_KEYWORDS = {
>         "struct", "union", "enum",
>     }
>
>     TOKEN_LIST = [
>         ("COMMENT", r"//[^\n]*|/\*[\s\S]*?\*/"),
>
>         ("STRING", r'"(?:\\.|[^"\\])*"'),
>         ("CHAR", r"'(?:\\.|[^'\\])'"),
>
>         ("NUMBER", r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
>                    r"[0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?[fFlL]*"),
>
>         ("ID", r"[A-Za-z_][A-Za-z0-9_]*"),
>
>         ("OP", r"\+\+|\-\-|\->|==|\!=|<=|>=|&&|\|\||<<|>>|\+=|\-=|\*=|/=|%="
>                r"|&=|\|=|\^=|=|\+|\-|\*|/|%|<|>|&|\||\^|~|!|\?|\:"),
>
>         ("PUNC", r"[;,\.\[\]\(\)\{\}]"),
>
>         ("CPP", r"#\s*(define|include|ifdef|ifndef|if|else|elif|endif|undef|pragma)"),
>
>         ("HASH", r"#"),
>
>         ("NEWLINE", r"\n"),
>
>         ("SKIP", r"[\s]+"),
>
>         ("MISMATCH", r"."),
>     ]
>
>     def __init__(self):
>         re_tokens = []
>
>         for name, pattern in self.TOKEN_LIST:
>             re_tokens.append(f"(?P<{name}>{pattern})")
>
>         self.re_scanner = re.compile("|".join(re_tokens),
>                                      re.MULTILINE | re.DOTALL)
>
>     def tokenize(self, code):
>         # Handle continuation lines
>         code = re.sub(r"\\\n", "", code)
>
>         line_num = 1
>         line_start = 0
>
>         for match in self.re_scanner.finditer(code):
>             kind = match.lastgroup
>             value = match.group()
>             column = match.start() - line_start
>
>             if kind == "NEWLINE":
>                 line_start = match.end()
>                 line_num += 1
>                 continue
>
>             if kind in {"SKIP"}:
>                 continue
>
>             if kind == "MISMATCH":
>                 raise RuntimeError(f"Unexpected character {value!r} on line {line_num}")
>
>             if kind == "ID" and value in self.C_KEYWORDS:
>                 kind = value.upper()
>
>             # For all other tokens we keep the raw string value
>             yield Token(kind, value, line_num, column)
>
> if __name__ == "__main__":
>     if len(sys.argv) != 2:
>         print(f"Usage: python {sys.argv[0]} <fname>")
>         sys.exit(1)
>
>     fname = sys.argv[1]
>
>     try:
>         with open(fname, 'r', encoding='utf-8') as file:
>             sample = file.read()
>     except FileNotFoundError:
>         print(f"Error: The file '{fname}' was not found.")
>         sys.exit(1)
>     except Exception as e:
>         print(f"An error occurred while reading the file: {str(e)}")
>         sys.exit(1)
>
>     print(f"Tokens from {fname}:")
>
>     for tok in CTokenizer().tokenize(sample):
>         print(f"{tok.line:3d}:{tok.column:3d} {tok.type:12} {tok.value!r}")
As a hobby C compiler writer, I must say that you need to implement the C
preprocessor first, because the preprocessor influences/changes the syntax.
In your tokenizer, I see right away that any line which begins with '#'
must be treated as a single C preprocessor command, without further
tokenizing its body. But the real pain, IMHO, is C preprocessor
substitutions.
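The '#'-line point can be sketched on top of the tokenizer above: before
scanning, split out entire preprocessor lines as opaque chunks instead of
tokenizing their bodies. This is only an illustration, not kernel-doc code;
the helper name split_cpp_lines and its two-kind output are my own
invention, and it assumes backslash-newline continuations have already been
joined, as the tokenizer above does before scanning.

```python
import re

# A line whose first non-blank character is '#' is one preprocessor
# directive. Naive by design: a '#' at the start of a line inside a
# comment or string literal would be misclassified by this sketch.
CPP_LINE = re.compile(r"^[ \t]*#[^\n]*", re.MULTILINE)

def split_cpp_lines(code):
    """Yield ('CPP', line) for preprocessor lines, ('C', chunk) for the rest.

    The 'C' chunks can then be fed to the regular tokenizer, while each
    'CPP' chunk stays a single unit, never tokenized further.
    """
    pos = 0
    for m in CPP_LINE.finditer(code):
        if m.start() > pos:
            yield ("C", code[pos:m.start()])
        yield ("CPP", m.group())
        pos = m.end()
    if pos < len(code):
        yield ("C", code[pos:])

if __name__ == "__main__":
    sample = "#include <linux/types.h>\nstruct s { int a; };\n"
    for kind, text in split_cpp_lines(sample):
        print(kind, repr(text))
```

Macro substitution is a different beast: it happens between this lexing
phase and parsing, so no pre-splitting trick makes it go away; that part
really would need preprocessor state.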