On Mon, 23 Feb 2026 15:47:00 +0200 Jani Nikula <[email protected]> wrote:
> On Wed, 18 Feb 2026, Mauro Carvalho Chehab <[email protected]> wrote:
> > As anyone who has worked with kernel-doc before is aware, using regex
> > to handle C input is not great. Instead, we need something closer to
> > how C statements and declarations are handled.
> >
> > Yet, to avoid breaking docs, I avoided touching the regex-based
> > algorithms inside it, with one exception: the struct_group logic was
> > using very complex regexes that are incompatible with Python's
> > internal "re" module.
> >
> > So, I came up with a different approach: NestedMatch. The logic inside
> > it is meant to properly handle brackets, square brackets and
> > parentheses, which is closer to what a C lexical parser does. At that
> > time, I added a TODO about the need to extend it.
>
> There's always the question, if you're putting a lot of effort into
> making kernel-doc closer to an actual C parser, why not put all that
> effort into using and adapting to, you know, an actual C parser?

Playing with this idea, it is not that hard to write an actual C parser,
or at least a tokenizer. There is already an example of one in the
Python "re" documentation:

	https://docs.python.org/3/library/re.html

I did a quick implementation, and it seems to be able to do its job:

	$ ./tokenizer.py ./include/net/netlink.h
	  1:  0 COMMENT      '/* SPDX-License-Identifier: GPL-2.0 */'
	  2:  0 CPP          '#ifndef'
	  2:  8 ID           '__NET_NETLINK_H'
	  3:  0 CPP          '#define'
	  3:  8 ID           '__NET_NETLINK_H'
	  5:  0 CPP          '#include'
	  5:  9 OP           '<'
	  5: 10 ID           'linux'
	  5: 15 OP           '/'
	  5: 16 ID           'types'
	  5: 21 PUNC         '.'
	  5: 22 ID           'h'
	  5: 23 OP           '>'
	  6:  0 CPP          '#include'
	  6:  9 OP           '<'
	  6: 10 ID           'linux'
	  6: 15 OP           '/'
	  6: 16 ID           'netlink'
	  6: 23 PUNC         '.'
	  6: 24 ID           'h'
	  6: 25 OP           '>'
	  7:  0 CPP          '#include'
	  7:  9 OP           '<'
	  7: 10 ID           'linux'
	  7: 15 OP           '/'
	  7: 16 ID           'jiffies'
	  7: 23 PUNC         '.'
	  7: 24 ID           'h'
	  7: 25 OP           '>'
	  8:  0 CPP          '#include'
	  8:  9 OP           '<'
	  8: 10 ID           'linux'
	  8: 15 OP           '/'
	  8: 16 ID           'in6'
	  ...
	 12:  1 COMMENT      '/**\n * Standard attribute types to specify validation policy\n */'
	 13:  0 ENUM         'enum'
	 13:  5 PUNC         '{'
	 14:  1 ID           'NLA_UNSPEC'
	 14: 11 PUNC         ','
	 15:  1 ID           'NLA_U8'
	 15:  7 PUNC         ','
	 16:  1 ID           'NLA_U16'
	 16:  8 PUNC         ','
	 17:  1 ID           'NLA_U32'
	 17:  8 PUNC         ','
	 18:  1 ID           'NLA_U64'
	 18:  8 PUNC         ','
	 19:  1 ID           'NLA_STRING'
	 19: 11 PUNC         ','
	 20:  1 ID           'NLA_FLAG'
	 ...
	 41:  0 STRUCT       'struct'
	 41:  7 ID           'netlink_range_validation'
	 41: 32 PUNC         '{'
	 42:  1 ID           'u64'
	 42:  5 ID           'min'
	 42:  8 PUNC         ','
	 42: 10 ID           'max'
	 42: 13 PUNC         ';'
	 43:  0 PUNC         '}'
	 43:  1 PUNC         ';'
	 45:  0 STRUCT       'struct'
	 45:  7 ID           'netlink_range_validation_signed'
	 45: 39 PUNC         '{'
	 46:  1 ID           's64'
	 46:  5 ID           'min'
	 46:  8 PUNC         ','
	 46: 10 ID           'max'
	 46: 13 PUNC         ';'
	 47:  0 PUNC         '}'
	 47:  1 PUNC         ';'
	 49:  0 ENUM         'enum'
	 49:  5 ID           'nla_policy_validation'
	 49: 27 PUNC         '{'
	 50:  1 ID           'NLA_VALIDATE_NONE'
	 50: 18 PUNC         ','
	 51:  1 ID           'NLA_VALIDATE_RANGE'
	 51: 19 PUNC         ','
	 52:  1 ID           'NLA_VALIDATE_RANGE_WARN_TOO_LONG'
	 52: 33 PUNC         ','
	 53:  1 ID           'NLA_VALIDATE_MIN'
	 53: 17 PUNC         ','
	 54:  1 ID           'NLA_VALIDATE_MAX'
	 54: 17 PUNC         ','
	 55:  1 ID           'NLA_VALIDATE_MASK'
	 55: 18 PUNC         ','
	 56:  1 ID           'NLA_VALIDATE_RANGE_PTR'
	 56: 23 PUNC         ','
	 57:  1 ID           'NLA_VALIDATE_FUNCTION'
	 57: 22 PUNC         ','
	 58:  0 PUNC         '}'
	 58:  1 PUNC         ';'

It sounds doable to use it, and, at least in this example, it properly
picked the IDs. On the other hand, using it would require lots of
changes in kernel-doc.

So, I guess I'll add a tokenizer to kernel-doc, but we should likely
start using it gradually, maybe starting with NestedMatch and with the
public/private comment handling (which is currently half-broken).

As a reference, the above was generated with the code below, which was
based on the Python "re" documentation.

Comments?

---

One side note: right now, we're not using typing in kernel-doc, nor
really following a proper coding style. I wanted to use typing during
the conversion, and to place constants in uppercase, as that is
currently considered best practice, but doing so while converting from
Perl was very annoying.
So, I opted to make things simpler. Now that we have it coded, perhaps
it is time to define a coding style and apply it to kernel-doc.

--
Thanks,
Mauro

#!/usr/bin/env python3

import re
import sys


class Token:
    def __init__(self, type, value, line, column):
        self.type = type
        self.value = value
        self.line = line
        self.column = column


class CTokenizer:
    C_KEYWORDS = {
        "struct",
        "union",
        "enum",
    }

    TOKEN_LIST = [
        ("COMMENT",  r"//[^\n]*|/\*[\s\S]*?\*/"),
        ("STRING",   r'"(?:\\.|[^"\\])*"'),
        ("CHAR",     r"'(?:\\.|[^'\\])'"),
        ("NUMBER",   r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
                     r"[0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?[fFlL]*"),
        ("ID",       r"[A-Za-z_][A-Za-z0-9_]*"),
        ("OP",       r"\+\+|\-\-|\->|==|\!=|<=|>=|&&|\|\||<<|>>|\+=|\-=|\*=|/=|%="
                     r"|&=|\|=|\^=|=|\+|\-|\*|/|%|<|>|&|\||\^|~|!|\?|\:"),
        ("PUNC",     r"[;,\.\[\]\(\)\{\}]"),
        ("CPP",      r"#\s*(define|include|ifdef|ifndef|if|else|elif|endif|undef|pragma)"),
        ("HASH",     r"#"),
        ("NEWLINE",  r"\n"),
        ("SKIP",     r"\s+"),
        ("MISMATCH", r"."),
    ]

    def __init__(self):
        re_tokens = []

        for name, pattern in self.TOKEN_LIST:
            re_tokens.append(f"(?P<{name}>{pattern})")

        self.re_scanner = re.compile("|".join(re_tokens),
                                     re.MULTILINE | re.DOTALL)

    def tokenize(self, code):
        # Handle continuation lines
        code = re.sub(r"\\\n", "", code)

        line_num = 1
        line_start = 0

        for match in self.re_scanner.finditer(code):
            kind = match.lastgroup
            value = match.group()
            column = match.start() - line_start

            if kind == "NEWLINE":
                line_start = match.end()
                line_num += 1
                continue

            if kind == "SKIP":
                # A greedy whitespace run may swallow newlines before
                # NEWLINE gets a chance to match: keep line tracking right
                newlines = value.count("\n")
                if newlines:
                    line_num += newlines
                    line_start = match.start() + value.rfind("\n") + 1
                continue

            if kind == "MISMATCH":
                raise RuntimeError(f"Unexpected character {value!r} on line {line_num}")

            if kind == "ID" and value in self.C_KEYWORDS:
                kind = value.upper()

            # For all other tokens we keep the raw string value
            yield Token(kind, value, line_num, column)

            # Multi-line tokens (e.g. block comments) also advance lines
            newlines = value.count("\n")
            if newlines:
                line_num += newlines
                line_start = match.start() + value.rfind("\n") + 1


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print(f"Usage: python {sys.argv[0]} <fname>")
        sys.exit(1)

    fname = sys.argv[1]

    try:
        with open(fname, "r", encoding="utf-8") as file:
            sample = file.read()
    except FileNotFoundError:
        print(f"Error: The file '{fname}' was not found.")
        sys.exit(1)
    except Exception as e:
        print(f"An error occurred while reading the file: {e}")
        sys.exit(1)

    print(f"Tokens from {fname}:")

    for tok in CTokenizer().tokenize(sample):
        print(f"{tok.line:3d}:{tok.column:3d} {tok.type:12} {tok.value!r}")
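As an aside, to illustrate the "start using it gradually" idea: once a token
stream exists, the job NestedMatch does today (finding a balanced "{ ... }"
body) reduces to a depth counter over tokens, with no regex ever seeing the
raw C text. The snippet below is a hypothetical, self-contained sketch; the
minimal `tokenize()` and the `struct_body()` helper are stand-ins invented
for illustration, not kernel-doc code:

```python
# Hypothetical sketch (not kernel-doc code): extract a balanced
# "{ ... }" body by counting brace depth over a token stream.
import re

# Minimal stand-in tokenizer: comments, identifiers, everything else
TOKEN_RE = re.compile(r"/\*.*?\*/|[A-Za-z_]\w*|\S", re.DOTALL)

def tokenize(code):
    return TOKEN_RE.findall(code)

def struct_body(tokens, name):
    """Return the tokens inside the braces of 'struct <name> { ... }'."""
    for i, tok in enumerate(tokens):
        if tok == "struct" and tokens[i + 1:i + 3] == [name, "{"]:
            depth = 0
            for j in range(i + 2, len(tokens)):
                if tokens[j] == "{":
                    depth += 1
                elif tokens[j] == "}":
                    depth -= 1
                    if depth == 0:
                        # Body without the outer braces
                        return tokens[i + 3:j]
    return None

code = "struct foo { int a; struct { int b; } nested; };"
print(struct_body(tokenize(code), "foo"))
# ['int', 'a', ';', 'struct', '{', 'int', 'b', ';', '}', 'nested', ';']
```

Note that the nested inner struct does not confuse the depth counter, which
is exactly the case where a naive regex over raw text falls over.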

