On Wed, 04 Mar 2026 12:07:45 +0200 Jani Nikula <[email protected]> wrote:
> On Mon, 23 Feb 2026, Jonathan Corbet <[email protected]> wrote:
> > Jani Nikula <[email protected]> writes:
> >
> >> There's always the question, if you're putting a lot of effort into
> >> making kernel-doc closer to an actual C parser, why not put all that
> >> effort into using and adapting to, you know, an actual C parser?
> >
> > Not speaking to the current effort but ... in the past, when I have
> > contemplated this (using, say, tree-sitter), the real problem is that
> > those parsers simply strip out the comments. Kerneldoc without comments
> > ... doesn't work very well. If there were a parser without those
> > problems, and which could be made to do the right thing with all of our
> > weird macro usage, it would certainly be worth considering.
>
> I think e.g. libclang and its Python bindings can be made to work. The
> main problems with that are passing proper compiler options (because
> it'll need to include stuff to know about types etc. because it is a
> proper parser), preprocessing everything is going to take time, you need
> to invest a bunch into it to know how slow exactly compared to the
> current thing and whether it's prohibitive, and it introduces an extra
> dependency.
>
> So yeah, there are definitely tradeoffs there. But it's not like this
> constant patching of kernel-doc is exactly burden free either.

On my tests with a simple C tokenizer:

    https://lore.kernel.org/linux-doc/[email protected]/

the tokenizer is working fine and doesn't slow things down much: it
increases the time to parse the entire kernel tree from 37s to 47s for
man pages generation, but it should not change the htmldocs time much,
as right now only ~4 seconds are needed to read and parse the files
pointed to by Documentation kernel-doc tags.

The code can still be cleaned up, as there are still some things
hardcoded in the various dump_* functions that could be better
implemented (*).
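To illustrate why a tokenizer is so much cheaper than a full parser: all it
has to do is split a prototype into identifiers, literals and punctuation.
A minimal regex-based sketch of that idea (hypothetical code, not the actual
kdoc_parser.py implementation):

```python
# Hypothetical sketch of a regex-based C tokenizer, in the spirit of what
# kernel-doc needs: no type information, no preprocessing, just a stream
# of (kind, text) tokens with whitespace and comments dropped.
import re

TOKEN_RE = re.compile(r"""
    (?P<comment>/\*.*?\*/|//[^\n]*)      # block and line comments
  | (?P<string>"(?:\\.|[^"\\])*")        # string literals
  | (?P<char>'(?:\\.|[^'\\])*')          # character literals
  | (?P<ident>[A-Za-z_][A-Za-z0-9_]*)    # identifiers and keywords
  | (?P<number>\d[\dxXa-fA-F.uUlL]*)     # numeric literals (loose)
  | (?P<punct>\.\.\.|<<=|>>=|->|\+\+|--|[-+*/%&|^!<>=]=?|[{}()\[\];,.?:~])
  | (?P<space>\s+)
""", re.VERBOSE | re.DOTALL)

def tokenize(code):
    """Yield (kind, text) pairs, skipping whitespace and comments."""
    for m in TOKEN_RE.finditer(code):
        kind = m.lastgroup
        if kind not in ("space", "comment"):
            yield kind, m.group()

tokens = list(tokenize("struct foo { int bar; /* baz */ };"))
```

Dropping keywords like __attribute__ or expanding macros from an xforms-style
list then becomes a simple filter over this token stream, which is where the
37s -> 47s cost comes from rather than from any real parsing work.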
The advantage of the approach I'm using is that it allows migrating
gradually to rely on the tokenized code, as it can be done incrementally.

(*) for instance, __attribute__ and a couple of other macros are parsed
twice by the dump_struct() logic, in different places.

> I don't know, is it just me, but I'd like to think as a profession we'd
> be past writing ad hoc C parsers by now.

Probably not, but I don't think we need a full C parser, as kernel-doc
just needs to understand data types (enum, struct, typedef, union, vars)
and function/macro prototypes. For that purpose, a tokenizer sounds
enough.

Now, there is the code that is currently at:

    https://github.com/mchehab/linux/blob/tokenizer-v5/tools/lib/python/kdoc/xforms_lists.py

which contains a list of C/gcc/clang keywords that will be ignored, like:

    __attribute__
    static
    extern
    inline

together with a sanitized version of the kernel macros it needs to
handle or ignore:

    DECLARE_BITMAP
    DECLARE_HASHTABLE
    __acquires
    __init
    __exit
    struct_group
    ...

Once we finish cleaning up kdoc_parser.py to rely only on it for
prototype transformations, this will be the only file that requires
changes when new macros start affecting kernel-doc.

As this is complex, and may require manual adjustments, it is probably
better not to try to auto-generate the xforms list at runtime. A better
approach is, IMO, to have C pre-processor code that helps to update it
periodically, e.g. via a target like:

    make kdoc-xforms

that would use either cpp or clang to generate a patch updating the
xforms_list content after new macros affecting docs generation are
added.

--
Thanks,
Mauro

