On Tue, 03 Mar 2026 13:24:55 -0700 Jonathan Corbet <[email protected]> wrote:
> Mauro Carvalho Chehab <[email protected]> writes: > > > Heh, if we start using a code like the tokenizer I'm experimenting > > here: > > > > https://lore.kernel.org/linux-doc/20260303155310.5235b367@localhost/ > > > > we could probably get rid of regexes in the future, using instead > > a loop that would be picking "ID" tokens, e.g. basically we would > > have something similar to this completely untested code snippet: > > ...which has some appeal, but I will confess to being leery of another > massive kernel-doc thrash; it would be nice if things settled for a bit. Yeah, I feel your pain. The idea is not to simply rewrite the entire kernel-doc. Just to use it at the places that have known hard to solve bugs. > One can always dream :) The thing is that there are some issues at kernel-doc that can only be solved with a better parser, and a plain regex logic won't fix, even with really complex expressions. I'm not talking about variable handling like on this specific patch. For it, the current pure regex approach works fine, at least for the cases we already mapped. However, kernel-doc, even after this series, do a crap job on several places: 1. Macros used to build structs and function prototypes. During the conversion I wrote a half-baked NestedMatch class to be able of properly handling struct_group*(), which is the best example of the involved complexity. It works, but it requires parsing the code twice. Also, It probably will fail with nested struct_group; 2. Nested structs. Current logic just transform them on an un-nested struct-like pseudo-code to re-use the structs regex-based parser; 3. Nested struct identifiers handling. Spec says that if one has: struct { struct { int foo; } a; } b; kernel-doc should document "foo" as "a.foo", but this is not always the case, due to bugs at the parser. So, on some places, you need to use "foo"; on others, "a.foo". 4. Public/private handling. Code almost works, but when it finds a unmatched private inside a nested struct, it will can hide close brackets. This prevents fixing (2). 5. Comments strip. Code kinda works if you don't touch it, but when trying to solve the previous issues, I ended discovering some hidden problems related to the way it does. (That's basically when I ended opting to try a different approach: too much changes to try to live with a plain regex approach, plus all stuff needed for NestedMatch to do a better job) 6. Proper whitespace and alignment at the output. The current way we parse things mean that little changes end mangling with whitespaces, line breaks and/or indentation. Perhaps we could use some token-based formatter for man pages to properly handle open/close brackets. For rst output, we're relying at the C domain to handle it for us. Still, perhaps a tokenizer-based approach can just add a single whitespace everywhere, which would help us to check before/after differences on kernel-doc changes. I attempted fixing this at the /38 patch series and afterwards (good news that that we have 18 less patches after you merged this one), but my current pending patch pile has stlll +40 patches to address issues and add unit tests. Among them, I have changes: - adding support for "\1", "\2", "\3"... group matches at NestedMatch sub. It works. The code itself is small, but very complex; - writing a different logic to address comment stripping; - write a logic to pick struct IDs on nested cases that will be using an approach similar to NestedMatch. - some of those new logic are recursive, which makes them more complex to be understood and tested. Such approach works, but, as Jani pointed, this ends adding lots of complexity, and the main reason is that we're acquainted to use regexes - or perhaps too afraid to handle it on a different way. Also, they ended introducing extra bugs. I had to confess that I also a little reticent on trying to use a tokenizer, as I was afraid that this would require extra libraries to have something similar to what flex/bison would be doing. Then, I realized that perhaps the already internal libraries might just have what we needed. So, after some research, looking at: https://docs.python.org/3/library/re.html#writing-a-tokenizer and doing some (so far simple) tests, I'm starting to think about modifying my pending patches to use a code similar to it. Note that, at this point, I didn't try yet to use the tokenizer, so for now this is still mostly a brainstorm. I intend to try to use the tokenizer to handle comments e.g. touching the logic at trim_private_members() and see how it behaves. This is self-contained, checking the results would likely be easy, and I don't expect big surprises. Depending on such tests, I may try to modify NestedMatch and my patches using it to use the same approach. So, if all ends well, the changes will so far be confined to the code used by dump_struct(). Let's see. Thanks, Mauro

