> Re: speed, I see four "levels" in which the parser can be implemented:
> 1. Compare characters one by one

Definitely the fastest, but not very readable. If you wanted to go this way, it 
would be better to convert the parser into a token-based parser (i.e. a 
"proper" parser). Such parsers first split an input like
```
function y = func4(a, b)
```
into tokens like these
1. `function` - keyword
2. `y` - identifier
3. `=` - `=` operator
4. `func4` - identifier
5. `(`
6. `a` - identifier
7. `,`
8. `b` - identifier
9. `)`

and then perform the analysis on top of these pre-parsed tokens. While creating 
the tokens, these parsers also skip things like whitespace and comments, so 
you don't have to worry about them in the rest of the code. They read the 
input character by character and do the necessary comparisons character-wise, 
so they are very fast. In ctags, these are all the parsers that don't use 
`readLineFromInputFile()` or regular expressions.
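
To make the idea concrete, here is a minimal sketch of such a tokenizer for the example line above. It is not how any existing ctags parser is written: the `Token` type, the `TokenKind` values, and the `next_token()` / `tokenize()` helpers are all names made up for this illustration, and only the `function` keyword and single-character operators are handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds for this sketch; not the ctags API. */
typedef enum {
    TOK_EOF, TOK_KEYWORD, TOK_IDENTIFIER, TOK_OPERATOR, TOK_PUNCT
} TokenKind;

typedef struct {
    TokenKind kind;
    char text[64];
} Token;

/* Reads one token starting at *pos and advances *pos past it.
 * Whitespace and '%' comments are skipped here, so the code that
 * consumes the tokens never sees them. */
static Token next_token(const char *input, size_t *pos)
{
    Token tok = { TOK_EOF, "" };
    size_t len = 0;

    assert(input != NULL && pos != NULL);

    for (;;) {
        while (isspace((unsigned char)input[*pos]))
            (*pos)++;
        if (input[*pos] != '%')
            break;
        /* Comment: skip to the end of the line. */
        while (input[*pos] != '\0' && input[*pos] != '\n')
            (*pos)++;
    }

    unsigned char c = (unsigned char)input[*pos];
    if (c == '\0')
        return tok;

    if (isalpha(c) || c == '_') {
        /* Identifier or keyword, compared character by character. */
        while (isalnum((unsigned char)input[*pos]) || input[*pos] == '_') {
            if (len < sizeof tok.text - 1)
                tok.text[len++] = input[*pos];
            (*pos)++;
        }
        tok.text[len] = '\0';
        tok.kind = strcmp(tok.text, "function") == 0 ? TOK_KEYWORD
                                                     : TOK_IDENTIFIER;
        return tok;
    }

    /* Anything else is treated as a single-character token. */
    tok.text[0] = (char)c;
    tok.text[1] = '\0';
    tok.kind = (c == '=') ? TOK_OPERATOR : TOK_PUNCT;
    (*pos)++;
    return tok;
}

/* Fills out[] with up to max tokens and returns how many were read. */
static size_t tokenize(const char *input, Token *out, size_t max)
{
    size_t pos = 0, n = 0;

    while (n < max) {
        Token t = next_token(input, &pos);
        if (t.kind == TOK_EOF)
            break;
        out[n++] = t;
    }
    return n;
}
```

Calling `tokenize()` on `function y = func4(a, b) % comment` yields the nine tokens listed above; the trailing comment never reaches the caller.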

This is definitely the way to go if you want the best possible parser, but 
such parsers take more time to implement and you'd have to rewrite the current 
implementation of the Matlab parser from scratch.

`readLine()` based parsers are definitely shittier but often just fine if the 
language isn't too crazy.

> 2. strncmp() and strstr()

This is used in most ctags `readLine()` based parsers.
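
As a rough illustration of this style (not code taken from any ctags parser), the sketch below pulls the function name out of a single line using `strncmp()` and `strstr()`. The helper name `extract_function_name()` is made up, and it assumes the line has already had comments removed, since an `=` inside a comment would mislead it.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: copies the function name from a line such as
 * "function y = func4(a, b)" into name. Returns true on success. */
static bool extract_function_name(const char *line, char *name,
                                  size_t name_size)
{
    static const char keyword[] = "function";
    const size_t kwlen = sizeof keyword - 1;

    assert(line != NULL && name != NULL && name_size > 0);

    while (isspace((unsigned char)*line))
        line++;

    /* The line must start with the keyword followed by whitespace. */
    if (strncmp(line, keyword, kwlen) != 0 ||
        !isspace((unsigned char)line[kwlen]))
        return false;
    line += kwlen;

    /* With "outputs = name(...)" the name follows '=';
     * otherwise it follows the keyword directly. */
    const char *start = strstr(line, "=");
    start = (start != NULL) ? start + 1 : line;
    while (isspace((unsigned char)*start))
        start++;

    size_t len = 0;
    while ((isalnum((unsigned char)start[len]) || start[len] == '_') &&
           len < name_size - 1) {
        name[len] = start[len];
        len++;
    }
    name[len] = '\0';
    return len > 0;
}
```

For `function y = func4(a, b)` this stores `func4` in `name`; a line such as `functional = 3` is rejected because the keyword is not followed by whitespace.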

> 3. sscanf() 💡

As with Lex, I'm not entirely sure about the performance of this: even though 
you don't have to backtrack, I don't know how these rules are evaluated and 
whether it's fast enough. Also, personally, I'd prefer plain C code that does 
this stuff. It's more readable and it can be reused: you can remove everything 
after `%` in C first, and then the rest of the code doesn't have to care about 
comments any more. (This is one of the typical simplifications of `readLine()` 
based parsers: `%` could be inside a string, in which case you shouldn't strip 
it.)
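
A minimal sketch of that simplification, with a made-up helper name `strip_comment()`; it deliberately has the limitation mentioned above and will also cut a `%` that sits inside a string:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: truncates the line in place at the first '%'.
 * Naive on purpose: a '%' inside a string literal is cut as well. */
static void strip_comment(char *line)
{
    assert(line != NULL);

    char *percent = strchr(line, '%');
    if (percent != NULL)
        *percent = '\0';
}
```

Running this once per line right after reading it means the later `strncmp()` / `strstr()` checks never have to think about comments.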

What's sure is that ctags parsers don't really use this method.

> 4. Regular expressions

Regexii (as the ancient Romans commonly called them) are probably the slowest 
and least flexible option, but the fastest to write, and better than no parser 
at all.
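
For comparison, here is a sketch of the same function-name extraction using POSIX `regcomp()` / `regexec()` from `<regex.h>`. This is plain POSIX rather than the regex interface ctags offers to its parsers, and both the pattern and the `match_function_name()` helper are assumptions made for this example.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: matches "function [outputs =] name" and copies
 * the name (capture group 2) into name. Returns true on a match. */
static bool match_function_name(const char *line, char *name,
                                size_t name_size)
{
    static const char pattern[] =
        "^[[:space:]]*function[[:space:]]+"
        "([^=]*=[[:space:]]*)?"           /* optional "outputs =" part */
        "([[:alpha:]_][[:alnum:]_]*)";    /* the function name */
    regex_t re;
    regmatch_t m[3];
    bool ok = false;

    assert(line != NULL && name != NULL && name_size > 0);

    /* Compiled on every call for brevity; a real parser would compile
     * the pattern once and reuse it. */
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return false;

    if (regexec(&re, line, 3, m, 0) == 0 && m[2].rm_so >= 0) {
        size_t len = (size_t)(m[2].rm_eo - m[2].rm_so);
        if (len >= name_size)
            len = name_size - 1;
        memcpy(name, line + m[2].rm_so, len);
        name[len] = '\0';
        ok = true;
    }
    regfree(&re);
    return ok;
}
```

The pattern is quick to write, but every special case (comments, strings, line continuations) has to be squeezed into it, which is where the lack of flexibility shows.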

> Honestly I think I'd entirely ditch parsing structs; knowing that a certain 
> variable at a certain point in the program is a struct isn't really that 
> relevant, and universal-ctags doesn't do it anyway. Class parsing would be 
> more useful.

I'm not a Matlab user but I guess this is probably fine for Geany.

> For a similar reason, I'd avoid parsing all variables as universal-ctags 
> does; having a list with EVERY variable assignment in EVERY function in the 
> script seems excessive. (However, it might be a good idea to list global and 
> persistent variables.)

This is where you might run into a problem in universal-ctags: the set of 
"kinds" a ctags parser supports is effectively an interface, and dropping one 
is a backwards-incompatible change. In any case, before you spend more time on 
this parser, I'd suggest opening an issue in the universal-ctags project 
describing which way you want to proceed and asking whether it's fine, to avoid 
unnecessary work (the maintainer of the project tends to be very responsive and 
supportive).


-- 
Reply to this email directly or view it on GitHub:
https://github.com/geany/geany/pull/3358#issuecomment-1370294939