> Perhaps the sscanf method can find strings and comments first and not look 
> for other stuff inside them?

The regices(?) defined in upstream ctags are meant to match from the beginning 
of the line (notice the `^` at the beginning), so they're not going to 
accidentally match things inside an end-of-line comment or a string.

If you want to ignore definitions inside _block comments_ (and I considered 
modifying my current PR or creating a new one for that) then the good news is 
that that's relatively easy to do, since a block comment is always going to be 
delimited by lines containing **only** `%{` and `%}` (possibly with some 
whitespace) and nothing else, so there's no risk of accidentally interpreting 
the content of a string or comment as a block comment delimiter.  The only
difficulty in this regard is that block comments can be nested (think of C's 
`#if 0 ... #endif`), but that's easy to solve with a counter.

As for end-of-line comments, those always start with `%` or `...` (excluding 
those in a string, but you won't find strings in function definitions), so they 
are easy to avoid.  And strings don't span multiple lines.  So I think that 
other than the rare `%{`...`%}` case, it should be straightforward.

> Don't know how fast sscanf is, its possible it won't be any faster than 
> regexes since its still **scanning the input more than once.** Thats what is 
> slow for regex parsers, the fact that multiple regexes are applied to the 
> same input, not the fact that a well optimised regex library like PCRE is 
> slow. Repeat scans is one thing well written character by character parsers 
> try not to do.

My understanding is that the scanf family of functions only need to parse one 
character at a time, and **never backtrack or look ahead** more than one 
character, so they don't need to perform multiple passes on the input.  For 
example, if you do `scanf("%d", &i)`, that's going to read character by 
character from stdin, and as soon as it finds a character that doesn't match an 
integer (it's not a digit or a leading whitespace/sign), it'll put that one 
character back into stdin (see `ungetc()`) and return.  Think of it as a 
[possessive regex](https://www.regular-expressions.info/possessive.html).
Regular expressions, on the other hand, have to try to match different 
combinations until one of them works (unless they're 
[possessive](https://www.regular-expressions.info/possessive.html)), hence the 
need for backtracking and multiple passes and inefficiencies.

For example, matching the string `abcde123` against the regex 
`([a-z]+)([aeiou]+)` will succeed, because the engine backtracks: the first 
capture group gives up a character and matches `abcd` (even though it could 
match all of `abcde`), leaving `e` for the second group.  But the scanf 
pattern `%[a-z]%[aeiou]` will fail, because the first specifier will just eat 
all the letters and not leave any for the second.

So in other words, `sscanf()` is a glorified char-by-char parser that only 
needs to look ahead one character at most.  And it's also probably very well 
optimized, so it might be even faster than writing the parsing state machine 
yourself.  Maybe we could just write some tests and time them.

> The writers of the Python parser took the "every assignment is a declaration" 
> approach and the Julia parser writers took the approach "no assignment is a 
> declaration". So Python is a precedent for having all the names available, 
> and I haven't seen too many complaints about it.

I had never noticed this, and it feels kinda wrong that the same variable can 
be "declared" in multiple places, but then again, Python programs are usually a 
bunch of functions/classes with maybe a few "file-scope" variables, and the 
parser ignores assignments performed inside a function (which are local to the 
function).  So for a typical Python file structure, it makes sense to assume 
that every assignment done outside of a function is some sort of "global" 
variable.  But one could also make a Python script where most/all the code is 
outside of a function and is executed directly (this would look a bit ugly in 
Geany because of all the variables, but that's the price to pay if we want a 
"normal" module-like Python file to look good).

Similarly, Matlab files can be of two types: either "scripts" where all the 
code inside is executed or "functions" containing one or more function 
definitions.  Maybe we can just disable variable assignment detection when 
inside a function (i.e., when a line defining a function has been scanned 
before), and then we'd have Python's behavior.

However, I'd argue that in the case of Matlab files, there's nothing similar to 
C's "file-scope variables" since you'll never mix function definitions and 
variable declarations, so maybe it's a bit pointless to parse variables.  But 
I'm OK with it as long as the ones in functions are excluded.

> > Honestly I think I'd entirely ditch parsing structs;
> 
> Don't know matlab enough to comment, but if no Matlabbers object I guess its 
> ok if upstream doesn't do it.

I have some experience with Matlab and I'd say structs aren't used that often, 
or at least I don't use them often (and they're far from the only type of data 
structure).  Plus using `struct` explicitly isn't the only way to declare a 
struct (and I'd say it's rarely done); just doing `a.b.c.d = 42` will declare 
`a` as a struct containing a struct containing a struct if `a` doesn't exist 
yet.  I think in practice I'd only ever use `struct` explicitly if I wanted to 
"reset" a struct variable to the empty struct.

-- 
Reply to this email directly or view it on GitHub:
https://github.com/geany/geany/pull/3358#issuecomment-1370169914