> Perhaps the sscanf method can find strings and comments first and not look
> for other stuff inside them?
The regexes defined in upstream ctags are meant to match from the beginning
of the line (note the `^` anchor at the start), so they're not going to
accidentally match things inside an end-of-line comment or a string.
If you want to ignore definitions inside _block comments_ (and I considered
modifying my current PR or creating a new one for that) then the good news is
that that's relatively easy to do, since a block comment is always going to be
delimited by lines containing **only** `%{` and `%}` (possibly with some
whitespace) and nothing else, so there's no risk of accidentally interpreting
the content of a string or comment as a block comment delimiter. The only
difficulty in this regard is that block comments can be nested (think of C's
`#if 0 ... #endif`), but that's easy to solve with a counter.
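For what it's worth, the counter idea could be sketched roughly like this in C (the helper names are mine, not anything from the ctags API, and I'm assuming the line has already been read without its newline):

```c
#include <ctype.h>
#include <string.h>

/* Return non-zero if `line`, ignoring surrounding whitespace, consists
 * solely of the token `tok` ("%{" or "%}") and nothing else. */
static int is_only_token(const char *line, const char *tok)
{
    size_t n = strlen(tok);
    while (isspace((unsigned char)*line)) line++;
    if (strncmp(line, tok, n) != 0) return 0;
    line += n;
    while (isspace((unsigned char)*line)) line++;
    return *line == '\0';
}

/* Track the nesting depth of %{ ... %} block comments.  Call once per
 * line with the previous depth; a line whose returned depth is > 0
 * (or that is itself a closing %}) is inside a block comment and can
 * be skipped by the parser. */
static int block_comment_depth(const char *line, int depth)
{
    if (is_only_token(line, "%{")) return depth + 1;
    if (is_only_token(line, "%}")) return depth > 0 ? depth - 1 : 0;
    return depth;
}
```

Since the delimiters must be alone on their line, a line like `x = 1; %{ foo` correctly leaves the depth untouched.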
As for end-of-line comments, those always start with `%` or `...` (excluding
those in a string, but you won't find strings in function definitions), so they
are easy to avoid. And strings don't span multiple lines. So I think that
other than the rare `%{`...`%}` case, it should be straightforward.
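Stripping those end-of-line comments could then be as simple as the following sketch, which relies on the assumption above that the lines we care about contain no string literals (otherwise a `%` inside a string would be mistaken for a comment):

```c
#include <string.h>

/* Truncate `line` at the first end-of-line comment marker: either
 * '%' or the "..." continuation token.  Assumes the line contains no
 * string literals, which holds for function-definition lines. */
static void strip_eol_comment(char *line)
{
    char *p = strchr(line, '%');
    char *q = strstr(line, "...");
    if (q && (!p || q < p)) p = q;   /* take whichever marker comes first */
    if (p) *p = '\0';
}
```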
> Don't know how fast sscanf is, it's possible it won't be any faster than
> regexes since it's still **scanning the input more than once.** That's what
> is slow for regex parsers, the fact that multiple regexes are applied to the
> same input, not the fact that a well-optimised regex library like PCRE is
> slow. Repeat scans are one thing well-written character-by-character parsers
> try not to do.
My understanding is that the scanf family of functions only needs to parse one
character at a time, and **never backtrack or look ahead** more than one
character, so they don't need to perform multiple passes on the input. For
example, if you do `scanf("%d", &i)`, that's going to read character by
character from stdin, and as soon as it finds a character that doesn't match an
integer (it's not a digit or a leading whitespace/sign), it'll put that one
character back into stdin (see `ungetc()`) and return. Think of it as a
[possessive regex](https://www.regular-expressions.info/possessive.html).
Regular expressions, on the other hand, have to try to match different
combinations until one of them works (unless they're
[possessive](https://www.regular-expressions.info/possessive.html)), hence the
need for backtracking and multiple passes and inefficiencies.
For example, when matching the string `abcde123` against the regex
`([a-z]+)([aeiou]+)`, the match will succeed, because the first capture group
backtracks to `abcd` (even though it could match all of `abcde`) and the second
group matches `e`. But the scanf pattern `%[a-z]%[aeiou]` will fail, because the first
specifier will just eat all the letters and not leave any for the second.
So in other words, `sscanf()` is a glorified char-by-char parser that only
needs to look ahead one character at most. And it's also probably very well
optimized, so it might be even faster than writing the parsing state machine
yourself. Maybe we could just write some tests and time them.
> The writers of the Python parser took the "every assignment is a declaration"
> approach and the Julia parser writers took the approach "no assignment is a
> declaration". So Python is a precedent for having all the names available,
> and I haven't seen too many complaints about it.
I had never noticed this, and it feels kinda wrong that the same variable can
be "declared" in multiple places, but then again, Python programs are usually a
bunch of functions/classes with maybe a few "file-scope" variables, and the
parser ignores assignments performed inside a function (which are local to the
function). So for a typical Python file structure, it makes sense to assume
that every assignment done outside of a function is some sort of "global"
variable. But one could also make a Python script where most/all the code is
outside of a function and is executed directly (this would look a bit ugly in
Geany because of all the variables, but that's the price to pay if we want a
"normal" module-like Python file to look good).
Similarly, Matlab files can be of two types: either "scripts" where all the
code inside is executed or "functions" containing one or more function
definitions. Maybe we can just disable variable assignment detection when
inside a function (i.e., when a line defining a function has been scanned
before), and then we'd have Python's behavior.
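To make the idea concrete, here's a sketch of that rule with entirely invented names (`matlab_state`, `want_variable_tag` are not the real ctags parser API, and the `function` check is a crude prefix test for illustration):

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical per-file parser state: once a function definition has
 * been seen, later assignments are locals and should get no tag. */
struct matlab_state {
    bool in_function;
};

/* Decide whether an assignment on `line` should produce a variable
 * tag, per the proposal: only before the first function definition. */
static bool want_variable_tag(struct matlab_state *st, const char *line)
{
    if (strncmp(line, "function", 8) == 0) {
        st->in_function = true;   /* everything below is function body */
        return false;
    }
    return !st->in_function;
}
```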
However, I'd argue that in the case of Matlab files, there's nothing similar to
C's "file-scope variables" since you'll never mix function definitions and
variable declarations, so maybe it's a bit pointless to parse variables. But
I'm OK with it as long as the ones in functions are excluded.
> > Honestly I think I'd entirely ditch parsing structs;
>
> Don't know matlab enough to comment, but if no Matlabbers object I guess its
> ok if upstream doesn't do it.
I have some experience with Matlab and I'd say structs aren't used that often,
or at least I don't use them often (and they're far from the only type of data
structure). Plus using `struct` explicitly isn't the only way to declare a
struct (and I'd say it's rarely done); just doing `a.b.c.d = 42` will declare
`a` as a struct containing a struct containing a struct if `a` doesn't exist
yet. I think in practice I'd only ever use `struct` explicitly if I wanted to
"reset" a struct variable to the empty struct.
https://github.com/geany/geany/pull/3358#issuecomment-1370169914