BASIC is parsable, obviously; the interpreter has to do it. The
difference is that to parse BASIC you pretty much need a state machine,
working much the same way the actual interpreter works.

You can't really determine what any given digit or double-quote or almost
anything else is, except by starting from the beginning and keeping track
of what you have encountered, one byte at a time, all along the way.

You enter and exit modes or states serially, at least within a line, and
the rules are the weird rules of BASIC, not any generic rules. For
example, a string of text might be several different things packed
together, only recognized as separate things by recognizing that IFA is
not a keyword but IF is, and so A must be an argument; and none of that
holds unless it all appears in the main context, outside of quotes,
comments, or a DATA statement.

Or how a THEN or ELSE branch simply ends at the end of the line, which
looks no different from the end of any other line; there is no explicit
ENDIF or braces to mark it.
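That serial, stateful scan can be sketched roughly like this (a minimal
illustration in Python; the mode names are made up, and a real interpreter
tracks more states than these):

```python
# Byte-at-a-time BASIC scanning: the meaning of any character depends on
# the mode accumulated from everything scanned before it. Mode names
# (NORMAL, STRING, REMARK, DATA) are illustrative only.

def scan_line(line):
    """Yield (mode, char) for each character of one BASIC line."""
    mode = "NORMAL"
    for i, ch in enumerate(line):
        if mode == "NORMAL":
            if ch == '"':
                mode = "STRING"            # entering a string literal
            elif line.startswith("REM", i):
                mode = "REMARK"            # rest of the line is a comment
            elif line.startswith("DATA", i):
                mode = "DATA"              # rest of the line is data
        elif mode == "STRING" and ch == '"':
            yield (mode, ch)               # closing quote is still STRING
            mode = "NORMAL"
            continue
        yield (mode, ch)
```

A double quote inside a REM stays in REMARK mode; the very same byte
earlier in the line would have opened a string.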

bkw

On Fri, Mar 3, 2023, 3:47 AM B 9 <[email protected]> wrote:

> On Wed, Mar 1, 2023 at 11:00 PM John R. Hogerhuis <[email protected]>
> wrote:
>
>> Well, for ANSI C99, lex is probably the best way. C doesn't (didn't?)
>> have any regex engine. I don't know what's in modern C; it might have
>> libraries for regex.
>>
>
> I've used the PCRE library in C, which works but is not as nice as a
> language which is designed from the ground up to use regular expressions.
> While the modern languages that I know technically have regular
> expressions, most of them treat regexes as a weird string which can be used
> in a function call, no different than a library in C. Perl and Python are
> noteworthy here: Perl for being surprisingly good at integrating regular
> expressions into the design of the language and Python for failing quite
> badly when they should have known better. I don't know if there are any
> modern languages that put regular expressions at the heart in the way lex,
> awk, and sed do. I'd love to learn if anyone knows.
>
>> If someone is using Python, Perl, C#, Java, Javascript, etc... my
>> minimalist tendency would be to forgo the dependency on a lexer library
>>
>
> If you are speaking of how the compiled executable can literally depend
> upon the lexer library to run, it turns out that is not necessary. But,
> yes, I do see your broader point of wanting to write directly in a language
> instead of using a meta-language.
>
>
>> and just use the regex language feature to do the work, since you can
>> create one big expression with a named line number token, followed by a
>> series of BASIC lexemes as one big list of "|" alternatives.
>>
>
> If the language's regex features made doing so easy, I'd be all for it.
> Unfortunately, as far as I know, they don't. Consider "mini-scanners" where
> part of the input is syntactically different from the rest. For example,
> how do you handle REM or double quotes? It is totally possible to do it in
> a giant regex, but not by me. I mean, I could probably come up with
> something that seems to work, but I'm not sure I have enough skill (or
> patience) to do it right. I would end up kludging it with extra code that
> might work, but certainly would offend anyone's sense of minimalism.
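To make the comparison concrete, here is roughly what that "one giant
regex plus a bit of extra code" approach might look like in Python. The
lexeme list and group names are invented for illustration; this is
nowhere near a complete BASIC grammar:

```python
import re

# One big alternation of lexemes. REM and quoted strings act as
# "mini-scanners" only because their alternatives swallow the rest of
# the line (or the quoted span) in a single match.
LEXEME = re.compile(r"""
    (?P<remark>  REM[^\n]*       )   # REM swallows everything to end of line
  | (?P<string>  "[^"\n]*"?      )   # quoted string, possibly unterminated
  | (?P<keyword> IF|THEN|ELSE|PRINT|GOTO|DATA )
  | (?P<number>  [0-9]+          )
  | (?P<name>    [A-Z][A-Z0-9]*  )
  | (?P<op>      [=+\-*/<>;:,()] )
  | (?P<ws>      [ \t]+          )
""", re.VERBOSE)

def tokens(line):
    """Tokenize one BASIC line by repeatedly matching the alternation."""
    pos, out = 0, []
    while pos < len(line):
        m = LEXEME.match(line, pos)
        if not m:
            raise SyntaxError(f"cannot lex at column {pos}: {line[pos:]!r}")
        if m.lastgroup != "ws":
            out.append((m.lastgroup, m.group()))
        pos = m.end()
    return out
```

Because Python alternation is first-match rather than longest-match,
putting keywords before names makes IFA lex as IF then A, which happens
to be what BASIC wants here; getting such ordering right in general is
exactly the fragile part.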
>
> With lex, mini-scanners are trivial. For example, here is the entirety of
> the code needed to handle REM in my tokenizer:
>
> %x remark       /* remark is an exclusive start condition */
> REM           putchar(142);   BEGIN(remark);
> <*>\n         putchar('\0');  BEGIN(INITIAL);
>
> I simply defined an exclusive start condition named <remark> and have the
> REM statement enter into it. Lex automatically copies verbatim any text
> that doesn't match any rules. The only rule that matches <remark> is
> newline, which returns the scanner to the normal start condition. Double
> quotes are just as easy.
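For anyone following along without lex handy, the same two rules can be
mimicked with an explicit state variable. This hypothetical Python
version mirrors the snippet above (byte 142 as the REM token, NUL at end
of line):

```python
def tokenize_rem(text):
    """Mimic the two lex rules: REM emits byte 142 and enters 'remark';
    newline emits NUL and returns to INITIAL; all other text is copied
    verbatim, as lex does for unmatched input."""
    out, state, i = bytearray(), "INITIAL", 0
    while i < len(text):
        if state == "INITIAL" and text.startswith("REM", i):
            out.append(142)                   # putchar(142); BEGIN(remark);
            state = "remark"
            i += 3
        elif text[i] == "\n":
            out.append(0)                     # putchar('\0'); BEGIN(INITIAL);
            state = "INITIAL"
            i += 1
        else:
            out += text[i].encode("latin-1")  # default rule: copy verbatim
            i += 1
    return bytes(out)
```

Note that once in the remark state, a further "REM" is just copied,
matching the lex version where only the newline rule is active in
<remark>.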
>
>
>
>> For a given line you need to lex the line number first. That can be its
>> own regex.
>> Then you need to lex BASIC tokens one after another. Pretty much that's
>> just one giant regex of alternative lexemes.
>>
>> As you extract them you may enter modes for handling expressions, strings,
>> REM content, depending on what you're doing (tokenizer, syntax checker,
>> pretty printer, renumber, etc.). That's your own code and variables; lex
>> doesn't really help with that anyway, since it's not a parser.
>>
>
> You are correct about lex not handling expressions. The syntax for REM and
> strings in BASIC is not recursive and doesn't need a parser. That makes lex
> perfect as a BASIC tokenizer, but not so great for any of the other
> examples you listed, at least not on its own. While my tokenizer can
> ~kinda~ pack the .BA file by removing comments and whitespace, the sort of
> optimizations Brian is talking about (merging lines) or that I'm
> considering (removing lines that only contain a remark) are beyond it.
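As a sketch of why that is a separate pass: spotting a remark-only line
is easy line-by-line, but the hard part, which this hypothetical snippet
deliberately ignores, is that a removed line may be a GOTO or GOSUB
target elsewhere in the program:

```python
import re

# Matches lines whose only statement is a remark, e.g. "10 REM setup",
# or the apostrophe shorthand some BASICs allow. Illustrative sketch: it
# does NOT check whether other lines GOTO/GOSUB the removed line number,
# which is exactly what makes this more than a lexing problem.
REMARK_ONLY = re.compile(r"^\s*\d+\s*(REM\b|').*$", re.IGNORECASE)

def strip_remark_lines(program_lines):
    """Return the program with remark-only lines dropped."""
    return [ln for ln in program_lines if not REMARK_ONLY.match(ln)]
```

The \b after REM keeps a line like "10 REMARKABLE=1" from being
misclassified as a comment.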
>
>
> Now if you're into trying a proper parser/context free grammar then I can
>> see using lex+yacc or equivalents, even in a higher level language.
>>
>
> I'm not at that level, yet. Maybe someday.
>
>
>
>> That said I don't know how parseable BASIC is. It's not a structured
>> language. It may be best suited to ad hoc methods.
>>
>
> You know a heckuva lot more than I do, John. I had just presumed any
> computer language was parsable.
>
> —b9
>
