Having completed a first larger example using Marpa/Tcl [4], the next
example I started to work on has macros and include files and thus
forces me to look into the missing support for non-continuous input,
non-sequential moves in the input, additional input, and the like, and
of course, parse events.
Before starting on that I had a look at how Marpa::R2 (MR2 later on)
is doing things, for ideas and inspiration.
This mail here is my attempt to summarize what I learned, and verify
that my mental model is good enough.
# Input first
- Spans and ranges. Ok, not very complex, nothing to say.
- MR2 expects to have the full text to be parsed available.
=> Parsing files requires them to be either read or mapped into
memory somehow.
* No parsing of a stream (socket, pipe, and the like).
* Generally speaking, no incremental parsing.
- MR2 treats the input text (physical input stream = pIS) as
immutable, in terms of its span.
The proposed way of handling additional input [1] (like the content
of include files and other externals) is to essentially allocate a
much larger pIS than needed for the actual "natural input", so that
we have space after it in the pIS where all the dynamically added
text can then go.
This naturally requires some sort of a-priori estimation of the
maximum amount of new text which can appear, made before parsing
starts. (Something a bit fraught with peril, I suspect).
- MR2 further has a virtual input stream (vIS), essentially a span of
the pIS. It may start as the full pIS, or be a sub-string. When
handling parse-events `resume` can change this to an arbitrary span
too.
- The descriptions of `read` [2] and `resume` [3] mention
```
[...] is considered successful if it reaches the end of input string, ...
```
Is `end of input` here always `end of pIS`, or does it instead mean
`end of vIS` as set by `read` or the last call to `resume`?
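
To check my own understanding of the pIS/vIS split I wrote it down as a
toy model. This is a sketch in Python, not Perl or Tcl, and every name in
it (`InputStream`, `append_external`, `set_vis`, `vis_text`) is made up
for illustration; none of it is MR2's API. It assumes the pIS is sized
once, up front, with some reserved slack after the natural input.

```python
class InputStream:
    """Toy model: fixed-capacity physical input (pIS) plus a virtual span (vIS)."""

    def __init__(self, text, reserve):
        # Capacity is decided once: natural input plus the estimated slack.
        self.capacity = len(text) + reserve
        self.buf = text                 # the filled part of the pIS
        self.vis = (0, len(text))       # (start, length), initially everything

    def append_external(self, extra):
        """Place additional text (e.g. an include file) after what is there.

        Returns the (start, length) span the new text occupies.
        """
        if len(self.buf) + len(extra) > self.capacity:
            # This is the peril: the up-front estimate was too small.
            raise OverflowError("reserved space in the pIS is exhausted")
        start = len(self.buf)
        self.buf += extra
        return (start, len(extra))

    def set_vis(self, start, length):
        """Point the vIS at an arbitrary span of the filled pIS."""
        if start < 0 or length < 0 or start + length > len(self.buf):
            raise ValueError("span lies outside the filled part of the pIS")
        self.vis = (start, length)

    def vis_text(self):
        """Text currently visible through the vIS."""
        start, length = self.vis
        return self.buf[start:start + length]
```

In this model an include would be handled by calling `append_external`
with the file content and then `set_vis` on the returned span, which is
my reading of what `resume` with explicit arguments is meant to allow.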
# And parse events
- When speaking of lexeme vs non-lexeme events I suspect that the
latter are only about/for the G1 non-terminal symbols.
(In my mind the non-lexeme L0 symbols are also non-lexeme, strictly
speaking)
- While there is a lot of talk about event location, trigger
location, etc., practically speaking the user sees only
* current location, always, through pos().
* lexeme span for lexeme events, through `pause_span`.
* lexeme span for discard events, through the event descriptor.
And pre-lexeme events are the only case where the current location
is at the start of the lexeme, everything else has it set to the end
of the lexeme span.
- Looking at the set of methods for use when handling a lexeme event
(LE), i.e.
- lexeme_alternative
- lexeme_complete
- resume
I sort of get the model that when __no__ lexeme parse event triggers
the system (L0 engine) automatically runs the internal equivalent of
```
lexeme_alternative # for all accepted symbols
lexeme_complete
resume # after the current lexeme
```
to pass lexemes to the G1 engine, whereas with a LE in play the
responsibility for calling any of these simply passes to the user
instead, bypassing the above completely.
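
Spelled out as code, the model I have in mind looks roughly like the
sketch below. It is Python rather than Perl or Tcl, it mirrors only my
mental model and not MR2's actual internals, and all names
(`RecordingG1`, `scan_step`, the `handler` callback) are hypothetical.
The G1 engine is replaced by a stub which just records what it is handed.

```python
class RecordingG1:
    """Stand-in for the G1 engine; it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def lexeme_alternative(self, symbol, value):
        self.calls.append(("alternative", symbol, value))

    def lexeme_complete(self, start, length):
        self.calls.append(("complete", start, length))
        return start + length  # position just after the lexeme


def scan_step(g1, matches, start, length, text, event_symbols, handler):
    """One scanning step; `matches` are the symbols L0 found for the span.

    Returns the position at which scanning should resume.
    """
    triggered = [sym for sym in matches if sym in event_symbols]
    if triggered:
        # Event case: the automatic sequence is skipped entirely and the
        # user-supplied handler decides what is passed to G1, if anything.
        return handler(g1, triggered, start, length)
    # No event: do the automatic equivalent of alternative/complete/resume.
    value = text[start:start + length]
    for sym in matches:
        g1.lexeme_alternative(sym, value)
    return g1.lexeme_complete(start, length)
```

Without an event the stub sees one `alternative` per matched symbol
followed by a single `complete`; with an event it sees nothing unless the
handler chooses to call it.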
Predictions:
- Behavior: When multiple lexemes match at a span, and one of them
triggers a LE, the other lexemes will be lost.
For there is no accessor to get them all out of the recognizer
(`pause_lexeme` is specified to return something arbitrary from
the set, not the trigger lexeme, not all of them).
- Implementation: Triggering of LE is handled in the L0 engine and
its wrapper, likely by hooking into the completion-events for
collection, and filtering when we know that acceptable symbols
exist and their span.
- Implementation: The above mental model (and predicted
implementation) makes the statement "Lexeme SLIF parse events are
ignored during `lexeme_read`" a trivial thing.
We are not really "reading" a lexeme with `lexeme_read`.
We are pushing it to the G1 engine and are very much past the
point where the L0 engine wrapper collected and decided on LE
handling.
The other events can still happen because they are handled by the
G1 engine we are pushing to. The only exception is discard events;
they happen completely in the L0 wrapper code and are decided on
before/to the side of LE.
It is actually this split, with non-lexeme events likely handled by
the G1 engine and lexeme events by the L0 engine, which convinced me
that the `lexeme vs non-lexeme events` meant G1 non-terminals for the
latter.
Open question:
- When handling an LE, is it possible to not only specify
alternatives of a single lexeme, but also specify a __series__ of
lexemes to use before resuming (internal scanning)?
I currently suspect not. It would be doable with a lexeme side queue
which is used in place of the L0 engine whenever a lexeme is needed,
until it is empty, and then switching back.
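
A minimal sketch of such a side queue, again in Python and again with
purely hypothetical names (`LexemeSource`, `push_series`, `next_lexeme`);
nothing like this exists in MR2 as far as I know. The `scan` callable
stands in for the L0 engine, and I assume here that queued lexemes do not
move the input position.

```python
from collections import deque


class LexemeSource:
    """Hands out lexemes, preferring a side queue over the L0 scanner."""

    def __init__(self, scan):
        # scan: callable taking a position and returning
        # (symbol, value, next_position); stand-in for the L0 engine.
        self.scan = scan
        self.pending = deque()

    def push_series(self, lexemes):
        """Queue a series of (symbol, value) pairs, e.g. from an event handler."""
        self.pending.extend(lexemes)

    def next_lexeme(self, pos):
        """Return (symbol, value, next_position) for the next lexeme.

        Queued lexemes are served first and leave the position unchanged;
        once the queue is empty the scanner is consulted again.
        """
        if self.pending:
            symbol, value = self.pending.popleft()
            return (symbol, value, pos)
        return self.scan(pos)
```

An event handler would call `push_series` with the whole series and then
resume; the main loop drains the queue before it goes back to scanning.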
~~~
[1] http://search.cpan.org/~jkegl/Marpa-R2-4.000000/pod/Scanless/R.pod#External_lexemes_and_the_input_stream
[2] http://search.cpan.org/~jkegl/Marpa-R2-4.000000/pod/Scanless/R.pod#read()
[3] http://search.cpan.org/~jkegl/Marpa-R2-4.000000/pod/Scanless/R.pod#resume()
[4] JSON. Most of the work for it was not the grammar, but getting the
underlying Unicode processing correct.
--
See you,
Andreas Kupries <[email protected]>
<http://core.tcl.tk/akupries/>
Developer @ SUSE (MicroFocus Canada LLC)
<[email protected]>
EuroTcl 2018, Jul 7-8, Munich/DE, http://eurotcl.eu/
Tcl'2018, Oct 15-19, Houston, TX, USA. https://www.tcl.tk/community/tcl2018/
-------------------------------------------------------------------------------