Having completed a first larger example using Marpa/Tcl [4], I started
work on the next example, which has macros and include files and thus
forces me to look into the missing support for non-continuous input,
non-sequential moves in the input, additional input, and the like, and
of course, parse events.

Before starting on that I had a look at how Marpa::R2 (MR2 later on)
is doing things, for ideas and inspiration.

This mail here is my attempt to summarize what I learned, and verify
that my mental model is good enough.

# Input first

- Spans and ranges. Ok, not very complex, nothing to say.

- MR2 expects to have the full text to be parsed available.

  => Parsing files requires them to be either read or mapped into
     memory somehow.

  * No parsing of a stream (socket, pipe, and the like).

  * Generally speaking, no incremental parsing.

- MR2 treats the input text (physical input stream = pIS) as
  immutable, in terms of its span.

  The proposed way of handling additional input [1] (like the content
  of include files and other externals) is to essentially allocate a
  much larger pIS than needed for the actual "natural input", so that
  we have space after it in the pIS where all the dynamically added
  things can then go.

  This naturally requires some sort of a-priori estimation of the max
  amount of new text which can appear, made before parsing starts.
  (Something a bit fraught with peril, I suspect.)

- MR2 further has a virtual input stream (vIS), essentially a span of
  the pIS. It may start as the full pIS, or be a sub-string. When
  handling parse events, `resume` can change this to an arbitrary span
  too.

- The descriptions of `read` [2] and `resume` [3] mention

  ```
  [...] is considered successful if it reaches the end of input string, ...
  ```

  Is `end of input` here always `end of pIS`, or does it instead mean
  `end of vIS` as set by `read` or the last call to `resume`?
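
The pIS/vIS model described above can be sketched as a small toy in
Python (chosen only to keep the example self-contained). Nothing here
is MR2 API: `Span`, `PhysicalInput`, `reserve`, and `append` are names
made up for illustration, and the fixed capacity stands in for the
a-priori estimate of additional input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A (start, length) range into the physical input (hypothetical type)."""
    start: int
    length: int

    @property
    def end(self) -> int:
        return self.start + self.length


class PhysicalInput:
    """Toy pIS: fixed total span, with reserved room after the natural input."""

    def __init__(self, text: str, reserve: int) -> None:
        self.capacity = len(text) + reserve   # fixed once parsing starts
        self.buffer = text                    # filled part of the pIS
        self.natural = Span(0, len(text))     # the "natural input"

    def append(self, extra: str) -> Span:
        """Place additional input (e.g. an include file) in the reserved area."""
        if len(self.buffer) + len(extra) > self.capacity:
            raise OverflowError("a-priori estimate of extra input was too small")
        span = Span(len(self.buffer), len(extra))
        self.buffer += extra
        return span

    def text(self, span: Span) -> str:
        return self.buffer[span.start:span.end]


pis = PhysicalInput("a #include b", reserve=8)
inc = pis.append("x y z")   # include content lands after the natural input
vis = inc                   # a vIS is just a span of the pIS
```

In this toy the vIS ends at offset 17 while the pIS was sized for 20,
so "end of vIS" and "end of pIS" are different positions. Which of the
two counts as `end of input` is exactly the question asked above.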

# And parse events

- When speaking of lexeme vs non-lexeme events I suspect that the
  latter are only about/for the G1 non-terminal symbols.

  (In my mind the non-lexeme L0 symbols are also non-lexeme, strictly
  speaking)

- While there is a lot of talk about event location, and trigger
  location, etc., practically speaking the user sees only

  * current location, always, through pos().
  * lexeme span for lexeme events, through `pause_span`.
  * lexeme span for discard events, through the event descriptor.

  And pre-lexeme events are the only case where the current location
  is at the start of the lexeme, everything else has it set to the end
  of the lexeme span.

- Looking at the set of methods for use when handling a lexeme event
  (LE), i.e.
  
  - lexeme_alternative
  - lexeme_complete
  - resume

  I sort of get the model that when __no__ lexeme parse event triggers,
  the system (L0 engine) automatically runs the internal equivalent of

  ```
        lexeme_alternative # for all accepted symbols
        lexeme_complete
        resume             # after the current lexeme
  ```

  to pass lexemes to the G1 engine, whereas with a LE in play the
  responsibility for calling any of these simply passes to the user
  instead, bypassing the above completely.

  Predictions:

  - Behavior: When multiple lexemes match at a span, and one of them
    triggers a LE, the other lexemes will be lost.

    For there is no accessor to get them all out of the recognizer
    (`pause_lexeme` is specified to return something arbitrary from
    the set, not the trigger lexeme, not all of them).

  - Implementation: Triggering of LE is handled in the L0 engine and
    its wrapper, likely by hooking into the completion-events for
    collection, and filtering when we know that acceptable symbols
    exist and their span.

  - Implementation: The above mental model (and predicted
    implementation) makes the statement "Lexeme SLIF parse events are
    ignored during `lexeme_read`" a trivial thing.

    We are not really "reading" a lexeme with `lexeme_read`.

    We are pushing it to the G1 engine and are very much past the
    point where the L0 engine wrapper collected and decided on LE
    handling.

    The other events can still happen because they are handled by the
    G1 engine we are pushing to. The only exception is Discard events;
    they happen completely in the L0 wrapper code and are decided on
    before/to the side of LE.

    It is actually this split, of non-lexeme events likely handled by
    the G1 engine vs lexeme events by the L0 engine, which convinced me
    that `lexeme vs non-lexeme events` meant G1 non-terminals for the
    latter.

  Open question:

  - When handling an LE, is it possible to not only specify
    alternatives of a single lexeme, but also specify a __series__ of
    lexemes to use before resuming (internal scanning)?

    I currently suspect not. Would be do-able with a lexeme side queue
    which is used over the L0 engine when a lexeme is needed, until
    empty, and then switching back.
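
The lexeme side queue from the open question can be sketched as
follows, again as a Python toy. This is the hypothetical design, not
something MR2 is known to provide: `LexemeSource`, `push`, and `next`
are invented names, and the `scan` callback merely stands in for the
L0 engine.

```python
from collections import deque
from typing import Callable, Deque, Optional, Tuple

Lexeme = Tuple[str, str]  # (symbol name, value)


class LexemeSource:
    """Toy lexeme source: a side queue takes priority over the scanner."""

    def __init__(self, scan: Callable[[], Optional[Lexeme]]) -> None:
        self._scan = scan                     # stands in for the L0 engine
        self._queue: Deque[Lexeme] = deque()  # lexemes pushed by an LE handler

    def push(self, *lexemes: Lexeme) -> None:
        """Queue a series of lexemes to deliver before scanning resumes."""
        self._queue.extend(lexemes)

    def next(self) -> Optional[Lexeme]:
        """Serve queued lexemes until empty, then switch back to scanning."""
        if self._queue:
            return self._queue.popleft()
        return self._scan()


scanned = iter([("word", "a"), ("word", "b")])
source = LexemeSource(lambda: next(scanned, None))

out = [source.next()]                          # comes from the scanner
source.push(("number", "1"), ("number", "2"))  # handler queues a series
out += [source.next() for _ in range(4)]       # queue first, then scanner
```

The consumer (the G1 side in this picture) only ever calls `next`, so
it does not need to know whether a lexeme came from the queue or from
internal scanning.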


~~~
[1] http://search.cpan.org/~jkegl/Marpa-R2-4.000000/pod/Scanless/R.pod#External_lexemes_and_the_input_stream
[2] http://search.cpan.org/~jkegl/Marpa-R2-4.000000/pod/Scanless/R.pod#read()
[3] http://search.cpan.org/~jkegl/Marpa-R2-4.000000/pod/Scanless/R.pod#resume()

[4] JSON. Most of the work for it was not the grammar, but getting the
    underlying Unicode processing correct.

-- 
See you,
        Andreas Kupries <[email protected]>
                        <http://core.tcl.tk/akupries/>
        Developer @     SUSE (MicroFocus Canada LLC)
                        <[email protected]>

EuroTcl 2018, Jul 7-8, Munich/DE, http://eurotcl.eu/
Tcl'2018, Oct 15-19, Houston, TX, USA. https://www.tcl.tk/community/tcl2018/
-------------------------------------------------------------------------------