[langsec-discuss] [Erik Poll] Re: state machines vs regular expressions

Sven M . Hallberg Tue, 05 Apr 2016 05:23:18 -0700

Hi list,

I'd like to forward a discussion that arose between me and Erik Poll
(Cc'ed because he is not on this list) who presented* "session
languages" for last year's LangSec workshop.


This came from the way I implemented DNP3's transport layer [1] which is
like a much simpler form of TCP. I express the valid sequences of
messages as a regular language over a well-defined alphabet that is
generated by input events. This can be described by a regular expression
or a CFG and an implementation as an actual Hammer parser follows
naturally.

Basically, I think it's interesting to consider such simple
representations as an option for formalization and decoupling at the
protocol level.

Further comments welcome.

-Sven

PS: Attached message is the third in a sequence; start reading the
full-quote at the end.

* with Joeri de Ruiter and Aleksy Schubert

[1]: https://github.com/pesco/dnp3/blob/master/src/dissect.c

--- Begin Message ---

Hi Sven,

You raise two excellent points:
- what the "accepting states" are
- if & how to specify error handling

I think these two (related) issues explain some differences
between session languages and input 'normal' languages, and why
for session languages FSMs are somehow nicer than regular
expressions, even though they are of course equally expressive.

As a (better) example, consider

    A;B;(B|C)

either 1) as normal input language or 2) as session language

1) If this is our input language, then we simply receive one
    input message, which should be ABB or ABC.

    If the input is ABB or ABC, then we go on and process the input.
    If not, eg. if we receive ABA, then we discard the input, and
    that is the end of the story.

    Whether we use an FSM or RE to specify this language makes no
    difference whatsoever, as far as I can see.

2) Now suppose that A;B;(B|C) is a session language, where A, B
    and C are input messages we can receive.

    (There could be outputs after each of these inputs. Eg
     the whole conversation could be
      A; ok; B; ok; (B|C); done
     for some outputs ok and done, but we'll ignore that for now.)

    Handling incorrect sequences is trickier now:

    Firstly, if we now get A;B;A as input, then by the time we
    notice something is wrong - when we receive the second A - we
    will already have received and possibly processed the first
    two incoming messages. (We may have sent some responses, and
    at least we must have recorded the fact that we received A and
    then B.)  We may have to undo some of the processing of these
    first two inputs.

    Secondly, we have to be resilient and somehow cope with ANY
    input sequence. One could say the session language we have to
    cope with is effectively always
      (A|B|C)*
    where of course some input sequences should result in different
    behaviour than others.  Let's call the handling of incorrect
    sequences error handling, just to give it a name.

    In a state machine this error handling is easy to specify. Eg
    we could specify that we go to some error state, state 5 in
    the diagram below, after any deviation from what is normal:

           A       B       B|C
       1 -----> 2 ----> 3 -----> 4
       |        |       |
       |(B|C)   |(A|C)  | A
       |        |       V
       ---------------> 5

    Alternatively, we can apply Postel's Principle and simply
    ignore unexpected messages, which will result in little self-loops
    at each state.

    We could also say what happens after we get to state 4 and 5,
    but that is maybe not relevant/interesting for a given
    protocol. Or sessions may be infinite, as you can keep
    repeating them, which could be achieved by adding arrows
    labelled A back from state 4 and 5 to state 2.

    Note that the FSM written as a regular expression then simply
    ends up being (A|B|C)*, which is not so informative if we can
    not different states as we can in the FSM.

    Using a grammar makes more sense, because you can introduce
    non-terminals to capture the interesting notions, eg
      Sessions -> Session*
      Session  -> NormalSession | WrongSession
      ...

Some observations:

- For a normal input language, in case 1), we only have to make a
   binary decision about whether an input is legal or not.

   For a session language the decision is often more complicated
   than a binary yes or no.  Different forms of "no" could be
    - no, but I'll start a new conversation
    - no, but I'll simply ignore that input and continue our
      conversation
    - no, and I'll stop talking altogether

   In other words, error handling for an normal input language is
   simpler than for a session language.

- All this is related to your comment about what the accepting
   states are, and what the error or stuck states mean.

   State 4 above is probably an accepting state. But the session
   language is by definition prefix-closed: any prefix of a legal
   conversation is a legal conversation. So 1,2,3 are also
   accepting states, in a way. And state 5 above is an accepting
   state, but for the language of incorrect sessions

- If we specify outputs as well as the inputs, then we
   can describe error handling using regular expressions.
   Eg for the state machine above, this could be

      A; ok; B; ok; (B|C); done
       union  (B|C);error
       union  A;ok;(A|C);error
       union  ...

   But it is not so pretty. Maybe this also explains why your
   parser gets ugly when you include errors?

- In our models of session languages so far, we usually do
   include outputs (incl. error responses) as well as inputs.
   Here we use Mealy machines, ie FSMs where transitions are
   labelled with a pair i/o of an input and a resulting output.

   Eg, for the example above, a Mealy machine could be

             A/ok         B/ok         B|C / done
       1 ---------> 2 ----------> 3 --------------> 4
       |            |             |
       |(B|C)/error |(A|C)/error  | A/error
       |            |             V
       -------------------------> 5

   Now we could identify state 4 and 5, and still be able to
   easily distinguish right traces from error traces, as the
   former end with the output "done".

All this seems related to your comment that

> differentiating between different types of parse errors is one
> of the ugliest parts of the DNP3 parser. It is required by the spec; I
> would have much rather left it out in the way you describe.

Also, you say that

> Having a parser return information (in the theoretic sense of the word)
> about errors, i.e. distinguishing between them, invites semantics
> (meaning) where LangSec argues there should be none: on invalid input.
> In my view, error inspection should be as decoupled from input
> validation as business logic, possibly even more so.

but in our FSMs for session languages, we typically _do_ describe
of the error handling, and I would argue that this is still
syntactic, not semantic.

Of course, some errors cannot be expressed on the syntactical
level, but can only be detected once you semantically proces a
message after parsing. But some errors can be caught at the level
of syntax, and I would then also like to describe the resulting
error messages syntactically.

One deeper question I still see: back in the 70s and 80s
formalisms like CSP, CCS or other process calculi were invented
to describe interaction patterns. These might be suitable than
regular expressions, state machines, or grammars to specify the
interaction/session aspects of a protocol.  Back in the 90s some
of my colleagues were working on LOTOS as formal specification
language for protocols, which even became an ISO standard.  I
think LOTOS effectively died, but maybe there are some lessons we
can still draw from these efforts...

Cheers,
Erik




On Wed, 23 Mar 2016, Sven M. Hallberg wrote:

> Hi Erik,
>
> thanks for the good discussion points! I'll reply in-line below...
> BTW, do you mind if I forward this to the langsec-discuss list and we
> continue there?
>
>
> Erik Poll <erikp...@cs.ru.nl> writes:
>> Suppose a normal protocol run is
>>
>>     A ;  B+ ; C
>>
>> or as state machine
>>                       _ B
>>                      | |
>>         A        B   v /   C
>>     1 -----> 2 -----> 3 ------> 4
>>
>> [...]
>>
>> In the state diagram we can for example specify that receiving
>> message A in state 3 should
>> 1) be ignored, by adding a loop from 3 to 3 labelled A
>> 2) result in aborting the session, by adding a transition
>>     labelled A from state 3 to some error state
>> 3) be treated as the first message of a new session,
>>     by adding a transition labelled A from state 3 to state 1
>>
>> In a regular expression you cannot really add such info.
>
> I think that's not strictly true. Let's take your examples above:
>
> Case 1 corresponds to: A B [AB]* C
> Case 3 corresponds to: (A B+)+ C
>
> But of course we already know that there is a regular expression for
> every FSM. It is a question whether one would consider the above useful
> representations of a protocol. I feel it's worth exploring (it could be
> useful in some cases and less so in others).
>
> As for case 2, what is the meaning of entering the error state? Is it an
> accepting state? Probably not. Is it a "stuck" state? In that case I
> would argue that its purpose is served by having the machine abort upon
> receiving the token "A" in state 3. Note that a parser can tell you
> those bits of information so you can distinguish errors. Also see my
> next comment.
>
>
>> Note that when you are specifying the syntax of an individual
>> message, then the issue of trying to describe how to react to
>> incorrect messsage does not arise. If an individual input message
>> should be of the form (A;B+;C), then it should simply rejected if
>> it has any other format, and using a FSM does not have any added
>> value.  This might be a way where session languages are different
>> from input message languages.
>
> Actually, differentiating between different types of parse errors is one
> of the ugliest parts of the DNP3 parser. It is required by the spec; I
> would have much rather left it out in the way you describe. Let me
> elaborate a bit...
>
> Having a parser return information (in the theoretic sense of the word)
> about errors, i.e. distinguishing between them, invites semantics
> (meaning) where LangSec argues there should be none: on invalid input.
> In my view, error inspection should be as decoupled from input
> validation as business logic, possibly even more so. At the least it
> makes parsers easier to write (= languages easier to specify) but we
> might well imagine parser differentials, weird machines or other
> security issues arising from such extra semantics.
>
>
>> A completely unrelated question about Hammer and your work on
>> DNP3. Now your DNP3 definition is written in C++ using these
>> parser combinators, right. Is there - or could there easily be -
>> a definition in some programming language independent notation?
>> Eg some EBNF-like syntax, where EBNF is maybe extended with a few
>> primitives?
>
> There is no external definition and any such language would have to be
> extended heavily to capture the full parser. It is our view that the
> Hammer API forms an EDSL that lets us represent "the best we can do" to
> specify DNP3 in a grammatical fashion.
>
> I'm working on extracting the context-free bits of DNP3 as CFGs, which
> should find their way into a paper in the future. That will include a
> discussion of how they tie together.
>
>
>> If someone wants to implement a DNP3 parser in say Haskell, they
>> might not want to read C++ code but rather have a more abstract
>> description.  Ideally, I'd like to have the grammar is some
>> programming language-independent format, from which you can then
>> generate code in any language.
>>
>> (But maybe the whole point of
>> Hammer is that to move to a concrete programming language to get
>> programmers to use it, and deliberately avoids such a separate
>> notation?)
>
> You are right in this latter part. Moving language description into the
> target programming language is a design goal of Hammer meant to promote
> its use. If my interpretation is correct, it assumes that external
> parser generators are more burden than relief and that ease and
> equivalence of implementations is better served by pushing for simpler
> languages rather than unified specification.
>
>
> -Sven
>

--- End Message ---

_______________________________________________
langsec-discuss mailing list
langsec-discuss@mail.langsec.org
https://mail.langsec.org/cgi-bin/mailman/listinfo/langsec-discuss

[langsec-discuss] [Erik Poll] Re: state machines vs regular expressions

Reply via email to