Re: Status of work on separators and separator suppression

Mike Beckerle Fri, 20 Jul 2018 09:23:10 -0700

On further study of the DFDL spec, section 14.2.2.2 makes it clear that 
trailingEmptyStrict tolerates non-trailing empty elements, and just absorbs the 
separators for them, so the behavior is not as I described.

So perhaps the right behavior by default is "anyEmpty". I still need to confirm 
these behaviors. The spec is not clear in this area at all, and needs 
corrections/errata.

________________________________
From: Mike Beckerle
Sent: Friday, July 20, 2018 7:50:32 AM
To: [email protected]
Subject: FYI: Status of work on separators and separator suppression

I have been at this problem of separators and separator suppression behavior in 
Daffodil for around 3 full weeks of my time now, so I wanted to explain the 
status of the work.

The prior code-review pull-request will not be updated. I will incorporate the 
suggestions from it that are still relevant, but changes are very extensive 
since then, so I will create entirely new code review PRs when the code is 
ready for another review. My branch is daffodil-1919-separators now.

Currently only 10 daffodil-test unit tests are failing, though I am running 
them in a mode where no round-trips are attempted, and unparser tests are  not 
being run at all. Basically I haven't gotten to unparsing (other than that the 
scala code compiles), and won't until parsing passes all tests. Unparsing is 
fundamentally simpler but quite symmetric to parsing in this area of the code. 
All the complexity of separators and separator suppression takes place in 
parsing.

The rest of this email details what all has been /is being done. Some variation 
of this material will go into the commit message eventually, some should become 
block comments in the code, and perhaps a wiki page.  I'm not expecting 
detailed commentary on the below, though comments are welcome. I just wanted to 
write up a summary of what all has been taking so long, and why.

Now I will get back to finishing what is described here, fixing the 10 parser 
bugs, etc.

-----------------------

This is a large set of changes motivated by several things:

The primary motivation is DAFFODIL-1919 - separators, and trailing suppression

A secondary agenda is the need to eventually improve compilation speed and 
space. This a secondary, but ongoing effort. When opportunities arise to move 
things in the right direction here without major disruption, or in conjunction 
with an otherwise needed big change, then they are worth taking.

Major Changes:

Refactored the whole way repeating elements and optional elements work. Old way 
was that sequence combinators were trivial, and elements and model-groups had 
complex parsers that dealt with the complexities of things like whether or not 
to put in an infix separator. This was too decentralized to handle the full 
complexity of separators with dfdl:separatorSuppressionPolicy.

The new organization has sequence parsers that have the sophistication in them. 
There are these new grammar combinators that correspond to a sequence group of 
the DFDL schema:

    OrderedSeparatedSequence

    OrderedUnseparatedSequence

    ChoiceBranchImpliedSequence - A variation on OrderedUnseparatedSequence.

(Note: Some of these classes/traits may get renamed based on code-review, etc.)

The DSOM base class SeqTermBase was refactored to allow this implied sequence 
(when a repeating element is a child of a choice, it's occurrences are in a 
sequence that cannot have separators.

Eventually there will need to be unordered versions as well, but unordered is 
not yet implemented.

These grammar combinators work in conjunction with individual "wrapper 
combinators" (combinators of exactly one child term) derived from 
SequenceChild. This base class represents a single declared child of a 
sequence. The SequenceChild objects gather information about the Term, such as 
whether it is scalar or repeating, has separators, is ordered, has type string 
or hexBinary (which require special treatment for separator suppression 
situations), simple or complex type, is potentially trailing, etc. (There are 
other code organization benefits of the SequenceChild organization mentioned 
below under General Improvements.)

The new grammar combinators for sequences each create a Parser and an Unparser, 
so:

    OrderedSeparatedSequenceParser

    OrderedSeparatedSequenceUnparser

    OrderedUnseparatedSequenceParser

    OrderedUnseparatedSequenceUnparser

The ChoiceBranchImpliedSequence combinator doesn't have its own 
parser/unparser. It creates an OrderedUnseparatedSequenceParser by way of it 
being a kind of ordered sequence that has no separator. There are unparser 
equivalents to these.

These new parsers centralize the handling of points of uncertainty or PoUs for 
optional and repeating elements.

These parsers work in conjunction with wrapper parser combinators (parser 
combinators that orchestrate exactly one child parser) which are derived from 
SequenceChildParser (and for unparsing, which is very symmetric to parsing in 
this area of Daffodil) SequenceChildUnparser.

There is a complex of mixins and base classes for creating SequenceChildParser 
classes, including:

SequenceChildParser (base)

RepeatingChildParser (base)

OccursCountExactParser (base)

OccursCountExpressionParser (base)

OccursCountMinMaxParser (base)

>From these, specific mixins and base classes for the separated and unseparated 
>sequence cases include:

Separated (mixin trait)

Unseparated (mixin trait)

RepeatingSeparatedPoU (mixin trait)

RepeatingUnseparatedPoU (mixin trait)

ScalarOrderedSeparatedSequenceChildParser (concrete)

RepOrderedExactlyNSeparatedSequenceChildParser (concrete)

RepOrderedExactlyTotalOccursCountSeparatedSequenceChildParser (concrete)

RepOrderedWithMinMaxSeparatedSequenceChildParser (concrete)

There are unseparated equivalents of these concrete classes, and for unparsing, 
a symmetric set of separated, and unseparated concrete 
SequenceChildUnparser-derived classes.

Eventually there will be unordered equivalents of these as well.

That sounds like a lot, but really there's exactly three patterns for parsing:

* scalar (used for model groups and for scalar elements),

* arrays with specified occurrence counts (so no speculative parsing/PoUs) - 
there are two flavors: ExactN known, and OccursCount expression

* arrays/optionals with speculative parsing/PoUs - the code uses the 
abbreviation "...WithPoU" to identify this case in method names and some 
class/trait names, and "...WithoutPoU" for the other cases.

Each of the OrderedSequenceParserBase-derived parsers basically implements a 
loop for each sequence child, categorizes each into one of the three above 
patterns, and then for arrays/optionals has a loop invoking the sequence child 
parser in an appropriate manner. The sequence child parsers do not iterate 
anything. They package the sequence child for a single call for a single 
instance parse. It is the sequence parser that does the iteration over the 
multiple occurrences. The "MinMax" sequence child parsers deal with the 
complexity of when minOccurs matters, and when it is just advisory 
(occursCountKind='parsed'), and bounded and unbounded maxOccurs. These are not 
distinct parsers (as they were in the old architecture - e.g., we had a 
RepUnbounded which was only for maxOccurs unbounded case.)

Unparsers are simpler - there are no points-of-uncertainty for unparsing, so 
there are only 2 patterns: scalar, array/optional.

(Note: I may try to replace this terminology of Array/Optional - the code uses 
Array a lot when it means Array/Optional. I believe the terms should be Scalar, 
Specified, and "Speculative" occurrences - replacing the Array/Optional 
terminology.)

The thing to keep in mind is that the parsers derived from base 
OrderedSequenceParserBase are the driver combinator parsers which contain the 
iteration logic for all recurring/optional behavior and all 
point-of-uncertainty/backtracking behavior. The long code paths are in these.  
The SequenceChildParser classes only provide information for the driver 
combinator parsers to use. Logic about separators, when to insert them, etc. is 
all centralized only in the OrderedSeparatedSequenceParser.

(Unparsing works similarly in that the sequence unparsers are the driving 
loops, and the sequence child unparsers are slaves which only ever unparse a 
single occurrance at a time.)

Having explained what these new classes are, and roughly how they are 
organized, it's important to motivate *why* this complete flip of the 
architecture was necessary.

Before this refactoring, we had quite simple sequence iterators. They called 
quite complex scalar parsers or "RepParser" combined objects which were 
themselves invoking parsers composed of the needed parsers for separators in a 
static relationship to the terms being separated by way of the grammar rules. 
The iteration of the sequence children was trivial in the sequence combinator 
parser. The complex iteration of array/optional elements was in the RepParser 
classes. That architecture was simply not up to the job of handling the 
complexity of sequences with separators with the full generality of what DFDL 
requires. Specifically, that old architecture assumed a RepParser had enough 
information to know how to parse, assuming it was just iterating a parser 
composed of a separator glommed together with the term separated by it. The 
Infix separator parser for example, knew to check the PState to determine if it 
was first in a group, so as not to put an infix separator before the first 
element of a sequence. But this organization turns out to be at cross purposes 
with separator suppression logic. It is probably not impossible to get it to 
work, but it is a very complex state management problem due to the 
cross-cutting nature of separator suppression logic.

The particular challenge is for trailing separator suppression ie., 
"trailingEmpty" separator suppression policy. This really means moving past (I 
call it "absorbing") a separator, without creating a corresponding element in 
the infoset - i.e., it is the infoset content that is being suppressed really.

A suppressible trailing separator is identified by finding that the parse 
following it (or before it in the case of postfix) either succeeded - but 
created an empty string/empty hexbinary - or failed when trying to convert a 
zero-length string into a value. When this occurs for an optional element 
(occursIndex <= minOccurs) and the SequenceChildParser identifies the 
corresponding Term as potentially trailing (based on compile-time analysis), 
then another entirely distinct point of uncertainty is needed. In this case, a 
separator being suppressed is "on trial". It is only suppressible in the 
"trailingEmpty" separatorSuppressionPolicy if everything that follows it is 
also zero-length (ZL) so that all subsequent separators are also suppressible.. 
That is, the "on trial" separator must be proven to be "actually trailng".  
Note that by "everything after" this means in the entire sequence, not just in 
the array element itself, but beyond that term to the sibling terms after it to 
the end of the sequence. If a subsequent occurrence of anything after is *not* 
zero-length, then the potentially trailing separator fails to be "actually 
trailing", and we have to backtrack to the point of suppression,  restore the 
infoset changes that non-suppressed separator implies, and then continue 
parsing as if the suppression hadn't occurred. In theory this may lead to 
either a failure of some sort, or an array/optional element being created with 
empty-string/empty hexbinary as its value. Subsequently, a later separator 
might be identified as possibly suppressible and the speculation to determine 
"actually trailing" conceptually begins again.  All this added complexity is on 
top of the "ordinary" path for separators with speculative parsing, which is 
already a fairly intricate and long code path. Adding this additional point of 
backtrack for dealing with the separator suppression, which isn't isolated to 
one sequence child at a time, but cuts across all subsequent sequence 
children's behavior, is really where the new organization into 
OrderedSequenceParser(s) and SequenceChildParser(s) adds value.

An interesting point: dfdl:separatorSuppressionPolicy of 'trailingEmptyStrict' 
is kind of poorly named. This policy is really the "ordinary" behavior where no 
extra separators can exist for non-appearing optional elements. While it 
repeats the "trailingEmpty" part of the identifier, there's really nothing 
about trailing anything in this policy. We don't actually care if things are 
trailing or not in this policy. (Bug DAFFODIL-1948 is to change the default in 
general format to trailingEmptyStrict along with dfdl:occursCountKind default 
to "implicit").

The other dfdl:separatorSuppressionPolicy values are all "special cases" for 
particular format characteristics, and so should be selected specifically by 
the schema author. Policy 'never' and 'anyEmpty' are really quite simple to 
implement. Policy 'trailingEmpty' is the very complex one with the backtracking 
that cross-cuts the sequence children described above.

In my view, the new architecture packages up this complexity in a far cleaner 
and more maintainable manner. Even excluding the trailingEmpty complexity I 
believe the refactored design is far superior organization. Dealing with all 
this uncertainty and backtracking is complex code still, but it is "ordinary 
code complexity", i.e., a big complex method, and not decomposed into a bunch 
of state-machine-like distributed interacting pieces that communicate by 
side-effects on PState.

As an example of this, the PState.mpstate.occursBoundsStack is no longer 
needed, because the bound is just held in a regular method local value in the 
sequence parser method code. The recursive nature of parsing provides the 
"stack". This has the potential to reduce copying overhead for creating and 
restoring PStates when backtracking.

Finally: a number of ambiguities in the DFDL spec were identified in the course 
of this work, and those have been raised for clarification/discussion on the 
DFDL workgroup mailing list.

Other General Improvements:

Removed ability for any schema component to walk backrefs to a Term. That is, 
the 'term' method is removed. This required cleaning up the refactoring of Term 
from Decl/Def. A prior commit had done most of this refactoring, but this was 
further cleaned up.

The introduction of SequenceChild provides the split between a sharable DSOM 
term object, and its point of use in a sequence. The point of this split is 
that eventually the complex calculations that involve traversing backpointers 
will move from sharable terms to dedicated SequenceChild instances that 
reference them. E.g., both separator suppression and alignment require lots of 
looking around at the siblings of a Term within the surrounding model groups. 
Those calculations can use information taken from the shared say, global 
element decl, but they are actually part of the sequence that is referencing 
that global element decl. So this is one split between unsharable things with 
backpointers that can look around at siblings, and sharable things without 
backpointers.

So far only one baby step was taken in moving a calculation from the sharable 
object to the unshared. The HasTermCheck mixin was introduced, and it provides 
a checkTerm method overridden in classes that need to perform a check. That 
method is passed the Term. This allows a single shared DSOM object to implement 
a check that requires, for example, looking at the properties of each Term, as 
they are fully resolved/combined from all sources (e.g., from element-ref, 
element-decl, simple-type-def, base-simple-type-def, and from default formats 
surrounding those in their schema documents) to check that a particular 
constraint holds.

For example, in dfdl:assert with testPattern, there is a check if the pattern 
"\w" appears in the regex and the charset encoding is known at schema compile 
time to be UTF-8. A schema definition warning is issued in this case. That 
check formerly required backpointer navigation from the DFDLAssert statement 
object back to the Term that it ultimately executes on. A copy of that 
DFDLAssert had to be created corresponding to every term using the assert due 
to this backpointer. Now the check code is part of the shareable DFDLAssert 
object, but the Term is passed to it for the checking, avoiding the need for 
the backpointer. This will eventually allow us to save space by not allocating 
copies of sharable DSOM objects, or rather it is one more small step in that 
direction. The conversion of the code to not copy DFDLAssert objects is *not* 
done, but this particular case for the backpointer has been removed, and is a 
useful example.

Small additional changes:

* Got rid of everything about stop values and a bit more other unimplemented 
cruft that was cluttering the grammar rules.

Re: Status of work on separators and separator suppression

Reply via email to