I have been at this problem of separators and separator suppression behavior in
Daffodil for around 3 full weeks of my time now, so I wanted to explain the
status of the work.
The prior code-review pull-request will not be updated. I will incorporate the
suggestions from it that are still relevant, but changes are very extensive
since then, so I will create entirely new code review PRs when the code is
ready for another review. My branch is daffodil-1919-separators now.
Currently only 10 daffodil-test unit tests are failing, though I am running
them in a mode where no round-trips are attempted, and unparser tests are not
being run at all. Basically I haven't gotten to unparsing (other than that the
scala code compiles), and won't until parsing passes all tests. Unparsing is
fundamentally simpler but quite symmetric to parsing in this area of the code.
All the complexity of separators and separator suppression takes place in
parsing.
The rest of this email details what all has been /is being done. Some variation
of this material will go into the commit message eventually, some should become
block comments in the code, and perhaps a wiki page. I'm not expecting
detailed commentary on the below, though comments are welcome. I just wanted to
write up a summary of what all has been taking so long, and why.
Now I will get back to finishing what is described here, fixing the 10 parser
bugs, etc.
-----------------------
This is a large set of changes motivated by several things:
The primary motivation is DAFFODIL-1919 - separators, and trailing suppression
A secondary agenda is the need to eventually improve compilation speed and
space. This a secondary, but ongoing effort. When opportunities arise to move
things in the right direction here without major disruption, or in conjunction
with an otherwise needed big change, then they are worth taking.
Major Changes:
Refactored the whole way repeating elements and optional elements work. Old way
was that sequence combinators were trivial, and elements and model-groups had
complex parsers that dealt with the complexities of things like whether or not
to put in an infix separator. This was too decentralized to handle the full
complexity of separators with dfdl:separatorSuppressionPolicy.
The new organization has sequence parsers that have the sophistication in them.
There are these new grammar combinators that correspond to a sequence group of
the DFDL schema:
OrderedSeparatedSequence
OrderedUnseparatedSequence
ChoiceBranchImpliedSequence - A variation on OrderedUnseparatedSequence.
(Note: Some of these classes/traits may get renamed based on code-review, etc.)
The DSOM base class SeqTermBase was refactored to allow this implied sequence
(when a repeating element is a child of a choice, it's occurrences are in a
sequence that cannot have separators.
Eventually there will need to be unordered versions as well, but unordered is
not yet implemented.
These grammar combinators work in conjunction with individual "wrapper
combinators" (combinators of exactly one child term) derived from
SequenceChild. This base class represents a single declared child of a
sequence. The SequenceChild objects gather information about the Term, such as
whether it is scalar or repeating, has separators, is ordered, has type string
or hexBinary (which require special treatment for separator suppression
situations), simple or complex type, is potentially trailing, etc. (There are
other code organization benefits of the SequenceChild organization mentioned
below under General Improvements.)
The new grammar combinators for sequences each create a Parser and an Unparser,
so:
OrderedSeparatedSequenceParser
OrderedSeparatedSequenceUnparser
OrderedUnseparatedSequenceParser
OrderedUnseparatedSequenceUnparser
The ChoiceBranchImpliedSequence combinator doesn't have its own
parser/unparser. It creates an OrderedUnseparatedSequenceParser by way of it
being a kind of ordered sequence that has no separator. There are unparser
equivalents to these.
These new parsers centralize the handling of points of uncertainty or PoUs for
optional and repeating elements.
These parsers work in conjunction with wrapper parser combinators (parser
combinators that orchestrate exactly one child parser) which are derived from
SequenceChildParser (and for unparsing, which is very symmetric to parsing in
this area of Daffodil) SequenceChildUnparser.
There is a complex of mixins and base classes for creating SequenceChildParser
classes, including:
SequenceChildParser (base)
RepeatingChildParser (base)
OccursCountExactParser (base)
OccursCountExpressionParser (base)
OccursCountMinMaxParser (base)
>From these, specific mixins and base classes for the separated and unseparated
>sequence cases include:
Separated (mixin trait)
Unseparated (mixin trait)
RepeatingSeparatedPoU (mixin trait)
RepeatingUnseparatedPoU (mixin trait)
ScalarOrderedSeparatedSequenceChildParser (concrete)
RepOrderedExactlyNSeparatedSequenceChildParser (concrete)
RepOrderedExactlyTotalOccursCountSeparatedSequenceChildParser (concrete)
RepOrderedWithMinMaxSeparatedSequenceChildParser (concrete)
There are unseparated equivalents of these concrete classes, and for unparsing,
a symmetric set of separated, and unseparated concrete
SequenceChildUnparser-derived classes.
Eventually there will be unordered equivalents of these as well.
That sounds like a lot, but really there's exactly three patterns for parsing:
* scalar (used for model groups and for scalar elements),
* arrays with specified occurrence counts (so no speculative parsing/PoUs) -
there are two flavors: ExactN known, and OccursCount expression
* arrays/optionals with speculative parsing/PoUs - the code uses the
abbreviation "...WithPoU" to identify this case in method names and some
class/trait names, and "...WithoutPoU" for the other cases.
Each of the OrderedSequenceParserBase-derived parsers basically implements a
loop for each sequence child, categorizes each into one of the three above
patterns, and then for arrays/optionals has a loop invoking the sequence child
parser in an appropriate manner. The sequence child parsers do not iterate
anything. They package the sequence child for a single call for a single
instance parse. It is the sequence parser that does the iteration over the
multiple occurrences. The "MinMax" sequence child parsers deal with the
complexity of when minOccurs matters, and when it is just advisory
(occursCountKind='parsed'), and bounded and unbounded maxOccurs. These are not
distinct parsers (as they were in the old architecture - e.g., we had a
RepUnbounded which was only for maxOccurs unbounded case.)
Unparsers are simpler - there are no points-of-uncertainty for unparsing, so
there are only 2 patterns: scalar, array/optional.
(Note: I may try to replace this terminology of Array/Optional - the code uses
Array a lot when it means Array/Optional. I believe the terms should be Scalar,
Specified, and "Speculative" occurrences - replacing the Array/Optional
terminology.)
The thing to keep in mind is that the parsers derived from base
OrderedSequenceParserBase are the driver combinator parsers which contain the
iteration logic for all recurring/optional behavior and all
point-of-uncertainty/backtracking behavior. The long code paths are in these.
The SequenceChildParser classes only provide information for the driver
combinator parsers to use. Logic about separators, when to insert them, etc. is
all centralized only in the OrderedSeparatedSequenceParser.
(Unparsing works similarly in that the sequence unparsers are the driving
loops, and the sequence child unparsers are slaves which only ever unparse a
single occurrance at a time.)
Having explained what these new classes are, and roughly how they are
organized, it's important to motivate *why* this complete flip of the
architecture was necessary.
Before this refactoring, we had quite simple sequence iterators. They called
quite complex scalar parsers or "RepParser" combined objects which were
themselves invoking parsers composed of the needed parsers for separators in a
static relationship to the terms being separated by way of the grammar rules.
The iteration of the sequence children was trivial in the sequence combinator
parser. The complex iteration of array/optional elements was in the RepParser
classes. That architecture was simply not up to the job of handling the
complexity of sequences with separators with the full generality of what DFDL
requires. Specifically, that old architecture assumed a RepParser had enough
information to know how to parse, assuming it was just iterating a parser
composed of a separator glommed together with the term separated by it. The
Infix separator parser for example, knew to check the PState to determine if it
was first in a group, so as not to put an infix separator before the first
element of a sequence. But this organization turns out to be at cross purposes
with separator suppression logic. It is probably not impossible to get it to
work, but it is a very complex state management problem due to the
cross-cutting nature of separator suppression logic.
The particular challenge is for trailing separator suppression ie.,
"trailingEmpty" separator suppression policy. This really means moving past (I
call it "absorbing") a separator, without creating a corresponding element in
the infoset - i.e., it is the infoset content that is being suppressed really.
A suppressible trailing separator is identified by finding that the parse
following it (or before it in the case of postfix) either succeeded - but
created an empty string/empty hexbinary - or failed when trying to convert a
zero-length string into a value. When this occurs for an optional element
(occursIndex <= minOccurs) and the SequenceChildParser identifies the
corresponding Term as potentially trailing (based on compile-time analysis),
then another entirely distinct point of uncertainty is needed. In this case, a
separator being suppressed is "on trial". It is only suppressible in the
"trailingEmpty" separatorSuppressionPolicy if everything that follows it is
also zero-length (ZL) so that all subsequent separators are also suppressible..
That is, the "on trial" separator must be proven to be "actually trailng".
Note that by "everything after" this means in the entire sequence, not just in
the array element itself, but beyond that term to the sibling terms after it to
the end of the sequence. If a subsequent occurrence of anything after is *not*
zero-length, then the potentially trailing separator fails to be "actually
trailing", and we have to backtrack to the point of suppression, restore the
infoset changes that non-suppressed separator implies, and then continue
parsing as if the suppression hadn't occurred. In theory this may lead to
either a failure of some sort, or an array/optional element being created with
empty-string/empty hexbinary as its value. Subsequently, a later separator
might be identified as possibly suppressible and the speculation to determine
"actually trailing" conceptually begins again. All this added complexity is on
top of the "ordinary" path for separators with speculative parsing, which is
already a fairly intricate and long code path. Adding this additional point of
backtrack for dealing with the separator suppression, which isn't isolated to
one sequence child at a time, but cuts across all subsequent sequence
children's behavior, is really where the new organization into
OrderedSequenceParser(s) and SequenceChildParser(s) adds value.
An interesting point: dfdl:separatorSuppressionPolicy of 'trailingEmptyStrict'
is kind of poorly named. This policy is really the "ordinary" behavior where no
extra separators can exist for non-appearing optional elements. While it
repeats the "trailingEmpty" part of the identifier, there's really nothing
about trailing anything in this policy. We don't actually care if things are
trailing or not in this policy. (Bug DAFFODIL-1948 is to change the default in
general format to trailingEmptyStrict along with dfdl:occursCountKind default
to "implicit").
The other dfdl:separatorSuppressionPolicy values are all "special cases" for
particular format characteristics, and so should be selected specifically by
the schema author. Policy 'never' and 'anyEmpty' are really quite simple to
implement. Policy 'trailingEmpty' is the very complex one with the backtracking
that cross-cuts the sequence children described above.
In my view, the new architecture packages up this complexity in a far cleaner
and more maintainable manner. Even excluding the trailingEmpty complexity I
believe the refactored design is far superior organization. Dealing with all
this uncertainty and backtracking is complex code still, but it is "ordinary
code complexity", i.e., a big complex method, and not decomposed into a bunch
of state-machine-like distributed interacting pieces that communicate by
side-effects on PState.
As an example of this, the PState.mpstate.occursBoundsStack is no longer
needed, because the bound is just held in a regular method local value in the
sequence parser method code. The recursive nature of parsing provides the
"stack". This has the potential to reduce copying overhead for creating and
restoring PStates when backtracking.
Finally: a number of ambiguities in the DFDL spec were identified in the course
of this work, and those have been raised for clarification/discussion on the
DFDL workgroup mailing list.
Other General Improvements:
Removed ability for any schema component to walk backrefs to a Term. That is,
the 'term' method is removed. This required cleaning up the refactoring of Term
from Decl/Def. A prior commit had done most of this refactoring, but this was
further cleaned up.
The introduction of SequenceChild provides the split between a sharable DSOM
term object, and its point of use in a sequence. The point of this split is
that eventually the complex calculations that involve traversing backpointers
will move from sharable terms to dedicated SequenceChild instances that
reference them. E.g., both separator suppression and alignment require lots of
looking around at the siblings of a Term within the surrounding model groups.
Those calculations can use information taken from the shared say, global
element decl, but they are actually part of the sequence that is referencing
that global element decl. So this is one split between unsharable things with
backpointers that can look around at siblings, and sharable things without
backpointers.
So far only one baby step was taken in moving a calculation from the sharable
object to the unshared. The HasTermCheck mixin was introduced, and it provides
a checkTerm method overridden in classes that need to perform a check. That
method is passed the Term. This allows a single shared DSOM object to implement
a check that requires, for example, looking at the properties of each Term, as
they are fully resolved/combined from all sources (e.g., from element-ref,
element-decl, simple-type-def, base-simple-type-def, and from default formats
surrounding those in their schema documents) to check that a particular
constraint holds.
For example, in dfdl:assert with testPattern, there is a check if the pattern
"\w" appears in the regex and the charset encoding is known at schema compile
time to be UTF-8. A schema definition warning is issued in this case. That
check formerly required backpointer navigation from the DFDLAssert statement
object back to the Term that it ultimately executes on. A copy of that
DFDLAssert had to be created corresponding to every term using the assert due
to this backpointer. Now the check code is part of the shareable DFDLAssert
object, but the Term is passed to it for the checking, avoiding the need for
the backpointer. This will eventually allow us to save space by not allocating
copies of sharable DSOM objects, or rather it is one more small step in that
direction. The conversion of the code to not copy DFDLAssert objects is *not*
done, but this particular case for the backpointer has been removed, and is a
useful example.
Small additional changes:
* Got rid of everything about stop values and a bit more other unimplemented
cruft that was cluttering the grammar rules.