On further study of the DFDL spec, section 14.2.2.2 makes it clear that trailingEmptyStrict tolerates non-trailing empty elements, and just absorbs the separators for them, so the behavior is not as I described.
So perhaps the right behavior by default is "anyEmpty". I still need to confirm these behaviors. The spec is not clear in this area at all, and needs corrections/errata. ________________________________ From: Mike Beckerle Sent: Friday, July 20, 2018 7:50:32 AM To: [email protected] Subject: FYI: Status of work on separators and separator suppression I have been at this problem of separators and separator suppression behavior in Daffodil for around 3 full weeks of my time now, so I wanted to explain the status of the work. The prior code-review pull-request will not be updated. I will incorporate the suggestions from it that are still relevant, but changes are very extensive since then, so I will create entirely new code review PRs when the code is ready for another review. My branch is daffodil-1919-separators now. Currently only 10 daffodil-test unit tests are failing, though I am running them in a mode where no round-trips are attempted, and unparser tests are not being run at all. Basically I haven't gotten to unparsing (other than that the scala code compiles), and won't until parsing passes all tests. Unparsing is fundamentally simpler but quite symmetric to parsing in this area of the code. All the complexity of separators and separator suppression takes place in parsing. The rest of this email details what all has been /is being done. Some variation of this material will go into the commit message eventually, some should become block comments in the code, and perhaps a wiki page. I'm not expecting detailed commentary on the below, though comments are welcome. I just wanted to write up a summary of what all has been taking so long, and why. Now I will get back to finishing what is described here, fixing the 10 parser bugs, etc. ----------------------- This is a large set of changes motivated by several things: The primary motivation is DAFFODIL-1919 - separators, and trailing suppression A secondary agenda is the need to eventually improve compilation speed and space. This a secondary, but ongoing effort. When opportunities arise to move things in the right direction here without major disruption, or in conjunction with an otherwise needed big change, then they are worth taking. Major Changes: Refactored the whole way repeating elements and optional elements work. Old way was that sequence combinators were trivial, and elements and model-groups had complex parsers that dealt with the complexities of things like whether or not to put in an infix separator. This was too decentralized to handle the full complexity of separators with dfdl:separatorSuppressionPolicy. The new organization has sequence parsers that have the sophistication in them. There are these new grammar combinators that correspond to a sequence group of the DFDL schema: OrderedSeparatedSequence OrderedUnseparatedSequence ChoiceBranchImpliedSequence - A variation on OrderedUnseparatedSequence. (Note: Some of these classes/traits may get renamed based on code-review, etc.) The DSOM base class SeqTermBase was refactored to allow this implied sequence (when a repeating element is a child of a choice, it's occurrences are in a sequence that cannot have separators. Eventually there will need to be unordered versions as well, but unordered is not yet implemented. These grammar combinators work in conjunction with individual "wrapper combinators" (combinators of exactly one child term) derived from SequenceChild. This base class represents a single declared child of a sequence. The SequenceChild objects gather information about the Term, such as whether it is scalar or repeating, has separators, is ordered, has type string or hexBinary (which require special treatment for separator suppression situations), simple or complex type, is potentially trailing, etc. (There are other code organization benefits of the SequenceChild organization mentioned below under General Improvements.) The new grammar combinators for sequences each create a Parser and an Unparser, so: OrderedSeparatedSequenceParser OrderedSeparatedSequenceUnparser OrderedUnseparatedSequenceParser OrderedUnseparatedSequenceUnparser The ChoiceBranchImpliedSequence combinator doesn't have its own parser/unparser. It creates an OrderedUnseparatedSequenceParser by way of it being a kind of ordered sequence that has no separator. There are unparser equivalents to these. These new parsers centralize the handling of points of uncertainty or PoUs for optional and repeating elements. These parsers work in conjunction with wrapper parser combinators (parser combinators that orchestrate exactly one child parser) which are derived from SequenceChildParser (and for unparsing, which is very symmetric to parsing in this area of Daffodil) SequenceChildUnparser. There is a complex of mixins and base classes for creating SequenceChildParser classes, including: SequenceChildParser (base) RepeatingChildParser (base) OccursCountExactParser (base) OccursCountExpressionParser (base) OccursCountMinMaxParser (base) >From these, specific mixins and base classes for the separated and unseparated >sequence cases include: Separated (mixin trait) Unseparated (mixin trait) RepeatingSeparatedPoU (mixin trait) RepeatingUnseparatedPoU (mixin trait) ScalarOrderedSeparatedSequenceChildParser (concrete) RepOrderedExactlyNSeparatedSequenceChildParser (concrete) RepOrderedExactlyTotalOccursCountSeparatedSequenceChildParser (concrete) RepOrderedWithMinMaxSeparatedSequenceChildParser (concrete) There are unseparated equivalents of these concrete classes, and for unparsing, a symmetric set of separated, and unseparated concrete SequenceChildUnparser-derived classes. Eventually there will be unordered equivalents of these as well. That sounds like a lot, but really there's exactly three patterns for parsing: * scalar (used for model groups and for scalar elements), * arrays with specified occurrence counts (so no speculative parsing/PoUs) - there are two flavors: ExactN known, and OccursCount expression * arrays/optionals with speculative parsing/PoUs - the code uses the abbreviation "...WithPoU" to identify this case in method names and some class/trait names, and "...WithoutPoU" for the other cases. Each of the OrderedSequenceParserBase-derived parsers basically implements a loop for each sequence child, categorizes each into one of the three above patterns, and then for arrays/optionals has a loop invoking the sequence child parser in an appropriate manner. The sequence child parsers do not iterate anything. They package the sequence child for a single call for a single instance parse. It is the sequence parser that does the iteration over the multiple occurrences. The "MinMax" sequence child parsers deal with the complexity of when minOccurs matters, and when it is just advisory (occursCountKind='parsed'), and bounded and unbounded maxOccurs. These are not distinct parsers (as they were in the old architecture - e.g., we had a RepUnbounded which was only for maxOccurs unbounded case.) Unparsers are simpler - there are no points-of-uncertainty for unparsing, so there are only 2 patterns: scalar, array/optional. (Note: I may try to replace this terminology of Array/Optional - the code uses Array a lot when it means Array/Optional. I believe the terms should be Scalar, Specified, and "Speculative" occurrences - replacing the Array/Optional terminology.) The thing to keep in mind is that the parsers derived from base OrderedSequenceParserBase are the driver combinator parsers which contain the iteration logic for all recurring/optional behavior and all point-of-uncertainty/backtracking behavior. The long code paths are in these. The SequenceChildParser classes only provide information for the driver combinator parsers to use. Logic about separators, when to insert them, etc. is all centralized only in the OrderedSeparatedSequenceParser. (Unparsing works similarly in that the sequence unparsers are the driving loops, and the sequence child unparsers are slaves which only ever unparse a single occurrance at a time.) Having explained what these new classes are, and roughly how they are organized, it's important to motivate *why* this complete flip of the architecture was necessary. Before this refactoring, we had quite simple sequence iterators. They called quite complex scalar parsers or "RepParser" combined objects which were themselves invoking parsers composed of the needed parsers for separators in a static relationship to the terms being separated by way of the grammar rules. The iteration of the sequence children was trivial in the sequence combinator parser. The complex iteration of array/optional elements was in the RepParser classes. That architecture was simply not up to the job of handling the complexity of sequences with separators with the full generality of what DFDL requires. Specifically, that old architecture assumed a RepParser had enough information to know how to parse, assuming it was just iterating a parser composed of a separator glommed together with the term separated by it. The Infix separator parser for example, knew to check the PState to determine if it was first in a group, so as not to put an infix separator before the first element of a sequence. But this organization turns out to be at cross purposes with separator suppression logic. It is probably not impossible to get it to work, but it is a very complex state management problem due to the cross-cutting nature of separator suppression logic. The particular challenge is for trailing separator suppression ie., "trailingEmpty" separator suppression policy. This really means moving past (I call it "absorbing") a separator, without creating a corresponding element in the infoset - i.e., it is the infoset content that is being suppressed really. A suppressible trailing separator is identified by finding that the parse following it (or before it in the case of postfix) either succeeded - but created an empty string/empty hexbinary - or failed when trying to convert a zero-length string into a value. When this occurs for an optional element (occursIndex <= minOccurs) and the SequenceChildParser identifies the corresponding Term as potentially trailing (based on compile-time analysis), then another entirely distinct point of uncertainty is needed. In this case, a separator being suppressed is "on trial". It is only suppressible in the "trailingEmpty" separatorSuppressionPolicy if everything that follows it is also zero-length (ZL) so that all subsequent separators are also suppressible.. That is, the "on trial" separator must be proven to be "actually trailng". Note that by "everything after" this means in the entire sequence, not just in the array element itself, but beyond that term to the sibling terms after it to the end of the sequence. If a subsequent occurrence of anything after is *not* zero-length, then the potentially trailing separator fails to be "actually trailing", and we have to backtrack to the point of suppression, restore the infoset changes that non-suppressed separator implies, and then continue parsing as if the suppression hadn't occurred. In theory this may lead to either a failure of some sort, or an array/optional element being created with empty-string/empty hexbinary as its value. Subsequently, a later separator might be identified as possibly suppressible and the speculation to determine "actually trailing" conceptually begins again. All this added complexity is on top of the "ordinary" path for separators with speculative parsing, which is already a fairly intricate and long code path. Adding this additional point of backtrack for dealing with the separator suppression, which isn't isolated to one sequence child at a time, but cuts across all subsequent sequence children's behavior, is really where the new organization into OrderedSequenceParser(s) and SequenceChildParser(s) adds value. An interesting point: dfdl:separatorSuppressionPolicy of 'trailingEmptyStrict' is kind of poorly named. This policy is really the "ordinary" behavior where no extra separators can exist for non-appearing optional elements. While it repeats the "trailingEmpty" part of the identifier, there's really nothing about trailing anything in this policy. We don't actually care if things are trailing or not in this policy. (Bug DAFFODIL-1948 is to change the default in general format to trailingEmptyStrict along with dfdl:occursCountKind default to "implicit"). The other dfdl:separatorSuppressionPolicy values are all "special cases" for particular format characteristics, and so should be selected specifically by the schema author. Policy 'never' and 'anyEmpty' are really quite simple to implement. Policy 'trailingEmpty' is the very complex one with the backtracking that cross-cuts the sequence children described above. In my view, the new architecture packages up this complexity in a far cleaner and more maintainable manner. Even excluding the trailingEmpty complexity I believe the refactored design is far superior organization. Dealing with all this uncertainty and backtracking is complex code still, but it is "ordinary code complexity", i.e., a big complex method, and not decomposed into a bunch of state-machine-like distributed interacting pieces that communicate by side-effects on PState. As an example of this, the PState.mpstate.occursBoundsStack is no longer needed, because the bound is just held in a regular method local value in the sequence parser method code. The recursive nature of parsing provides the "stack". This has the potential to reduce copying overhead for creating and restoring PStates when backtracking. Finally: a number of ambiguities in the DFDL spec were identified in the course of this work, and those have been raised for clarification/discussion on the DFDL workgroup mailing list. Other General Improvements: Removed ability for any schema component to walk backrefs to a Term. That is, the 'term' method is removed. This required cleaning up the refactoring of Term from Decl/Def. A prior commit had done most of this refactoring, but this was further cleaned up. The introduction of SequenceChild provides the split between a sharable DSOM term object, and its point of use in a sequence. The point of this split is that eventually the complex calculations that involve traversing backpointers will move from sharable terms to dedicated SequenceChild instances that reference them. E.g., both separator suppression and alignment require lots of looking around at the siblings of a Term within the surrounding model groups. Those calculations can use information taken from the shared say, global element decl, but they are actually part of the sequence that is referencing that global element decl. So this is one split between unsharable things with backpointers that can look around at siblings, and sharable things without backpointers. So far only one baby step was taken in moving a calculation from the sharable object to the unshared. The HasTermCheck mixin was introduced, and it provides a checkTerm method overridden in classes that need to perform a check. That method is passed the Term. This allows a single shared DSOM object to implement a check that requires, for example, looking at the properties of each Term, as they are fully resolved/combined from all sources (e.g., from element-ref, element-decl, simple-type-def, base-simple-type-def, and from default formats surrounding those in their schema documents) to check that a particular constraint holds. For example, in dfdl:assert with testPattern, there is a check if the pattern "\w" appears in the regex and the charset encoding is known at schema compile time to be UTF-8. A schema definition warning is issued in this case. That check formerly required backpointer navigation from the DFDLAssert statement object back to the Term that it ultimately executes on. A copy of that DFDLAssert had to be created corresponding to every term using the assert due to this backpointer. Now the check code is part of the shareable DFDLAssert object, but the Term is passed to it for the checking, avoiding the need for the backpointer. This will eventually allow us to save space by not allocating copies of sharable DSOM objects, or rather it is one more small step in that direction. The conversion of the code to not copy DFDLAssert objects is *not* done, but this particular case for the backpointer has been removed, and is a useful example. Small additional changes: * Got rid of everything about stop values and a bit more other unimplemented cruft that was cluttering the grammar rules.
