This is an automated email from the ASF dual-hosted git repository. mbeckerle pushed a commit to branch revert-17-master in repository https://gitbox.apache.org/repos/asf/incubator-daffodil-site.git
commit 765f56e00a221a34f44414fbea655446fdbccf25 Author: Mike Beckerle <[email protected]> AuthorDate: Mon Feb 10 18:26:20 2020 -0500 Revert "New asciidoc design notes on schema compiler space/speed issue." This reverts commit ed8a08ca591896c4bfe3c978b083993736215729. --- site/dev/aboutAsciiDoc.adoc | 89 +-- site/dev/design-notes/hidden-groups.adoc | 95 --- .../namespace-binding-minimization.adoc | 464 -------------- .../term-sharing-in-schema-compiler.adoc | 669 --------------------- 4 files changed, 47 insertions(+), 1270 deletions(-) diff --git a/site/dev/aboutAsciiDoc.adoc b/site/dev/aboutAsciiDoc.adoc index 7ec9593..971a136 100644 --- a/site/dev/aboutAsciiDoc.adoc +++ b/site/dev/aboutAsciiDoc.adoc @@ -69,9 +69,23 @@ image::http://daffodil.apache.org/assets/themes/apache/img/apache-daffodil-logo. == Example Diagrams There are numerous kinds of diagrams possible. +== GraphViz Record-Based Nodes Diagram +https://graphviz.gitlab.io/documentation/[GraphViz] is probably the most powerful of the various diagram tools, but with that power comes complexity that some of the other diagram types are able to overcome by being more restrictive. -=== PlantUML Diagram +This is an example of graphViz https://graphviz.gitlab.io/_pages/doc/info/shapes.html#record[Record-based Nodes] which are very useful for box diagrams showing data layouts when a packetdiag is too rigid. +[graphviz] +.... +digraph structs { + node [shape=record]; + struct1 [label="<f0> left|<f1> mid\ dle|<f2> right"]; + struct2 [label="<f0> one|<f1> two"]; + struct3 [label="hello\nworld |{ b |{c|<here> d|e}| f}| g | h"]; + struct1:f1 -> struct2:f0; + struct1:f2 -> struct3:here; +} +.... +=== PlantUML Diagram http://plantuml.com/[PlantUML] is a component that allows to quickly write: * Sequence diagram @@ -96,10 +110,6 @@ The following non-UML diagrams are also supported: * Mathematic with AsciiMath or JLaTeXMath notation * Entity Relationship diagram -An www.plantuml.com/plantuml/uml[online PlantUML authoring tool] is available. -Though just saving and refreshing the page in a web browser (usually they auto-refresh -after 1 second) is also pretty easy. - Here is a class diagram example: [plantuml, target="diagram-classes", format="png"] @@ -132,10 +142,39 @@ Foo1 -> Foo5 : To database Foo1 -> Foo6 : To collections .... -=== Ditaa Diagram -The http://ditaa.sourceforge.net/[DITAA] graphics are ASCII-Art converted into smoother looking drawings. They have the advantage of being visual in the text file. +=== Packet Diagram - Easier but Limited to MSBF, Left-to-Right +http://blockdiag.com/en/nwdiag/packetdiag-examples.html[Packet diagrams] are useful for any time you want to show data layouts at the level of bits, bytes, and words. It assumes bit order is _most significant bit first_ and that you want to number the bits from left to right. + +This example supplies a target name for the graphic, which allows it to be reused not just in this document, but in others also. We specify SVG as the format since there appears to be bugs in PNG format for packetdiag. +[packetdiag, target="test_packet_diagram_1", format="svg"] +.... +packetdiag { + colwidth = 32 + node_height = 24 + 0-15: Source Port + 16-31: Destination Port + 32-63: Sequence Number + 64-95: Acknowledgment Number + 96-99: Data Offset + 100-105: Reserved + 106: URG [rotate = 270] + 107: ACK [rotate = 270] + 108: PSH [rotate = 270] + 109: RST [rotate = 270] + 110: SYN [rotate = 270] + 111: FIN [rotate = 270] + 112-127: Window + 128-143: Checksum + 144-159: Urgent Pointer + 160-191: (Options and Padding) + 192-223: data [colheight = 2] +} +.... -These diagrams are quite easily drawn then cut/pasted to/from actual asciidoc digrams using an online tool called http://asciiflow.com/[asciiflow]. + +=== Ditaa Diagram +The http://ditaa.sourceforge.net/[DITAA] graphics are ASCII-Art converted into smoother looking drawings. They have the advantage of being visual in the text file, but the challenge of needing to be +laid out by hand. [ditaa] .... @@ -157,39 +196,6 @@ These diagrams are quite easily drawn then cut/pasted to/from actual asciidoc di +-----------------------------------+ .... -Note: below is the widest ditaa diagram you can draw that will fit in the line-length that -github's code-review/pull-request window can display without line-wrap. -If the lines wrap in your ditaa diagram you really lose the ability to comment on them -in their textual form in the asciidoc: - -[ditaa] -.... -+--------------------------------------------------------------------------------------------------------------------+ -| make ditaa diagrams no wider than this box | -+--------------------------------------------------------------------------------------------------------------------+ -.... - -// Here's that as a comment you can cut/paste into an asciidoc file -// -// DITAA max line length (for github) ruler -------------------------------------------------------------------------| -// - -== GraphViz Record-Based Nodes Diagram -https://graphviz.gitlab.io/documentation/[GraphViz] is probably the most powerful of the various diagram tools, but with that power comes complexity that some of the other diagram types are able to overcome by being more restrictive. - -This is an example of graphViz https://graphviz.gitlab.io/_pages/doc/info/shapes.html#record[Record-based Nodes] which are very useful for box diagrams showing data layouts when a packetdiag is too rigid. -[graphviz] -.... -digraph structs { - node [shape=record]; - struct1 [label="<f0> left|<f1> mid\ dle|<f2> right"]; - struct2 [label="<f0> one|<f1> two"]; - struct3 [label="hello\nworld |{ b |{c|<here> d|e}| f}| g | h"]; - struct1:f1 -> struct2:f0; - struct1:f2 -> struct3:here; -} -.... - == Example Block Diagram The http://blockdiag.com/en/blockdiag/[blockdiag] diagram type is for basic box/arrow diagrams. @@ -206,7 +212,6 @@ blockdiag { "very easy!" [color = "orange"]; } .... - === Sequence Diagram The http://blockdiag.com/en/seqdiag/[seqdiag] diagram type creates sequence diagrams. diff --git a/site/dev/design-notes/hidden-groups.adoc b/site/dev/design-notes/hidden-groups.adoc deleted file mode 100644 index e8f03d4..0000000 --- a/site/dev/design-notes/hidden-groups.adoc +++ /dev/null @@ -1,95 +0,0 @@ -:page-layout: page -:keywords: schema-compiler performance alignment optimization -// /////////////////////////////////////////////////////////////////////////// -// -// This file is written in AsciiDoc. -// -// If you can read this comment, your browser is not rendering asciidoc automatically. -// -// You need to install the asciidoc plugin to Chrome or Firefox -// so that this page will be properly rendered for your viewing pleasure. -// -// You can get the plugins by searching the web for 'asciidoc plugin' -// -// You will want to change plugin settings to enable diagrams (they're off by default.) -// -// You need to view this page with Chrome or Firefox. -// -// /////////////////////////////////////////////////////////////////////////// -// -// When editing, please start each sentence on a new line. -// See https://asciidoctor.org/docs/asciidoc-recommended-practices/#one-sentence-per-line[one sentence-per-line writing technique.] -// This makes textual diffs of this file useful in a similar way to the way they work for code. -// -// ////////////////////////////////////////////////////////////////////////// - -== Hidden Groups - -=== Introduction - -The DFDL language enables groups to be hidden, meaning not part of the logical infoset. - -This is achieved by way of _hidden group references_. -This implies that a global group definition can be referenced from some group references which are ordinary, non-hidden, group references, and from other hidden group references. -Hence, the contents of the group definition are not themselves inherently hidden or not, as the hiddenness is based on how the group is referenced. - -Of course if all uses of a particular group definition are hidden group references, then the schema compiler could provide that all elements within that group are always hidden. -This is expected to be fairly common, as often hidden groups are designed to contain the hidden reprsentation details of data that has a simpler logical appearance in the infoset. - -Nevertheless, Daffodil must accomodate that the same group definition can be used both in hidden and non-hidden manners. - -=== Static Analysis by the Schema Compiler (i.e., there is none) - -The schema compiler could provide information about whether a given term of the schema is always hidden, sometimes hidden, or never hidden. -However, the runtime overheads are anticipated to be so small that optimizations based on this information are expected to be unnecessary. -Hence, no static analysis of whether a term is always/sometimes/never hidden is performed. - -=== Runtime Principles of Operation - -The implementation takes advantage of the processor (parser/unparser) traversal of the DFDL schema components following an in-order traversal pattern, so that a stack-discpline can be used when traversing hidden-group references. No actual stack is needed, only a counter. -The details are: - -* The PState/UState contains a counter (initial value 0) of the number of hidden group references currently in the nest. -* When parsing/unparsing a hidden group ref, this counter is incremented. -When unwinding from that unparse, the counter is decremented. -No action is required for non-hidden group references. -* When an infoset element is created, the setHidden() method is called if the counter is non-zero. -* At the end of processing, the counter should be zero. - -=== Runtime Data Structures - Runtime 1 - -In Daffodil's "Runtime 1" backend, these are the runtime data structure features to support hidden group references: - -* Infoset elements have an isHidden() boolean method, and a setHidden() setter. -* The PState/UState contains the counter described above. -* The ModelGroupRuntimeData object contains an isHidden flag that is true if the term corresponds to a hidden group reference. -It is this flag that is inspected to determine if the PState/UState counter should be incremented/decremented or not. -* The increment/decrement logic is centralized in Parser.parse1() method and Unparser.unparse1() method. -These inspect the current TermRuntimeData object, and when it is a ModelGroupRuntimeData, they check the isHidden flag. -* At the end of processing where checking that the various runtime-stacks and pools are properly emptied/restored, an additional check is done to insure the hidden counter of the PState/UState has been returned to zero. -This check need only be performed on normal exits. If the processor terminates abnormally, we needn't check. -* The debugger/trace should display, as part of the state, whether the current context isHidden (counter > 0) or not. -* Parser backtracking must save/restore the counter. -* Unparser suspended UStates do not need to copy the counter value. That is, this counter need only exist in UStateMain. -** This holds because infoset elements are created as part of the state of suspensions, and those infoset elements will have setHidden() called or not depending on the counter. Hence, the counter value of a suspended UState is not needed when evaluating a suspension. - -=== Testing - -* Tests cover where the same group definition is used both hidden and non-hidden within the same schema. -* Tests cover this for hidden sequences and hidden choices. -Note that a hidden group references always uses an xs:sequence element, but the referenced group can be a choice group definition. - - -== Transition Plan From Daffodil 2.5.0 - -(Delete this section once implementation is complete.) - -The existing v2.5.0 isHidden implementation consists of code that populates a member of class ElementRuntimeData (Runtime 1) isHidden member. - -All code for populating that member, and the member itself, can be removed. Anything it calls which is used only for that purpose can be removed. - -Any code that assumes the isHidden characteristic of elements is static information must be removed. - -All access to whether an element isHidden must call the isHidden method of InfosetElement (aka DIElement) object instead of on the ERD (ElementRuntimeData) object. - -There is no conversion of existing code to a new design, as the entire old mechanism, and its underlying assumptions, are being removed, and fully replaced by the new mechanism. diff --git a/site/dev/design-notes/namespace-binding-minimization.adoc b/site/dev/design-notes/namespace-binding-minimization.adoc deleted file mode 100644 index 92ccfbb..0000000 --- a/site/dev/design-notes/namespace-binding-minimization.adoc +++ /dev/null @@ -1,464 +0,0 @@ -:page-layout: page -:keywords: schema-compiler performance alignment optimization -// /////////////////////////////////////////////////////////////////////////// -// -// This file is written in AsciiDoc. -// -// If you can read this comment, your browser is not rendering asciidoc automatically. -// -// You need to install the asciidoc plugin to Chrome or Firefox -// so that this page will be properly rendered for your viewing pleasure. -// -// You can get the plugins by searching the web for 'asciidoc plugin' -// -// You will want to change plugin settings to enable diagrams (they're off by default.) -// -// You need to view this page with Chrome or Firefox. -// -// /////////////////////////////////////////////////////////////////////////// -// -// When editing, please start each sentence on a new line. -// See https://asciidoctor.org/docs/asciidoc-recommended-practices/#one-sentence-per-line[one sentence-per-line writing technique.] -// This makes textual diffs of this file useful in a similar way to the way they work for code. -// -// ////////////////////////////////////////////////////////////////////////// - -== Namespace Binding Minimization - -=== Introduction - -DFDL schemas are XML schemas and so DFDL inherits the namespace system of XML and XML Schema for composing large schemas from smaller ones, for reusing schema files, and for managing name conflicts. - -A DFDL Infoset isn't necessarily represented as XML however. -Some representations won't have any ability to deal with namespaces (JSON for example), and so Daffodil will sometimes issue warnings when compiling a schema if the namespace usage will not allow unambiguous representation without namespaces. - -Most representations of DFDL Infosets will, like XML, use some representation of the namespaces of elements, and in textual forms this will most commonly be by way of namespace prefixes. -XML is not the only representation that uses namespaces, however, so this should not be taken as an entirely XML-specific discussion. - -There are these goals for namespace-binding minimization. - -. Clarity: Infosets that have redundant namespace bindings are very hard to read and understand, and require namespace-binding-aware tooling to compare them, or clumsy post-processing to remove the excess bindings. - -. Performance: Attaching an element to the infoset at runtime should take constant time. - -. Consistency: The prefix-to-namespace bindings used should be drawn from those expressed on the DFDL schema by the schema author, and the prefixes used and bindings introduced when an element is attached to the infoset should be consistent with the set of namespace prefix definitions in place at the point where the element's declaration lexically appears in the DFDL schema. - -These goals are in some tension. -Consider 4 elements named A, B, C, and Q. -Suppose element A contains element B, which contains element Q. -Suppose elsewhere in the same infoset element A contains element C which contains element Q. -From the perspective of element Q, the set of namespace bindings surrounding it are those from (A, B) or those from (A, C). -Suppose element Q requires, and introduces, a namespace with prefix "qns" bound to namespace "urn:Q_Namespace". -Suppose element C also introduces this same namespace binding. -Then when element Q appears inside element B, its namespace binding for "qns" is needed. -But when element Q appears inside element C, its namespace binding for "qns" is redundant with one already provided by element C. - -The conclusion is that the minmal set of namespace bindings introduced by an element depends on the nesting of elements. - -The basic technique is to store, on the runtime element data structure (DPathElementCompileInfo), the complete set of lexical namespace bindings present for the element declaration in the DFDL schema document. - -==== Namespace Bindings come from the Element Declarations, not Element References - -Consider two schema files: - -```xml -<!-- file foo.dfdl.xsd --> -<schema - xmlns:pre1="namespace1" - xmlns:ns1="differentNamespace"> - <import namespace="namespace1" schemaFileLocation="bar.dfdl.xsd"/> - ... - <element name="root"> - <complexType> - <sequence> - <element ref="pre1:foo" maxOccurs="unbounded"/> - </sequence> - </complexType> - </element> - -</schema> - -<!-- file bar.dfdl.xsd --> -<schema targetNamespace="namespace1" - xmlns:ns1="namespace1" - xmlns:pre1="someOtherNamespace"> - - <element name="foo" ..../> -</schema> -``` -In the above we have a conflict over the use of the prefix "pre1". -Now consider an XML document corresponding to this with element 'foo' inside the 'root' element: - -```xml -<root xmlns:pre1="namespace1" - xmlns:ns1="differentNamespace"> - ... - <ns1:foo - xmlns:ns1="namespace1" - xmlns:pre1="someOtherNamespace"> - ... - </ns1:foo> - ... -</root> -``` - -Notice that element 'foo' appears inside 'root' using the "ns1" prefix but it also introduces a binding for that prefix which supercedes that of the enclosing environment. -The prefix "pre1" cannot be used for element 'foo' because in the namespace bindings of the bar.dfdl.xsd schema document, the "pre1" prefix is bound to "someOtherNamespace". - -This example illustrates that each element must use, and introduce, the lexically defined prefixes from the point where the element is declared. -Not from the point of element reference. - -Since element 'foo' is recurring, it's unfortunate, but every single instance will, textualized, carry these namespace bindings. -E.g., - -```xml -<root xmlns:pre1="namespace1" - xmlns:ns1="differentNamespace"> - ... - <ns1:foo - xmlns:ns1="namespace1" - xmlns:pre1="someOtherNamespace"> - ... - </ns1:foo> - <ns1:foo - xmlns:ns1="namespace1" - xmlns:pre1="someOtherNamespace"> - ... - </ns1:foo> - <ns1:foo - xmlns:ns1="namespace1" - xmlns:pre1="someOtherNamespace"> - ... - </ns1:foo> - - ... -</root> -``` - -This problem is not one Daffodil strives to solve. -The schema author can simply avoid these sorts of name clashes and this problem will not occur. -Automatic renaming of prefixes to avoid this problem is considered unwarranted, as it will confuse users. - - - -=== Namespace Minimization - -==== Only Element Namespace Prefix Bindings - -Only namespace definitions associated with element declarations need to ever be considered for the infoset. -Namespace definitions that define prefixes used for type, group, format, or escapeScheme references are not included -in the namespace definitions carried on infoset elements. - -==== Avoid Prefix "tns" (or Other Common Ambiguous Names) When Possible - -Many DFDL schemas will define prefix "tns" to be that schema document's target namespace. - -This same problem could occur for other prefixes. The "tns" convention is just a common one. - -If the prefix "tns" is ambiguous across the schema set (also used by other schema documents, but for different namespaces), -then its use is undesirable. - -If a schema document defines both "tns" and other prefixes for the target namespace, then another prefix is preferred for -use as the prefix of elements created from declarations in that schema document. - -This situation arises commonly for the default namespace (no prefix, defined by `xmlns="namespaceURI"`). If -this is ambiguous across the schema set (highly likely), then an available alternative prefix (from that schema document) -is preferred. -There is actually no difference between using "tns" and the default namespace. Both are just commonly used, and frequently ambiguous across the schema set. - -(This all generalizes to any prefix which is ambiguous across the schema set.) - -==== Corner Cases - -There are numerous ways schema authors can use and reuse namespace prefixes that can lead to cluttered infosets. - -Other than minor heuristics to choose among alternative available prefix definitions, Daffodil does not try to improve on the -namespace prefix problem on behalf of schema authors. - -===== No Prefix At All -A legal schema document can define a target namespace and define no prefix for it at all. - -In this case, the only way elements of that schema document can be used is some other schema document must provide a prefix definition. -Daffodil chooses a prefix from those available in the schema set (deterministically - e.g., shortest prefix, with ties resolved by alphanumeric order, avoiding ambiguous prefixes like "tns"). - -CAUTION: TBD: Should we issue a warning or even make this a schema definition error? - -===== Only "tns" or Only the Default Namespace -A schema defines "tns" (or other very common prefix like "pre" or "p") for its target namespace, but defines no other prefix that can be used as an alternative. - -Daffodil does nothing here to improve on the situation where there will be many inner namespace re-bindings of the "tns" like: - -```xml -<tns:foo xmlns:tns="namespace1"> - <tns:bar xmlns:tns="namespace2"> - <tns:quux xmlns:tns="namespace3"> - ... - </tns:quux> - </tns:bar> -</tns:foo> -``` - -This sort of thing can happen if schema authors make extensive use of the default namespace (no prefix). -For example, a schema document can define a target namespace, then define that namespace to be the default, with no other namespace prefix defined. -In that case you can have infosets like this: - -```xml -<foo xmlns="namespace1"> - <bar xmlns="namespace2"> - <quux xmlns="namespace3"> - ... - </quux> - </bar> -</foo> -``` - -==== Undefining the Default Namespace - -Many schemas will not define the default namespace. - -If an element is defined in a schema which defines the default namespace to a URI, and nested with that element are other elements from schemas that -do NOT have a definition for the default namespace, then if there are unqualified names in the latter schema that are supposed to be in _no namespace_, -the default namespace must be explicitly undefined. - -Consider two schema files: - -```xml -<!-- file foo.dfdl.xsd --> -<xs:schema - xmlns="default1" - xmlns:ns2="namespace2" > - - <xs:import namespace="namespace2" schemaFileLocation="bar.dfdl.xsd"/> - ... - <xs:element name="root"> - <xs:complexType> - <xs:sequence> - <xs:element ref="ns2:foo" maxOccurs="unbounded"/> - </xs:sequence> - </xs:complexType> - </xs:element> - -</xs:schema> - -<!-- file bar.dfdl.xsd --> -<schema targetNamespace="namespace2" - xmlns:ns2="namespace2" - elementFormDefault="unqualified" - xmlns="http://www.w3.org/2001/XMLSchema"> <!-- default namespace used for schema --> - - <element name="foo"> - <complexType> - <sequence> - <element name="bar" .../><!-- no namespace --> - </sequence> - </complexType> - </element> -</schema> -``` -In this case, an instance of root, containing 'foo' containing 'bar' requires: -```xml -<root - xmlns="default1" - xmlns:ns2="namespace2"> - <ns2:foo> - <bar xmlns=""> <!-- undefine default --> - ... - </bar> - </ns2:foo> -</root> -``` -This undefine shown above is needed even though the bar.dfdl.xsd schem has a default namespace definition, because the local element names are in no-namespace. -The default namespace in bar.dfdl.xsd is actually not used for reference to elements. -So it is tantamount to not having the default namespace defined in bar.dfdl.xsd at all. - -CAUTION: Some of this minimization may happen upon conversion to XML, and may happen automatically depending on XML libraries. -That is, an element with no namespace displayed in a context which has a default namespace definition may automatically insert `xmlns=""`. - -== Namespace Binding Minimization Algorithm - -The technique described here assumes that one needs to render the infoset to XML text using standard printing, i.e., using no special XML library. -Hence, every namespace prefix binding must be explicitly represented in the output XML text. - -The basics are: - -. For every element declaration, capture the lexical namespace scope (`scala.xml.NamespaceBinding`) from its element declaration XML in its schema document. -Save this on the runtime data structure for the element. In Runtime 1, this would be the DPathElementCompileInfo. (This is longstanding functionality in Daffodil since before version 1.0.0) - -. Excepting on the Root, remove any namespace binding that is unambiguous across the schema, and which appears on the root. - -. For each element declaration, the remaining namespace bindings and assigned prefix to be used are assigned based on the minimization rules describe above (e.g., about avoiding "tns" when possible.) - -That is all that is done at schema compile time and at parse time up to the point where a textual representation (such as XML) needs to be output. - -The DFDL Infoset tree is constructed with InfosetElement nodes that point to this compile time DPathElementCompileInfo structure, and no processing of namespace bindings occurs. - -However when converting an infoset element into XML examine the namespace bindings of the element and those of the enclosing parent element. - -.. Any that are redundant across the two are dropped. -.. New definitions introduced by the child are output as bindings -.. Redefinitions are output as bindings -.. If the element has no namespace, and the parent (or any super-parent) has a default namespace binding, then add an undefine binding for the default namespace. - -This algorithm requires non-constant-time (worst case) processing at runtime; however, there is no overhead unless there are ambiguities among the namespace bindings and when namespace bindings at nodes beneath the root are required. -In addition, the number of such cases in any _real_ schema will be small, so the algorithmic complexity worst-case here is far less important than the constant factor here. -Attaching an element to the infoset is a common operation. -These namespace binding machinations have the potential to be equally costly, per binding, to the general overhead of attaching the infoset element node. - -Our standard design principle is, however, to not worry about overheads like this which are often not going to occur in real schemas, unless performance profiling shows them to be a hot-spot. - -Sensibly-designed schemas will have no overhead from this namespace-binding combining. - -=== Converting the DFDL Infoset to XML in One Pass (Streaming) - -Note that eliminating prefix definitions that are unused in a particular XML document is not compatible with streaming. -It requires two passes to determine if a prefix is ever used to decide whether it can be omitted or must be included. - -The only alternative to this is to introduce new namespace prefix definitions only at their point of use. -That would, however, be inconsistent with our goal of clarity and avoiding namespace prefix clutter in the schema. - -It is preferable to output extra namespace bindings on the root element than to litter the document with namespace bindings at -interior XML elements. - -Daffodil aspires to streaming parsing and unparsing. A streaming parser will output parts of the infoset without waiting to -know if children will eventually appear that require the namespace prefix definitions. -As a result, all namespace prefix definitions which _may_ be required are included. -Most commonly this will result in extra unused namespace prefix definitions having been output on the start tag of the root element. - -=== API XML-Fragment Mode - For Clarity: Avoid Namespace Bindings on the Root - -When using the message streaming API and converting the parse Infoset to XML, each message is created as XML text by the parser and associated InfosetOutputter, and converting one relatively small message to XML may result in far more characters used to represent the namespace bindings on the root of the message than the rest of the message occupies in XML text. - -The API provides a method to enable XML-fragment mode. -In this mode a method can be called to retrieve namespace prefix bindings that would appear on the root (i.e, on the root XML element of each message) if a complete XML data document were being created. -Subsequent calls to parse in XML-Fragment mode create XML which has no namespace bindings on the root element of each message. -This is an XML fragment because it lacks the namespace bindings needed for it to be a complete XML document. -This is in effect leaving it up to the caller whether and when to append the namespace bindings to the XML text. - -This option is provided because the caller may not want to construct complete XML documents, and the namespace bindings in use for the XML may be well-known by the processing system. - -This is less a performance optimization (XML text is really verbose, and this optimization is only scratching the surface). -This feature is about clarity and coping with XML when using it for small data documents corresponding to small communications messages. -Small XML data documents can be overwealmed by the volume of namespace definitions. -This is particularly likely if they are for small data messages created from large DFDL schemas with many schema documents and many namespaces. - -As an example consider: -```xml - -<gn:Feature xmlns:cc="http://creativecommons.org/ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:gn="http://www.geonames.org/ontology#" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:wgs84_pos="http://www.w3.org/2003/01/geo/wgs84_pos#"><rdfs:isDefinedBy>sws/3/about.rdf</rdfs:isDefinedBy><gn:name>Zamīn Sūkhteh</gn:name><gn:alternateName> [...] -``` -This is almost impossible to understand, given that the first 1/3 of it is just namespace bindings. - -Without the namespace bindings it is easier. It looks like: -```xml -<gn:Feature><rdfs:isDefinedBy>sws/3/about.rdf</rdfs:isDefinedBy><gn:name>Zamīn Sūkhteh</gn:name><gn:alternateName><lang>fa</lang><name>Zamīn Sūkhteh</name></gn:alternateName><gn:alternateName><lang>fa</lang><name>زمين سوخته</name></gn:alternateName><gn:featureClass>ontology#S</gn:featureClass><gn:featureCode>ontology#S.CRRL</gn:featurecode><gn:countryCode>IR</gn:countryCode><wgs84_pos:lat>32.45831</wgs84_pos:lat><wgs84_pos:long>48.96335</wgs84_pos:long><gn:parentFeature>sws/3202991/</gn: [...] -``` - -With line endings after each element end tag, it is quite easy to understand. -```xml -<gn:Feature><rdfs:isDefinedBy>sws/3/about.rdf</rdfs:isDefinedBy> -<gn:name>Zamīn Sūkhteh</gn:name> -<gn:alternateName><lang>fa</lang><name>Zamīn Sūkhteh</name></gn:alternateName> -<gn:alternateName><lang>fa</lang><name>زمين سوخته</name></gn:alternateName> -<gn:featureClass>ontology#S</gn:featureClass> -<gn:featureCode>ontology#S.CRRL</gn:featurecode> -<gn:countryCode>IR</gn:countryCode> -<wgs84_pos:lat>32.45831</wgs84_pos:lat> -<wgs84_pos:long>48.96335</wgs84_pos:long> -<gn:parentFeature>sws/3202991/</gn:parentFeature> -<gn:parentCountry>sws/130758/</gn:parentCountry> -<gn:parentADM1>sws/127082/</gn:parentADM1> -<gn:nearbyFeatures>sws/3/nearby.rdf</gn:nearbyFeatures> -<gn:locationMap>3/zamin-sukhteh.html</gn:locationMap> -</gn:Feature> -``` - -The XML Infoset Inputter also has a feature allowing an API method to supply the root-level namespace bindings once, not on the root element of every XML-fragment delivered for unparsing. - -The symmetry of the API insures that one can unparse the XML output from a parse that is creating XML in this fragment mode. - -The Daffodil CLI has an option to add XML-fragment mode to message streaming behavior for parsing and unparsing. - -== Transition Plan from Daffodil 2.5.0 - -(Delete this section once implementation is complete.) - -The existing design in Daffodil 2.5.0 assumes that every element declaration is unique, not shared, and so the entire path from that element's declaration object to the root element is well known and unique. - -=== QNames and Name Resolution - - -The QNameBase.scala and ResolvesQNames traits do not need modification, but their function needs to be understood to know what *does* have to change. - -Below shows the classes used for reprsenting the names of named things in a DFDL schema, and for referring to named things in a DFDL schema: - -[plantuml] -.... - -abstract class QNameBase { - "shared implementation" -} -abstract class NamedQName { -def matches(RefQNameBase) -} -abstract class RefQNameBase { -def matches(NamedQName) -} -class RefQName -class StepQName -class LocalDeclQName -class GlobalQName -NamedQName -up-|> QNameBase -RefQNameBase -up-|> QNameBase -LocalDeclQName -up-|> NamedQName -GlobalQName -up-|> NamedQName -RefQName -up-|> RefQNameBase -StepQName -up-|> RefQNameBase -RefQNameBase -right-> NamedQName : "ref" - - -.... - -We distinguish names given to objects by their definitions/declarations from names referring to those by way of the NamedQName and RefQName distinction. -A NamedQName is created for a named thing. A RefQName is created for points of reference to it. The two can never be confused because you can't create a RefQName from a declaration/definition (that would create a NamedQName), and you cannot create a NamedQName from the "ref" property (that will create a RefQName). Furthermore, you can't test two NamedQNames to see if they match, you have to test if a RefQName refers to a NamedQName. - -All these objects are created via methods on the singleton QName object. - -==== ResolvesQNames trait - -This trait provides the resolveQName(qnString: String) : RefQName method which calls the QName.resolveRef() method supplying the necessary namespace binding information from the XML of the schema component that is needed to resolve QName prefixes to specific namespaces. - -This is done by wau of the public namespaces member. The scala.xml.Node class has a scope member which returns type NamespaceBinding. While singular, this is not a single binding, but a chained data structure where a first namespace binding object is chained to a second and subsequent one. This not an ordinary Seq() style collection. Hence the ResolvesQNames trait provides member: - -```scala -val namespaces = xml.scope -``` - -=== ElementBase Changes - -The existing ElementBase class has members, all of which are subject to revision/removal. - -* Outputs -** thisElementsNamespace - target namespace of defining schema or no-namespace if a local element decl with elementFormDefault = "unqualified", or no target namespace. -** thisElementsNamespacePrefix - This is the prefix that will be used for this element when converting to XML. This will change to be different from the provided namespace prefix only if we choose a non-"tns" equivalent, or generate a namespace prefix for particular cases. -** thisElementsRequiredNamespaceBindings - This will be removed. This can't be statically computed like this anymore. -** protected minimizedScope - This will be removed or made optional. Ths can't always be statically computed like this anymore. This will be computed instead at runtime. - -* Minimization algorithm machinery - these members are likely no longer needed. What they were computing statically and probably very inefficiently needs to be done at runtime (sometimes), and far more efficiently. -** private emptyNSPairs -** private myOwnNSPairs -** private myParentNSPairs -** private myUniquePairs -** private def pairsToNSBinding -** private parentMinimizedScope - -=== Runtime 1 changes - -* DPathElementCompileInfo -** Cleanup: `val name: String` should be removed as a constructor argument and be obtained from the namedQName via `def name = namedQName.local` - -* ElementRuntimeData -** namespaces - should stay the same. -** targetNamespace - should stay the same -** targetNamespacePrefix - computation of this may/will change in case of, e.g., choosing one that is not "tns" or other ambiguous prefix. -** thisElementsNamespacePrefix - computation of this may/will change similarly to avoid "tns" or other undesriable/ambiguous prefix. -** minimizedScope - likely changes per the "namespace binding minimization algorithm" above. Should be renamed to reflect proper usage as a statically computed part of namespaces that must be incorporated at runtime when an infoset element is attached to a parent. The new name might be "nonRootMovedScopes" or soemthing to reflect that these are bindings which could not be statically just be relocated onto the root element. This will be a primary input to the new runtime algorithm that a [...] diff --git a/site/dev/design-notes/term-sharing-in-schema-compiler.adoc b/site/dev/design-notes/term-sharing-in-schema-compiler.adoc deleted file mode 100644 index a8d54a5..0000000 --- a/site/dev/design-notes/term-sharing-in-schema-compiler.adoc +++ /dev/null @@ -1,669 +0,0 @@ -:page-layout: page -:keywords: schema-compiler performance alignment optimization -// /////////////////////////////////////////////////////////////////////////// -// -// This file is written in AsciiDoc. -// -// If you can read this comment, your browser is not rendering asciidoc automatically. -// -// You need to install the asciidoc plugin to Chrome or Firefox -// so that this page will be properly rendered for your viewing pleasure. -// -// You can get the plugins by searching the web for 'asciidoc plugin' -// -// You will want to change plugin settings to enable diagrams (they're off by default.) -// -// You need to view this page with Chrome or Firefox. -// -// /////////////////////////////////////////////////////////////////////////// -// -// When editing, please start each sentence on a new line. -// See https://asciidoctor.org/docs/asciidoc-recommended-practices/#one-sentence-per-line[one sentence-per-line writing technique.] -// It is acceptable to start each sentence on a new line, but then wrap the lines to a reasonable length. -// It's the starting on a new line that matters. -// -// This makes textual diffs of this file useful in a similar way to the way they work for code. -// -// ////////////////////////////////////////////////////////////////////////// - -== Term Sharing in the Schema Compiler - -=== Introduction - -The DFDL language has a composition property known as _referential transparency_. -It is not supposed to matter whether you lift a part of the schema out -and create a reusable group, reusable type, or reusable element from -it. - -Because DFDL (version 1.0) does not allow recursive definitions of any -kind, it is theoretically possible to implement this by treating all -reusable definitions in the language like macros. -I.e., inline-expand all definitions at their point of use. -This will work for schemas up to some size. However, it leads to an -unacceptable expansion in schema compilation time and space, as the -size of the schema once all these inline substitutions have been -performed can be exponentially larger than the original -non-substituted schema. - -To avoid this undesirable combinatorial explosion, the Daffodil schema -compiler arranges to achieve the same macro-like semantics, while -still sharing the core parts of the reusable components so they need -only be compiled once for each unique way the component is used. - -This is also part of eventually enabling an extension of the DFDL -language to allow recursive definitions. - -The dfdl:alignment property is one of the largest contributors to -complexity in sharing objects by the schema compiler. - -Alignment will be used as the example in this design note to -illustrate the schema compiler behavior. - -=== DSOM Term Objects - -DSOM is the Daffodil Schema Object Model. -Within this model, the components from a DFDL schema that actually have -representations in the data stream or are computed, are called _terms_ and DSOM models them as -subclasses of the class/trait Term. -Terms which are computed (via dfdl:inputValueCalc) are said to have _no representation_ in the data stream. - -For this discussion we are concerned only with _concrete_ terms, meaning those that do have representation. - -The classes in Daffodil that are used for concrete terms are: - -* ElementRef -* Root (A degenerate element ref to the root element) -* LocalElementDecl -* the quasi-elements: - -** PrefixLengthQuasiElementDecl (Fake local element decl used for the - length fields of Prefixed-length types) - -** RepTypeQuasiElementDecl (Fake local element decl used as the - representation of elements that are computed via the dfdl:repType - mechanism.) - -* Sequence -* SequenceGroupRef - -* ChoiceBranchImpliedSequence (The sequence that is inserted when a - choice branch is a single element decl/ref) - -* Choice -* ChoiceGroupRef - -A https://cwiki.apache.org/confluence/display/DAFFODIL/DFDL+Schema+Object+Model+%28DSOM%29+with+UML[UML Diagram] is available showing these classes. - -=== Term Representation Regions - -Consider the representation of a term, which we call the term's -_region_, as appearing between two other term regions in the data -stream as illustrated below: - -[ditaa] -.... ----+ +-----------------------------------------------+ +--- - | | term | | - | | | | - | | +---------------+ +--------+ +--------------+ | | -...| | | unique before | | shared | | shared after | | |... - | | +---------------+ +--------+ +--------------+ | | - --+ +-----------------------------------------------+ +-- -.... -In the above, lower position bits are to the left. -Higher position bits to the right. -In the diagram, we see that the term's region consists of 3 -sub-regions, named _unique before_, _shared_, and _shared after_. Each -term appears in a context of possible additional terms before it and -after it which are shown with ellipsis above. - -A term can be an element or a model-group (sequence or choice). -By far the most common situation is that a term appears inside a model group. -There are two exceptions. The _root_, and the model-group of a complex type. - -The core idea here is that we are separating the representation of a -term into unique and shared parts. -The unique region is affected by the surrounding model group context. -The shared regions are, in many cases, sharable across instances of -this term and is independent of the surrounding model-group context. -So if the term was defined by way of a reusable global definition, -then the goal is that there need be only one implementation of the -shared part(s). - -There are some limits to this sharing. -A single implementation is not always achievable, but generally one or -a small number of implementations are possible. - -=== Limits to Sharing - -Consider if this term above was defined by way of a XSD group -reference. -The DFDL properties expressed on the group reference must -be combined with those expressed on the group definition, to form the -complete set of properties associated with the term. -This combining creates situations where a given shared group definition can have -wildly different representations for different uses. -Consider this: - -```xml -<group name="g"> - <sequence> - <element name="a" type="xs:int"/> - <element name="b" type="xs:int"/> - <element name="c" type="xs:int"/> - </sequence> -</group> - -.... - -<element name="e"> - <complexType> - <sequence> - ... some stuff here ... - <group ref="tns:g" dfdl:separator="|"/> <!-- separated --> - ... more stuff here ... - <group ref="tns:g"/> <!-- not separated --> - ... yet more stuff here ... - </sequence> - </complexType> -</element> -``` - -The two uses of the group definition for 'g' are wildly different, in -that one has separators, the other does not. - -This is a rather extreme example, but DFDL implementations must allow -for this. - -Experience with DFDL schemas to-date (2020-01-21) is that this is very -rare and appears only in test cases designed to exercise it. -However, less extreme versions of this are possible, such as different -uses all being separated and delimited textual data, but where the -distinct uses do not all use the same separator characters, so the -separator to be used would be specified differently on each group -reference. Another expected variation could be separated sequence -groups where some uses are infix separator and others are prefix or -postfix separator. - -Anecdotally, the vast majority of schemas seen to-date reuse groups -entirely, that is specifying no properties at all on the group -reference. - -==== Property Environment or PropEnv - -We define the _property environment_ or _PropEnv_ of a term, as the -complete set of properties for that term, combining them from all the -schema components that define the term. This can include element -references, group references, local or global element declarations, -global group definitions, and local and global type definitions. - -A key observation is that the PropEnv of a term is entirely defined by: - -* local properties (including any dfdl:ref property) on the schema component for the term. -** E.g., for an element reference, the local properties expressed on the element reference itself. -* default format object at top level of the schema where the term lexically appears. -* definition object being referenced. -** E.g., for an element reference, the global element declaration being referenced. - -So if two terms, say group references, have the same PropEnv, then we -can share much about their definitions when implementing them. - -Hence, when constructing the Daffodil schema compiler's representation -for a term, we can maintain a cache for each global definition, with -the key of the cache being the PropEnv. -When a given term uses a global definition, if the point of use has -the same PropEnv as another, both can share the part of the term that -is context independent, that is, the shared region. - -The actual components of the PropEnv of a term are slighly more -complicated than described above. -The table below gives the components of the PropEnv for the various -kinds of terms. Note that the "local properties" below includes any -dfdl:ref property if it appears, but equality of any property which -has as its value a QName, like dfdl:ref, the property value is based -on a resolved QName, not the QName string. - -[cols="2,6a"] -.PropEnv Components -|=== -|Term Definition (one PropEnv subtype per) | PropEnv Components - -|Local Element Decl with type reference -| -* Local properties expressed on the Local Element Decl (Set of property name + value pairs) -* Lexically enclosing default format (object ref) -* Type definition (object ref) or primitive type (object ref) -| Local Element Decl with anonymous Simple Type Definition -| -* Combined set of local properties expressed on the Local Element Decl and the anoymous Simple Type definition. -* Lexically enclosing default format (object ref) -* Base simple type definition (object ref), or primitive type (object ref) -|Local Element Decl with anonymous complex type definition -| -* Local properties expressed on the local element decl -* Lexically enclosing default format (object ref) -* Anonymous complex type (object ref) -|Element Reference to Global Element Decl -| -* Local properties expressed on the element reference -* Lexically enclosing default format (object ref) -* Global Element Decl (object ref) -|=== - -Lookups by PropEnv compare sets by equality of members, and object -references by pointer equality i.e, eq comparison. - -This definition of PropEnv can be improved by memoizing the -construction of default format objects, so that equivalent default -formats from different schema documents are represented by the same -object. -That is, so that multiple schema documents each containing: -```xml - <dfdl:format ref="prefix:formatName"/> -``` -are instantiated as the same object when the QName resolves to the same format object. - -==== Details of Unique and Shared Regions - -The division of the reprsentation into the unique part which appears -before the shared parts indicates where we can share implementation, -and where we must have unique context-specific implementation for the -term. - -To understand this better, the below breaks down the sub-regions of a term further. -First we look at the details of the unique-before region: - -// -// DITAA max line length (for github) ruler -------------------------------------------------------------------------| -// - -[ditaa] -.... - +--+ +-----------------------------------------------+ +--+ - | | term | | - | | | | - | | +---------------+ +--------+ +--------------+ | | - ...| | | unique before | | shared | | shared after | | |... - | | +---------------+ +--------+ +--------------+ | | - +-+ +-----------------------------------------------+ +-+ - | | - | | -+---------------------------+ +----------------------------------+ -| | -v v -+------------------------------------------------------------------------------+ -| +-------------------+ unique before | -| | sequence before | | -| | +------+ +------+ | +-----+ +---------+ +------+ +------+ +------+ +-----+ | -| | |prefix| |prefix| | |lSkip| |alignFill| |init'r| |init'r| |prefix| +value| | -| | |infix | |infix | | | | | | | MTA | | | |length| | MTA | | -| | |sep'r | |sep'r | | | | | | | | | | | | | | | -| | | MTA | | | | | | | | | | | | | | | | | -| | +------+ +------+ | +-----+ +---------+ +------+ +------+ +------+ +-----+ | -| +-------------------+ | -+------------------------------------------------------------------------------+ - -.... -The sub-regions within the unique before region are: - -* prefix infix sep'r MTA - mandatory text alignment for a separator in infix or prefix position. -When a separator is defined, this region is populated with up to 7 bits to -obtain text alignment for the characters of the separator in the specified charset encoding. -Note that this encoding and that of the term's initator/terminator are not necessarily the same encoding. -* prefix infix sep'r - separator when defined and in prefix or infix position. -This consists of text characters. - -* lSkip - leading skip region. -This is fixed length (commonly 0) -* alignFill - alignment fill region. -This dynamically sized region depends on the bit position where it starts. -* init'r MTA - mandatory text alignment for initiator. -When an initiator is defined, this region is populated with up to 7 bits to -obtain text alignment for the characters of the initiator in specified charset encoding. -* init'r - initiator. -This consists of text characters. -* prefix length - (element terms only) the prefix length. -Sub-detail of this region is not shown here. -The PrefixLengthQuasiElementDecl class is used to represent this subregion in DSOM. -* value MTA - (element terms only, simple type only) mandatory text alignment for the value. -When the value has text representation this insures those characters start -on the right bit-boundary for characters of the specified charset encoding. - -The sizes of the alignment regions (alignFill, and the MTA regions) in -the above all depend on where the start of this term is. -That depends on where the prior term ends, if there is a prior term. -That is, the sizes of the alignment regions depend on the enclosing -model groups, and what can appear before this term in that nest. - -Now we look at an expansion of the shared regions: - -[ditaa] -.... - - - +--+ +-----------------------------------------------+ +--+ - | | term | | - | | | | - | | +---------------+ +--------+ +--------------+ | | - ...| | | unique before | | shared | | shared after | | |... - | | +---------------+ +--------+ +--------------+ | | - +-+ +-----------------------------------------------+ +-+ - | | - | | - +-------------------------------+ +-----------------+ - | | - | | - v v - +---------------------+ +---------------------------------------------------+ - |shared | | shared after +---------------------+ | - | | | | sequence after | | - | +--------+ +------+ | | +------+ +------+ +-----+ | +-------+ +-------+ | | - | | content| |elt or| | | |term'r| |term'r| |tSkip| | |postfix| |postfix| | | - | | or | |choice| | | | MTA | | | | | | | sep'r | | sep'r | | | - | | value | |unused| | | | | | | | | | | MTA | | | | | - | | | | | | | | | | | | | | | | | | | | - | +--------+ +------+ | | +------+ +------+ +-----+ | +-------+ +-------+ | | - | | | +---------------------+ | - +---------------------+ +---------------------------------------------------+ - - ^ - | - | - + - Known Alignment Point - -.... - -It is key that the shared region can assume the alignment -fill and MTA regions have been sized per the requirements of this -term. -Hence, the shared region is context independent and its implementation - in schema -compiler objects and in runtime structure representation, need only be represented once. - -.Corner Case: Interior Alignment -[IMPORTANT] --- -The semantics here aren't identical to those of macro expansion, but the difference is hard to -detect. - -The https://opendfdl.github.io/gwdrp-dfdl-v1.0.5_r11/gwdrp-dfdl-v1.0.5-r11-change-tracking.htm[DFDL Spec (recent draft)] -// -// IN URL below, had to escape the underscore with a \ which is not part of the actual URL. -// -includes a link:https://opendfdl.github.io/gwdrp-dfdl-v1.0.5_r11/gwdrp-dfdl-v1.0.5-r11-change-tracking.htm#\_Toc27061169[section (23.6 in the linked draft)] on -this -// Hmmm. just writing _interior alignment problem_ didn't work. This passthrough to HTML will work for sure. -pass:[<i>interior alignment problem</i>]. - -The technique for approximating alignment info described here is slighly less precise in its -analysis because it resets its knowledge of the alignment at each known alignment point. -Whereas pure macro expansion could carry forward knowledge of alignment and perhaps optimize out more -alignment fill regions. - -Hence, it's possible that the technique here will leave some alignment fill regions in place and thereby some -formats will experience circular deadlocks when unparsing due to this variable-length interior-alignment. - -The Daffodil test suite includes examples of this interior alignment that cause unparser circular deadlocks. -(see testOutputValueCalcVariableLengthThenAlignmentDeadlock) - --- - -The shared region contains a text or simple binary value, a nil representation, or inductively, a sequence of terms. -When the term is an element of complex type, an element unused region can follow (depending on the length kind), -and if the term is a choice with dfdl:choiceLengthKind='explicit' then a choice unused region can follow. - -Then, after the shared part, the shared-after region has these subregions: - -* term'r MTA - mandatory text alignment for the terminator. -* term'r - terminator -* tSkip - trailing skip region. This is fixed length (commonly 0) -* postfix sep'r MTA - - mandatory text alignment for a separator in postfix position. -When a separator is defined, this region is populated with up to 7 bits to -obtain text alignment for the characters of the separator in the specified charset encoding. -Note that this encoding and that of the term's initator/terminator are not necessarily the same encoding. -* postfix sep'r - separator when defined and in postfix position. - -The terminator MTA region varies in size based on its starting -alignment, and so it depends on the size of the shared region. - -The postfix separator MTA region similarly varies in size based on where the -tSkip region ended, and so it depends on the size of the shared region, and -the sizes of the terminator MTA region, terminator, and tSkip regions. - -NOTE: The shared-after region does not have any sub-regions the size -of which depend on the context. -However, we separate it from the central shared region in order to -separate all concerns around alignment and fill into separate regions from the -central shared region, which could perhaps be called the shared -content region. -The central shared region therefore does not deal with alignment at all. - -The central idea here is that all the regions to the left (before) the known alignment point are context dependent. -That is, whether these alignment fill and mandatory text alignment regions can be statically determined -in size is dependent on where the prior term ended. - -In contrast, the known alignment point is a fresh start for alignment. The alignment as required by the alignment properties -will have been achieved at that point. Hence, from there and to the right (after), everything can be assessed relative to -this new fresh anchor. - -Inductively, even the context dependent regions are not dependent on terms very far back. -Let's consider the regions in a straddle of two adjacent terms in a sequence: - -// -// DITAA max line length (for github) ruler -------------------------------------------------------------------------| -// -[ditaa] -.... -+----------------------------------------------+ +------------------------------------------------ - Term 1 | | Term 2 - +-------------------------+ | | +-------------------------------------+ - +---+ +---------+ |shared after | | | |unique before | +---+ - | | shared | | | | | | | |shared - | | | | +---------+ | | | | +---------+ | | - | | +-+ +-+ | | +-+ +-+ +-+ |sequence | | | | | |sequence | +-+ +-+ +-+ +-+ +-+ +-+ | | - | | | | | | | | | | | | | | | after | | | | | | before | | | | | | | | | | | | | | | - | | | | | | | | | | | | | | | +-+ +-+ | | | | | | +-+ +-+ | | | | | | | | | | | | | | | - | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - ... | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ... - | | +-+ +-+ | | +-+ +-+ +-+ | | | | | | | | | | | | | | | | +-+ +-+ +-+ +-+ +-+ +-+ | | - | | | | | +-+ +-+ | | | | | | +-+ +-+ | | | - +---+ +---------+ | +---------+ | | | | +---------+ | +---+ - ^ +-------------------------+ | | +-------------------------------------+ ^ - | | | | -+----------------------------------------------+ +------------------------------------------------- - | | - | | - | | - | | - + + - Known Alignment Point 1 Known Alignment Point 2 - -.... - -The Known Alignment Point 2 is only context dependent on the possible prior terms back to Known Alignment Point 1. -This is true certaintly when Term 1 is a scalar element or a model-group itself. - -In reality the induction here is more complex because Term 1 might be an optional element or an array, in which case the prior term before Term 1 may also be involved -in computing the region sizes for the Term 2 unique-before regions. -But this never needs to consider anything further back than the prior scalar term, or the start of the shared region of the enclosing model group, which need -only be considered if Term 2 can be first within its enclosing model group. - -CAUTION: Separators may be present even if terms are not, based on dfdl:separatorSuppressionPolicy and a few other properties. -The diagrams show sequence before region as if it was part of the term, but really it may appear even if the term is absent. - - - - - -Given the above, for any term, we can compute its PropEnv, and create -a distinct shared instance for that term for each unique PropEnv. - -Since the shared region contains all the children of any -child-containing term, sharing this insures that the schema -compilation space/speed is proportional to, roughly, the size of the -schema without any non-constant multiplicative factor. -This insures no combinatorial explosion of compilation time. - -=== Optimizing Alignment Regions - -A primary task of the schema compiler is to eliminate alignment fill -(and mandatory text alignment aka MTA) regions that are unnecessary -because the data can be proven to always be properly aligned. - -This is done by computing, for each region, a compile-time alignment approximation. -The AlignmentMultipleOf class represents this information. -AlignmentMultipleOf objects form a mathematical lattice where perfect -alignment is the bottom of the lattice, no alignment (or alignment -multiple of 1 bit) is the top of the lattice, and various multiples -(8, 32, etc.) live in between. For two lattice values a and b, we say -a < b (a is 'weaker than' b) if a is a multiple of b. - -A key observation is this: Each time we create an alignment constraint -for a shared region, this is not dependent on any other constraint. -That is, the constraints on alignment do not propagate from term to term. -A shared region can assume it starts at the alignment that is required -for the term based on the dfdl:alignment property, the initiator MTA -if there is an initiator, and the value MTA for text-representation -simple values. - -Hence, terms that appear down-stream of any shared region do *not* -depend on any constraints propagating from before the shared region. -This greatly reduces the complexity and the distance across the schema -that constraint-changes propagate. -It also insures there is no need for an iterative contraint solver, since nothing -about alignment propagates from one term to the next. -Terms are separated by the fact that the shared region of a term starts from a -known alignment (specified by its alignment and alignment units -property). - - -=== Other Context-Dependent Computations - -By dividing up the representation of a term into the unique context-dependent -and shared context-indepenent parts explicitly we provide a home for the various -calculations the schema compiler performs in order to do other optimizations. - - -[plantuml, target="unshared-shared", format="png"] -.... -hide empty members - -class UnsharedTerm { - - isAlignmentFillNeeded: Boolean - isInitiatorMTANeeded: Boolean - isValueMTANeeded:Boolean - - isPrefixInfixSeparatorMTANeeded: Boolean - - isPotentiallyTrailing: Boolean - - needsBitOrderChange: Boolean - - isScanningForDelimiter: Boolean - isScanningForTerminator: Boolean - - initiatedContentCheck: Unit - -} - -class SharedTerm { - isTerminatorMTANeeded: Boolean - isPostfixSeparatorMTANeeded: Boolean - // along with everything else - -} - -UnsharedTerm -right-> SharedTerm : shared - -.... - -In the above class diagram, members that must be computed on the unshared term object are -shown in the class. -This set of things should always be minimized. -That is, as few things should be computed on unshared term objects as posible. -It is assumed that everything else is computed on the shared term object, but members -specifically about regions discussed in sections above are called out. - -The UnsharedTerm object is effectively state of the enclosing model group DSOM object. -It exists in 1 to 1 correspondence (aka it is "owned" by the enclosing model group object.) - -In the Runtime 1 backend, a processor (parser/unparser) is generated for the shared term object, -and that processor is referenced from the processor generated for the unshared term object. - -== Transition Plan - -IMPORTANT: Once implemented this section can be deleted as it is describing transition from an older Daffodil version to one that implements the shared objects. - -We can examine several of the context-dependent calculations below. - -==== isAlignmentFillNeeded, isInitiatorMTANeeded, isTerminatorMTANeeded, isPrefixInfixSeparatorMTANeeded, isPostfixSeparatorMTANeeded - -These are the output of the analysis of alignments, and they turn on/off the presence of these regions. -These are of particular importance to a streaming unparser since they can require suspension of unparsing until alignment can be determined at runtime. - -==== isPotentiallyTrailing - -This is context dependent because a given term definition can be reused in multiple contexts, some of which where it is potentially trailing, and others of which where it is not. - -==== needsBitOrderChange - -This is used to optimize out suspensions in the unparser. It is context dependent because whether the bit order is changing when the processor arrives on a term depends on what preceded it in the processing. - -==== isScanningForDelimiter - -This is used to determine whether a delimiter stack combinator is used as part of the grammar. It is context dependent because a separator or terminator can be defined on an enclosing model group. - -TBD: It's possible we don't need to wrap this combinator around elements. Manipulating the delimiter stack should only be necessary when delimiters go into and out of scope. So that would mean once per model group, not per term within the model group. - -==== isScanningForTerminator - -This determines whether the schema compiler will allow the entity "%ES;" as a runtime value for the expression value of a dfdl:terminator property. -Specifically if we are scanning for a terminator then the "%ES;" entity is disallowed. - -This is context dependent because the terminator may be defined on an enclosing model group, not on an element that is within that model group. -The element may have a position in the model group such that it may be last, and in that case if it has a length kind that does not use the terminator (e.g., dfdl:lengthKind='pattern') then that terminator is not being scanned for, and so it is allowable for it to be empty string at runtime. - -If this computation is done on the unshared term, it must inquire about many characteristics of the enclosing model group. -It may be that this computation is best done on the model-group object. - -==== initiatedContent Check - -When a model group has the dfdl:initiatedContent="yes" property, all represented term children must have a defined initiator. - -This is context dependent because a global element declaration or a global group definition may define an initiator, but the model group with initiated content may refer to these via an element reference or group reference. -That is, the relationship is not one of lexical enclosure. -The check must be done as part of the unshared term, or on the enclosing model group itself. - -=== Refactoring of Primary DSOM Classes - -Implementing this sharing requires splitting the DSOM Term clases each into a unique and a shared part. - -This will be done by renaming each Term to SharedTerm, and introducing UniqueTerm. - -For example ElementBase will be renamed to SharedElementBase, and new abstract class UniqueElementBase will be created. - -This will occur uniformly across the DSOM Term subclasses/traits. - -Some of the mixins to these classes will go on only one or the other of the shared or unique parts. - -Many of the mixins will have to split into a mixin for the shared part and another mixin for the unique part. -One example of this is the AlignedMixin. This currently deals with alignment regions that are found in unique parts and shared parts, and those -methods must go on different objects. - -The unique term classes will utilize a central memoizing factory to allocate or find shared term instances based on having equivalent PropEnvs. - - - - -=== Testing Notes - - -==== Compiler Instrumentation - -Some standard compiler instrumentation that is dumped out based on options to daffodil compiler, to include: - -* counters of number of distinct shared terms for each global def/decl. -* counters of number of inbound references to each shared term. - -The point is to have metrics that can be compared and which will make it clear if under maintenance the sharing is somehow lost and we're getting bad compile-time behavior again.
