This is an automated email from the ASF dual-hosted git repository.
mbeckerle pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-daffodil-site.git
The following commit(s) were added to refs/heads/master by this push:
new ed8a08c New asciidoc design notes on schema compiler space/speed
issue.
ed8a08c is described below
commit ed8a08ca591896c4bfe3c978b083993736215729
Author: Michael Beckerle <[email protected]>
AuthorDate: Tue Jan 21 16:07:39 2020 -0500
New asciidoc design notes on schema compiler space/speed issue.
There are 3 new notes:
* term-sharing-in-the-schema-compiler.adoc
* hidden-groups.adoc
* namespace-binding-minimization.adoc
They can be read on the Daffodil web site, or you can check-out the
daffodil-site repository and view them directly. There is a PR for review
comments on these asciidoc texts.
https://github.com/apache/incubator-daffodil-site/pull/17
These go along with a wiki document:
https://cwiki.apache.org/confluence/display/DAFFODIL/Analysis%3A+Schema+Compiler+Backpointers+and+Copying
Changes to the aboutAsciidoc.adoc file are just about the approach of using
asciidoc
for this sort of design note.
Based on some preliminary reviews, added transition sections for
isHidden and namespace bindings notes.
DAFFODIL-1444
---
site/dev/aboutAsciiDoc.adoc | 89 ++-
site/dev/design-notes/hidden-groups.adoc | 95 +++
.../namespace-binding-minimization.adoc | 464 ++++++++++++++
.../term-sharing-in-schema-compiler.adoc | 669 +++++++++++++++++++++
4 files changed, 1270 insertions(+), 47 deletions(-)
diff --git a/site/dev/aboutAsciiDoc.adoc b/site/dev/aboutAsciiDoc.adoc
index 971a136..7ec9593 100644
--- a/site/dev/aboutAsciiDoc.adoc
+++ b/site/dev/aboutAsciiDoc.adoc
@@ -69,23 +69,9 @@
image::http://daffodil.apache.org/assets/themes/apache/img/apache-daffodil-logo.
== Example Diagrams
There are numerous kinds of diagrams possible.
-== GraphViz Record-Based Nodes Diagram
-https://graphviz.gitlab.io/documentation/[GraphViz] is probably the most
powerful of the various diagram tools, but with that power comes complexity
that some of the other diagram types are able to overcome by being more
restrictive.
-
-This is an example of graphViz
https://graphviz.gitlab.io/_pages/doc/info/shapes.html#record[Record-based
Nodes] which are very useful for box diagrams showing data layouts when a
packetdiag is too rigid.
-[graphviz]
-....
-digraph structs {
- node [shape=record];
- struct1 [label="<f0> left|<f1> mid\ dle|<f2> right"];
- struct2 [label="<f0> one|<f1> two"];
- struct3 [label="hello\nworld |{ b |{c|<here> d|e}| f}| g | h"];
- struct1:f1 -> struct2:f0;
- struct1:f2 -> struct3:here;
-}
-....
=== PlantUML Diagram
+
http://plantuml.com/[PlantUML] is a component that allows to quickly write:
* Sequence diagram
@@ -110,6 +96,10 @@ The following non-UML diagrams are also supported:
* Mathematic with AsciiMath or JLaTeXMath notation
* Entity Relationship diagram
+An www.plantuml.com/plantuml/uml[online PlantUML authoring tool] is available.
+Though just saving and refreshing the page in a web browser (usually they
auto-refresh
+after 1 second) is also pretty easy.
+
Here is a class diagram example:
[plantuml, target="diagram-classes", format="png"]
@@ -142,39 +132,10 @@ Foo1 -> Foo5 : To database
Foo1 -> Foo6 : To collections
....
-=== Packet Diagram - Easier but Limited to MSBF, Left-to-Right
-http://blockdiag.com/en/nwdiag/packetdiag-examples.html[Packet diagrams] are
useful for any time you want to show data layouts at the level of bits, bytes,
and words. It assumes bit order is _most significant bit first_ and that you
want to number the bits from left to right.
-
-This example supplies a target name for the graphic, which allows it to be
reused not just in this document, but in others also. We specify SVG as the
format since there appears to be bugs in PNG format for packetdiag.
-[packetdiag, target="test_packet_diagram_1", format="svg"]
-....
-packetdiag {
- colwidth = 32
- node_height = 24
- 0-15: Source Port
- 16-31: Destination Port
- 32-63: Sequence Number
- 64-95: Acknowledgment Number
- 96-99: Data Offset
- 100-105: Reserved
- 106: URG [rotate = 270]
- 107: ACK [rotate = 270]
- 108: PSH [rotate = 270]
- 109: RST [rotate = 270]
- 110: SYN [rotate = 270]
- 111: FIN [rotate = 270]
- 112-127: Window
- 128-143: Checksum
- 144-159: Urgent Pointer
- 160-191: (Options and Padding)
- 192-223: data [colheight = 2]
-}
-....
-
-
=== Ditaa Diagram
-The http://ditaa.sourceforge.net/[DITAA] graphics are ASCII-Art converted into
smoother looking drawings. They have the advantage of being visual in the text
file, but the challenge of needing to be
-laid out by hand.
+The http://ditaa.sourceforge.net/[DITAA] graphics are ASCII-Art converted into
smoother looking drawings. They have the advantage of being visual in the text
file.
+
+These diagrams are quite easily drawn then cut/pasted to/from actual asciidoc
digrams using an online tool called http://asciiflow.com/[asciiflow].
[ditaa]
....
@@ -196,6 +157,39 @@ laid out by hand.
+-----------------------------------+
....
+Note: below is the widest ditaa diagram you can draw that will fit in the
line-length that
+github's code-review/pull-request window can display without line-wrap.
+If the lines wrap in your ditaa diagram you really lose the ability to comment
on them
+in their textual form in the asciidoc:
+
+[ditaa]
+....
++--------------------------------------------------------------------------------------------------------------------+
+| make ditaa diagrams no wider than this box
|
++--------------------------------------------------------------------------------------------------------------------+
+....
+
+// Here's that as a comment you can cut/paste into an asciidoc file
+//
+// DITAA max line length (for github) ruler
-------------------------------------------------------------------------|
+//
+
+== GraphViz Record-Based Nodes Diagram
+https://graphviz.gitlab.io/documentation/[GraphViz] is probably the most
powerful of the various diagram tools, but with that power comes complexity
that some of the other diagram types are able to overcome by being more
restrictive.
+
+This is an example of graphViz
https://graphviz.gitlab.io/_pages/doc/info/shapes.html#record[Record-based
Nodes] which are very useful for box diagrams showing data layouts when a
packetdiag is too rigid.
+[graphviz]
+....
+digraph structs {
+ node [shape=record];
+ struct1 [label="<f0> left|<f1> mid\ dle|<f2> right"];
+ struct2 [label="<f0> one|<f1> two"];
+ struct3 [label="hello\nworld |{ b |{c|<here> d|e}| f}| g | h"];
+ struct1:f1 -> struct2:f0;
+ struct1:f2 -> struct3:here;
+}
+....
+
== Example Block Diagram
The http://blockdiag.com/en/blockdiag/[blockdiag] diagram type is for basic
box/arrow diagrams.
@@ -212,6 +206,7 @@ blockdiag {
"very easy!" [color = "orange"];
}
....
+
=== Sequence Diagram
The http://blockdiag.com/en/seqdiag/[seqdiag] diagram type creates sequence
diagrams.
diff --git a/site/dev/design-notes/hidden-groups.adoc
b/site/dev/design-notes/hidden-groups.adoc
new file mode 100644
index 0000000..e8f03d4
--- /dev/null
+++ b/site/dev/design-notes/hidden-groups.adoc
@@ -0,0 +1,95 @@
+:page-layout: page
+:keywords: schema-compiler performance alignment optimization
+// ///////////////////////////////////////////////////////////////////////////
+//
+// This file is written in AsciiDoc.
+//
+// If you can read this comment, your browser is not rendering asciidoc
automatically.
+//
+// You need to install the asciidoc plugin to Chrome or Firefox
+// so that this page will be properly rendered for your viewing pleasure.
+//
+// You can get the plugins by searching the web for 'asciidoc plugin'
+//
+// You will want to change plugin settings to enable diagrams (they're off by
default.)
+//
+// You need to view this page with Chrome or Firefox.
+//
+// ///////////////////////////////////////////////////////////////////////////
+//
+// When editing, please start each sentence on a new line.
+// See
https://asciidoctor.org/docs/asciidoc-recommended-practices/#one-sentence-per-line[one
sentence-per-line writing technique.]
+// This makes textual diffs of this file useful in a similar way to the way
they work for code.
+//
+// //////////////////////////////////////////////////////////////////////////
+
+== Hidden Groups
+
+=== Introduction
+
+The DFDL language enables groups to be hidden, meaning not part of the logical
infoset.
+
+This is achieved by way of _hidden group references_.
+This implies that a global group definition can be referenced from some group
references which are ordinary, non-hidden, group references, and from other
hidden group references.
+Hence, the contents of the group definition are not themselves inherently
hidden or not, as the hiddenness is based on how the group is referenced.
+
+Of course if all uses of a particular group definition are hidden group
references, then the schema compiler could provide that all elements within
that group are always hidden.
+This is expected to be fairly common, as often hidden groups are designed to
contain the hidden reprsentation details of data that has a simpler logical
appearance in the infoset.
+
+Nevertheless, Daffodil must accomodate that the same group definition can be
used both in hidden and non-hidden manners.
+
+=== Static Analysis by the Schema Compiler (i.e., there is none)
+
+The schema compiler could provide information about whether a given term of
the schema is always hidden, sometimes hidden, or never hidden.
+However, the runtime overheads are anticipated to be so small that
optimizations based on this information are expected to be unnecessary.
+Hence, no static analysis of whether a term is always/sometimes/never hidden
is performed.
+
+=== Runtime Principles of Operation
+
+The implementation takes advantage of the processor (parser/unparser)
traversal of the DFDL schema components following an in-order traversal
pattern, so that a stack-discpline can be used when traversing hidden-group
references. No actual stack is needed, only a counter.
+The details are:
+
+* The PState/UState contains a counter (initial value 0) of the number of
hidden group references currently in the nest.
+* When parsing/unparsing a hidden group ref, this counter is incremented.
+When unwinding from that unparse, the counter is decremented.
+No action is required for non-hidden group references.
+* When an infoset element is created, the setHidden() method is called if the
counter is non-zero.
+* At the end of processing, the counter should be zero.
+
+=== Runtime Data Structures - Runtime 1
+
+In Daffodil's "Runtime 1" backend, these are the runtime data structure
features to support hidden group references:
+
+* Infoset elements have an isHidden() boolean method, and a setHidden() setter.
+* The PState/UState contains the counter described above.
+* The ModelGroupRuntimeData object contains an isHidden flag that is true if
the term corresponds to a hidden group reference.
+It is this flag that is inspected to determine if the PState/UState counter
should be incremented/decremented or not.
+* The increment/decrement logic is centralized in Parser.parse1() method and
Unparser.unparse1() method.
+These inspect the current TermRuntimeData object, and when it is a
ModelGroupRuntimeData, they check the isHidden flag.
+* At the end of processing where checking that the various runtime-stacks and
pools are properly emptied/restored, an additional check is done to insure the
hidden counter of the PState/UState has been returned to zero.
+This check need only be performed on normal exits. If the processor terminates
abnormally, we needn't check.
+* The debugger/trace should display, as part of the state, whether the current
context isHidden (counter > 0) or not.
+* Parser backtracking must save/restore the counter.
+* Unparser suspended UStates do not need to copy the counter value. That is,
this counter need only exist in UStateMain.
+** This holds because infoset elements are created as part of the state of
suspensions, and those infoset elements will have setHidden() called or not
depending on the counter. Hence, the counter value of a suspended UState is not
needed when evaluating a suspension.
+
+=== Testing
+
+* Tests cover where the same group definition is used both hidden and
non-hidden within the same schema.
+* Tests cover this for hidden sequences and hidden choices.
+Note that a hidden group references always uses an xs:sequence element, but
the referenced group can be a choice group definition.
+
+
+== Transition Plan From Daffodil 2.5.0
+
+(Delete this section once implementation is complete.)
+
+The existing v2.5.0 isHidden implementation consists of code that populates a
member of class ElementRuntimeData (Runtime 1) isHidden member.
+
+All code for populating that member, and the member itself, can be removed.
Anything it calls which is used only for that purpose can be removed.
+
+Any code that assumes the isHidden characteristic of elements is static
information must be removed.
+
+All access to whether an element isHidden must call the isHidden method of
InfosetElement (aka DIElement) object instead of on the ERD
(ElementRuntimeData) object.
+
+There is no conversion of existing code to a new design, as the entire old
mechanism, and its underlying assumptions, are being removed, and fully
replaced by the new mechanism.
diff --git a/site/dev/design-notes/namespace-binding-minimization.adoc
b/site/dev/design-notes/namespace-binding-minimization.adoc
new file mode 100644
index 0000000..92ccfbb
--- /dev/null
+++ b/site/dev/design-notes/namespace-binding-minimization.adoc
@@ -0,0 +1,464 @@
+:page-layout: page
+:keywords: schema-compiler performance alignment optimization
+// ///////////////////////////////////////////////////////////////////////////
+//
+// This file is written in AsciiDoc.
+//
+// If you can read this comment, your browser is not rendering asciidoc
automatically.
+//
+// You need to install the asciidoc plugin to Chrome or Firefox
+// so that this page will be properly rendered for your viewing pleasure.
+//
+// You can get the plugins by searching the web for 'asciidoc plugin'
+//
+// You will want to change plugin settings to enable diagrams (they're off by
default.)
+//
+// You need to view this page with Chrome or Firefox.
+//
+// ///////////////////////////////////////////////////////////////////////////
+//
+// When editing, please start each sentence on a new line.
+// See
https://asciidoctor.org/docs/asciidoc-recommended-practices/#one-sentence-per-line[one
sentence-per-line writing technique.]
+// This makes textual diffs of this file useful in a similar way to the way
they work for code.
+//
+// //////////////////////////////////////////////////////////////////////////
+
+== Namespace Binding Minimization
+
+=== Introduction
+
+DFDL schemas are XML schemas and so DFDL inherits the namespace system of XML
and XML Schema for composing large schemas from smaller ones, for reusing
schema files, and for managing name conflicts.
+
+A DFDL Infoset isn't necessarily represented as XML however.
+Some representations won't have any ability to deal with namespaces (JSON for
example), and so Daffodil will sometimes issue warnings when compiling a schema
if the namespace usage will not allow unambiguous representation without
namespaces.
+
+Most representations of DFDL Infosets will, like XML, use some representation
of the namespaces of elements, and in textual forms this will most commonly be
by way of namespace prefixes.
+XML is not the only representation that uses namespaces, however, so this
should not be taken as an entirely XML-specific discussion.
+
+There are these goals for namespace-binding minimization.
+
+. Clarity: Infosets that have redundant namespace bindings are very hard to
read and understand, and require namespace-binding-aware tooling to compare
them, or clumsy post-processing to remove the excess bindings.
+
+. Performance: Attaching an element to the infoset at runtime should take
constant time.
+
+. Consistency: The prefix-to-namespace bindings used should be drawn from
those expressed on the DFDL schema by the schema author, and the prefixes used
and bindings introduced when an element is attached to the infoset should be
consistent with the set of namespace prefix definitions in place at the point
where the element's declaration lexically appears in the DFDL schema.
+
+These goals are in some tension.
+Consider 4 elements named A, B, C, and Q.
+Suppose element A contains element B, which contains element Q.
+Suppose elsewhere in the same infoset element A contains element C which
contains element Q.
+From the perspective of element Q, the set of namespace bindings surrounding
it are those from (A, B) or those from (A, C).
+Suppose element Q requires, and introduces, a namespace with prefix "qns"
bound to namespace "urn:Q_Namespace".
+Suppose element C also introduces this same namespace binding.
+Then when element Q appears inside element B, its namespace binding for "qns"
is needed.
+But when element Q appears inside element C, its namespace binding for "qns"
is redundant with one already provided by element C.
+
+The conclusion is that the minmal set of namespace bindings introduced by an
element depends on the nesting of elements.
+
+The basic technique is to store, on the runtime element data structure
(DPathElementCompileInfo), the complete set of lexical namespace bindings
present for the element declaration in the DFDL schema document.
+
+==== Namespace Bindings come from the Element Declarations, not Element
References
+
+Consider two schema files:
+
+```xml
+<!-- file foo.dfdl.xsd -->
+<schema
+ xmlns:pre1="namespace1"
+ xmlns:ns1="differentNamespace">
+ <import namespace="namespace1" schemaFileLocation="bar.dfdl.xsd"/>
+ ...
+ <element name="root">
+ <complexType>
+ <sequence>
+ <element ref="pre1:foo" maxOccurs="unbounded"/>
+ </sequence>
+ </complexType>
+ </element>
+
+</schema>
+
+<!-- file bar.dfdl.xsd -->
+<schema targetNamespace="namespace1"
+ xmlns:ns1="namespace1"
+ xmlns:pre1="someOtherNamespace">
+
+ <element name="foo" ..../>
+</schema>
+```
+In the above we have a conflict over the use of the prefix "pre1".
+Now consider an XML document corresponding to this with element 'foo' inside
the 'root' element:
+
+```xml
+<root xmlns:pre1="namespace1"
+ xmlns:ns1="differentNamespace">
+ ...
+ <ns1:foo
+ xmlns:ns1="namespace1"
+ xmlns:pre1="someOtherNamespace">
+ ...
+ </ns1:foo>
+ ...
+</root>
+```
+
+Notice that element 'foo' appears inside 'root' using the "ns1" prefix but it
also introduces a binding for that prefix which supercedes that of the
enclosing environment.
+The prefix "pre1" cannot be used for element 'foo' because in the namespace
bindings of the bar.dfdl.xsd schema document, the "pre1" prefix is bound to
"someOtherNamespace".
+
+This example illustrates that each element must use, and introduce, the
lexically defined prefixes from the point where the element is declared.
+Not from the point of element reference.
+
+Since element 'foo' is recurring, it's unfortunate, but every single instance
will, textualized, carry these namespace bindings.
+E.g.,
+
+```xml
+<root xmlns:pre1="namespace1"
+ xmlns:ns1="differentNamespace">
+ ...
+ <ns1:foo
+ xmlns:ns1="namespace1"
+ xmlns:pre1="someOtherNamespace">
+ ...
+ </ns1:foo>
+ <ns1:foo
+ xmlns:ns1="namespace1"
+ xmlns:pre1="someOtherNamespace">
+ ...
+ </ns1:foo>
+ <ns1:foo
+ xmlns:ns1="namespace1"
+ xmlns:pre1="someOtherNamespace">
+ ...
+ </ns1:foo>
+
+ ...
+</root>
+```
+
+This problem is not one Daffodil strives to solve.
+The schema author can simply avoid these sorts of name clashes and this
problem will not occur.
+Automatic renaming of prefixes to avoid this problem is considered
unwarranted, as it will confuse users.
+
+
+
+=== Namespace Minimization
+
+==== Only Element Namespace Prefix Bindings
+
+Only namespace definitions associated with element declarations need to ever
be considered for the infoset.
+Namespace definitions that define prefixes used for type, group, format, or
escapeScheme references are not included
+in the namespace definitions carried on infoset elements.
+
+==== Avoid Prefix "tns" (or Other Common Ambiguous Names) When Possible
+
+Many DFDL schemas will define prefix "tns" to be that schema document's target
namespace.
+
+This same problem could occur for other prefixes. The "tns" convention is just
a common one.
+
+If the prefix "tns" is ambiguous across the schema set (also used by other
schema documents, but for different namespaces),
+then its use is undesirable.
+
+If a schema document defines both "tns" and other prefixes for the target
namespace, then another prefix is preferred for
+use as the prefix of elements created from declarations in that schema
document.
+
+This situation arises commonly for the default namespace (no prefix, defined
by `xmlns="namespaceURI"`). If
+this is ambiguous across the schema set (highly likely), then an available
alternative prefix (from that schema document)
+is preferred.
+There is actually no difference between using "tns" and the default namespace.
Both are just commonly used, and frequently ambiguous across the schema set.
+
+(This all generalizes to any prefix which is ambiguous across the schema set.)
+
+==== Corner Cases
+
+There are numerous ways schema authors can use and reuse namespace prefixes
that can lead to cluttered infosets.
+
+Other than minor heuristics to choose among alternative available prefix
definitions, Daffodil does not try to improve on the
+namespace prefix problem on behalf of schema authors.
+
+===== No Prefix At All
+A legal schema document can define a target namespace and define no prefix for
it at all.
+
+In this case, the only way elements of that schema document can be used is
some other schema document must provide a prefix definition.
+Daffodil chooses a prefix from those available in the schema set
(deterministically - e.g., shortest prefix, with ties resolved by alphanumeric
order, avoiding ambiguous prefixes like "tns").
+
+CAUTION: TBD: Should we issue a warning or even make this a schema definition
error?
+
+===== Only "tns" or Only the Default Namespace
+A schema defines "tns" (or other very common prefix like "pre" or "p") for its
target namespace, but defines no other prefix that can be used as an
alternative.
+
+Daffodil does nothing here to improve on the situation where there will be
many inner namespace re-bindings of the "tns" like:
+
+```xml
+<tns:foo xmlns:tns="namespace1">
+ <tns:bar xmlns:tns="namespace2">
+ <tns:quux xmlns:tns="namespace3">
+ ...
+ </tns:quux>
+ </tns:bar>
+</tns:foo>
+```
+
+This sort of thing can happen if schema authors make extensive use of the
default namespace (no prefix).
+For example, a schema document can define a target namespace, then define that
namespace to be the default, with no other namespace prefix defined.
+In that case you can have infosets like this:
+
+```xml
+<foo xmlns="namespace1">
+ <bar xmlns="namespace2">
+ <quux xmlns="namespace3">
+ ...
+ </quux>
+ </bar>
+</foo>
+```
+
+==== Undefining the Default Namespace
+
+Many schemas will not define the default namespace.
+
+If an element is defined in a schema which defines the default namespace to a
URI, and nested with that element are other elements from schemas that
+do NOT have a definition for the default namespace, then if there are
unqualified names in the latter schema that are supposed to be in _no
namespace_,
+the default namespace must be explicitly undefined.
+
+Consider two schema files:
+
+```xml
+<!-- file foo.dfdl.xsd -->
+<xs:schema
+ xmlns="default1"
+ xmlns:ns2="namespace2" >
+
+ <xs:import namespace="namespace2" schemaFileLocation="bar.dfdl.xsd"/>
+ ...
+ <xs:element name="root">
+ <xs:complexType>
+ <xs:sequence>
+ <xs:element ref="ns2:foo" maxOccurs="unbounded"/>
+ </xs:sequence>
+ </xs:complexType>
+ </xs:element>
+
+</xs:schema>
+
+<!-- file bar.dfdl.xsd -->
+<schema targetNamespace="namespace2"
+ xmlns:ns2="namespace2"
+ elementFormDefault="unqualified"
+ xmlns="http://www.w3.org/2001/XMLSchema"> <!-- default namespace used for
schema -->
+
+ <element name="foo">
+ <complexType>
+ <sequence>
+ <element name="bar" .../><!-- no namespace -->
+ </sequence>
+ </complexType>
+ </element>
+</schema>
+```
+In this case, an instance of root, containing 'foo' containing 'bar' requires:
+```xml
+<root
+ xmlns="default1"
+ xmlns:ns2="namespace2">
+ <ns2:foo>
+ <bar xmlns=""> <!-- undefine default -->
+ ...
+ </bar>
+ </ns2:foo>
+</root>
+```
+This undefine shown above is needed even though the bar.dfdl.xsd schem has a
default namespace definition, because the local element names are in
no-namespace.
+The default namespace in bar.dfdl.xsd is actually not used for reference to
elements.
+So it is tantamount to not having the default namespace defined in
bar.dfdl.xsd at all.
+
+CAUTION: Some of this minimization may happen upon conversion to XML, and may
happen automatically depending on XML libraries.
+That is, an element with no namespace displayed in a context which has a
default namespace definition may automatically insert `xmlns=""`.
+
+== Namespace Binding Minimization Algorithm
+
+The technique described here assumes that one needs to render the infoset to
XML text using standard printing, i.e., using no special XML library.
+Hence, every namespace prefix binding must be explicitly represented in the
output XML text.
+
+The basics are:
+
+. For every element declaration, capture the lexical namespace scope
(`scala.xml.NamespaceBinding`) from its element declaration XML in its schema
document.
+Save this on the runtime data structure for the element. In Runtime 1, this
would be the DPathElementCompileInfo. (This is longstanding functionality in
Daffodil since before version 1.0.0)
+
+. Excepting on the Root, remove any namespace binding that is unambiguous
across the schema, and which appears on the root.
+
+. For each element declaration, the remaining namespace bindings and assigned
prefix to be used are assigned based on the minimization rules describe above
(e.g., about avoiding "tns" when possible.)
+
+That is all that is done at schema compile time and at parse time up to the
point where a textual representation (such as XML) needs to be output.
+
+The DFDL Infoset tree is constructed with InfosetElement nodes that point to
this compile time DPathElementCompileInfo structure, and no processing of
namespace bindings occurs.
+
+However when converting an infoset element into XML examine the namespace
bindings of the element and those of the enclosing parent element.
+
+.. Any that are redundant across the two are dropped.
+.. New definitions introduced by the child are output as bindings
+.. Redefinitions are output as bindings
+.. If the element has no namespace, and the parent (or any super-parent) has a
default namespace binding, then add an undefine binding for the default
namespace.
+
+This algorithm requires non-constant-time (worst case) processing at runtime;
however, there is no overhead unless there are ambiguities among the namespace
bindings and when namespace bindings at nodes beneath the root are required.
+In addition, the number of such cases in any _real_ schema will be small, so
the algorithmic complexity worst-case here is far less important than the
constant factor here.
+Attaching an element to the infoset is a common operation.
+These namespace binding machinations have the potential to be equally costly,
per binding, to the general overhead of attaching the infoset element node.
+
+Our standard design principle is, however, to not worry about overheads like
this which are often not going to occur in real schemas, unless performance
profiling shows them to be a hot-spot.
+
+Sensibly-designed schemas will have no overhead from this namespace-binding
combining.
+
+=== Converting the DFDL Infoset to XML in One Pass (Streaming)
+
+Note that eliminating prefix definitions that are unused in a particular XML
document is not compatible with streaming.
+It requires two passes to determine if a prefix is ever used to decide whether
it can be omitted or must be included.
+
+The only alternative to this is to introduce new namespace prefix definitions
only at their point of use.
+That would, however, be inconsistent with our goal of clarity and avoiding
namespace prefix clutter in the schema.
+
+It is preferable to output extra namespace bindings on the root element than
to litter the document with namespace bindings at
+interior XML elements.
+
+Daffodil aspires to streaming parsing and unparsing. A streaming parser will
output parts of the infoset without waiting to
+know if children will eventually appear that require the namespace prefix
definitions.
+As a result, all namespace prefix definitions which _may_ be required are
included.
+Most commonly this will result in extra unused namespace prefix definitions
having been output on the start tag of the root element.
+
+=== API XML-Fragment Mode - For Clarity: Avoid Namespace Bindings on the Root
+
+When using the message streaming API and converting the parse Infoset to XML,
each message is created as XML text by the parser and associated
InfosetOutputter, and converting one relatively small message to XML may result
in far more characters used to represent the namespace bindings on the root of
the message than the rest of the message occupies in XML text.
+
+The API provides a method to enable XML-fragment mode.
+In this mode a method can be called to retrieve namespace prefix bindings that
would appear on the root (i.e, on the root XML element of each message) if a
complete XML data document were being created.
+Subsequent calls to parse in XML-Fragment mode create XML which has no
namespace bindings on the root element of each message.
+This is an XML fragment because it lacks the namespace bindings needed for it
to be a complete XML document.
+This is in effect leaving it up to the caller whether and when to append the
namespace bindings to the XML text.
+
+This option is provided because the caller may not want to construct complete
XML documents, and the namespace bindings in use for the XML may be well-known
by the processing system.
+
+This is less a performance optimization (XML text is really verbose, and this
optimization is only scratching the surface).
+This feature is about clarity and coping with XML when using it for small data
documents corresponding to small communications messages.
+Small XML data documents can be overwealmed by the volume of namespace
definitions.
+This is particularly likely if they are for small data messages created from
large DFDL schemas with many schema documents and many namespaces.
+
+As an example consider:
+```xml
+
+<gn:Feature xmlns:cc="http://creativecommons.org/ns#"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:gn="http://www.geonames.org/ontology#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:wgs84_pos="http://www.w3.org/2003/01/geo/wgs84_pos#"><rdfs:isDefinedBy>sws/3/about.rdf</rdfs:isDefinedBy><gn:name>Zamīn
Sūkhteh</gn:name><gn:alternateName> [...]
+```
+This is almost impossible to understand, given that the first 1/3 of it is
just namespace bindings.
+
+Without the namespace bindings it is easier. It looks like:
+```xml
+<gn:Feature><rdfs:isDefinedBy>sws/3/about.rdf</rdfs:isDefinedBy><gn:name>Zamīn
Sūkhteh</gn:name><gn:alternateName><lang>fa</lang><name>Zamīn
Sūkhteh</name></gn:alternateName><gn:alternateName><lang>fa</lang><name>زمين
سوخته</name></gn:alternateName><gn:featureClass>ontology#S</gn:featureClass><gn:featureCode>ontology#S.CRRL</gn:featurecode><gn:countryCode>IR</gn:countryCode><wgs84_pos:lat>32.45831</wgs84_pos:lat><wgs84_pos:long>48.96335</wgs84_pos:long><gn:parentFeature>sws/3202991/</gn:
[...]
+```
+
+With line endings after each element end tag, it is quite easy to understand.
+```xml
+<gn:Feature><rdfs:isDefinedBy>sws/3/about.rdf</rdfs:isDefinedBy>
+<gn:name>Zamīn Sūkhteh</gn:name>
+<gn:alternateName><lang>fa</lang><name>Zamīn Sūkhteh</name></gn:alternateName>
+<gn:alternateName><lang>fa</lang><name>زمين سوخته</name></gn:alternateName>
+<gn:featureClass>ontology#S</gn:featureClass>
+<gn:featureCode>ontology#S.CRRL</gn:featurecode>
+<gn:countryCode>IR</gn:countryCode>
+<wgs84_pos:lat>32.45831</wgs84_pos:lat>
+<wgs84_pos:long>48.96335</wgs84_pos:long>
+<gn:parentFeature>sws/3202991/</gn:parentFeature>
+<gn:parentCountry>sws/130758/</gn:parentCountry>
+<gn:parentADM1>sws/127082/</gn:parentADM1>
+<gn:nearbyFeatures>sws/3/nearby.rdf</gn:nearbyFeatures>
+<gn:locationMap>3/zamin-sukhteh.html</gn:locationMap>
+</gn:Feature>
+```
+
+The XML Infoset Inputter also has a feature allowing an API method to supply
the root-level namespace bindings once, not on the root element of every
XML-fragment delivered for unparsing.
+
+The symmetry of the API insures that one can unparse the XML output from a
parse that is creating XML in this fragment mode.
+
+The Daffodil CLI has an option to add XML-fragment mode to message streaming
behavior for parsing and unparsing.
+
+== Transition Plan from Daffodil 2.5.0
+
+(Delete this section once implementation is complete.)
+
+The existing design in Daffodil 2.5.0 assumes that every element declaration
is unique, not shared, and so the entire path from that element's declaration
object to the root element is well known and unique.
+
+=== QNames and Name Resolution
+
+
+The QNameBase.scala and ResolvesQNames traits do not need modification, but
their function needs to be understood to know what *does* have to change.
+
+Below shows the classes used for reprsenting the names of named things in a
DFDL schema, and for referring to named things in a DFDL schema:
+
+[plantuml]
+....
+
+abstract class QNameBase {
+ "shared implementation"
+}
+abstract class NamedQName {
+def matches(RefQNameBase)
+}
+abstract class RefQNameBase {
+def matches(NamedQName)
+}
+class RefQName
+class StepQName
+class LocalDeclQName
+class GlobalQName
+NamedQName -up-|> QNameBase
+RefQNameBase -up-|> QNameBase
+LocalDeclQName -up-|> NamedQName
+GlobalQName -up-|> NamedQName
+RefQName -up-|> RefQNameBase
+StepQName -up-|> RefQNameBase
+RefQNameBase -right-> NamedQName : "ref"
+
+
+....
+
+We distinguish names given to objects by their definitions/declarations from
names referring to those by way of the NamedQName and RefQName distinction.
+A NamedQName is created for a named thing. A RefQName is created for points of
reference to it. The two can never be confused because you can't create a
RefQName from a declaration/definition (that would create a NamedQName), and
you cannot create a NamedQName from the "ref" property (that will create a
RefQName). Furthermore, you can't test two NamedQNames to see if they match,
you have to test if a RefQName refers to a NamedQName.
+
+All these objects are created via methods on the singleton QName object.
+
+==== ResolvesQNames trait
+
+This trait provides the resolveQName(qnString: String) : RefQName method which
calls the QName.resolveRef() method supplying the necessary namespace binding
information from the XML of the schema component that is needed to resolve
QName prefixes to specific namespaces.
+
+This is done by wau of the public namespaces member. The scala.xml.Node class
has a scope member which returns type NamespaceBinding. While singular, this is
not a single binding, but a chained data structure where a first namespace
binding object is chained to a second and subsequent one. This not an ordinary
Seq() style collection. Hence the ResolvesQNames trait provides member:
+
+```scala
+val namespaces = xml.scope
+```
+
+=== ElementBase Changes
+
+The existing ElementBase class has members, all of which are subject to
revision/removal.
+
+* Outputs
+** thisElementsNamespace - target namespace of defining schema or no-namespace
if a local element decl with elementFormDefault = "unqualified", or no target
namespace.
+** thisElementsNamespacePrefix - This is the prefix that will be used for this
element when converting to XML. This will change to be different from the
provided namespace prefix only if we choose a non-"tns" equivalent, or generate
a namespace prefix for particular cases.
+** thisElementsRequiredNamespaceBindings - This will be removed. This can't be
statically computed like this anymore.
+** protected minimizedScope - This will be removed or made optional. Ths can't
always be statically computed like this anymore. This will be computed instead
at runtime.
+
+* Minimization algorithm machinery - these members are likely no longer
needed. What they were computing statically and probably very inefficiently
needs to be done at runtime (sometimes), and far more efficiently.
+** private emptyNSPairs
+** private myOwnNSPairs
+** private myParentNSPairs
+** private myUniquePairs
+** private def pairsToNSBinding
+** private parentMinimizedScope
+
+=== Runtime 1 changes
+
+* DPathElementCompileInfo
+** Cleanup: `val name: String` should be removed as a constructor argument and
be obtained from the namedQName via `def name = namedQName.local`
+
+* ElementRuntimeData
+** namespaces - should stay the same.
+** targetNamespace - should stay the same
+** targetNamespacePrefix - computation of this may/will change in case of,
e.g., choosing one that is not "tns" or other ambiguous prefix.
+** thisElementsNamespacePrefix - computation of this may/will change similarly
to avoid "tns" or other undesriable/ambiguous prefix.
+** minimizedScope - likely changes per the "namespace binding minimization
algorithm" above. Should be renamed to reflect proper usage as a statically
computed part of namespaces that must be incorporated at runtime when an
infoset element is attached to a parent. The new name might be
"nonRootMovedScopes" or soemthing to reflect that these are bindings which
could not be statically just be relocated onto the root element. This will be a
primary input to the new runtime algorithm that a [...]
diff --git a/site/dev/design-notes/term-sharing-in-schema-compiler.adoc
b/site/dev/design-notes/term-sharing-in-schema-compiler.adoc
new file mode 100644
index 0000000..a8d54a5
--- /dev/null
+++ b/site/dev/design-notes/term-sharing-in-schema-compiler.adoc
@@ -0,0 +1,669 @@
+:page-layout: page
+:keywords: schema-compiler performance alignment optimization
+// ///////////////////////////////////////////////////////////////////////////
+//
+// This file is written in AsciiDoc.
+//
+// If you can read this comment, your browser is not rendering asciidoc
automatically.
+//
+// You need to install the asciidoc plugin to Chrome or Firefox
+// so that this page will be properly rendered for your viewing pleasure.
+//
+// You can get the plugins by searching the web for 'asciidoc plugin'
+//
+// You will want to change plugin settings to enable diagrams (they're off by
default.)
+//
+// You need to view this page with Chrome or Firefox.
+//
+// ///////////////////////////////////////////////////////////////////////////
+//
+// When editing, please start each sentence on a new line.
+// See
https://asciidoctor.org/docs/asciidoc-recommended-practices/#one-sentence-per-line[one
sentence-per-line writing technique.]
+// It is acceptable to start each sentence on a new line, but then wrap the
lines to a reasonable length.
+// It's the starting on a new line that matters.
+//
+// This makes textual diffs of this file useful in a similar way to the way
they work for code.
+//
+// //////////////////////////////////////////////////////////////////////////
+
+== Term Sharing in the Schema Compiler
+
+=== Introduction
+
+The DFDL language has a composition property known as _referential
transparency_.
+It is not supposed to matter whether you lift a part of the schema out
+and create a reusable group, reusable type, or reusable element from
+it.
+
+Because DFDL (version 1.0) does not allow recursive definitions of any
+kind, it is theoretically possible to implement this by treating all
+reusable definitions in the language like macros.
+I.e., inline-expand all definitions at their point of use.
+This will work for schemas up to some size. However, it leads to an
+unacceptable expansion in schema compilation time and space, as the
+size of the schema once all these inline substitutions have been
+performed can be exponentially larger than the original
+non-substituted schema.
+
+To avoid this undesirable combinatorial explosion, the Daffodil schema
+compiler arranges to achieve the same macro-like semantics, while
+still sharing the core parts of the reusable components so they need
+only be compiled once for each unique way the component is used.
+
+This is also part of eventually enabling an extension of the DFDL
+language to allow recursive definitions.
+
+The dfdl:alignment property is one of the largest contributors to
+complexity in sharing objects by the schema compiler.
+
+Alignment will be used as the example in this design note to
+illustrate the schema compiler behavior.
+
+=== DSOM Term Objects
+
+DSOM is the Daffodil Schema Object Model.
+Within this model, the components from a DFDL schema that actually have
+representations in the data stream or are computed, are called _terms_ and
DSOM models them as
+subclasses of the class/trait Term.
+Terms which are computed (via dfdl:inputValueCalc) are said to have _no
representation_ in the data stream.
+
+For this discussion we are concerned only with _concrete_ terms, meaning those
that do have representation.
+
+The classes in Daffodil that are used for concrete terms are:
+
+* ElementRef
+* Root (A degenerate element ref to the root element)
+* LocalElementDecl
+* the quasi-elements:
+
+** PrefixLengthQuasiElementDecl (Fake local element decl used for the
+ length fields of Prefixed-length types)
+
+** RepTypeQuasiElementDecl (Fake local element decl used as the
+ representation of elements that are computed via the dfdl:repType
+ mechanism.)
+
+* Sequence
+* SequenceGroupRef
+
+* ChoiceBranchImpliedSequence (The sequence that is inserted when a
+ choice branch is a single element decl/ref)
+
+* Choice
+* ChoiceGroupRef
+
+A
https://cwiki.apache.org/confluence/display/DAFFODIL/DFDL+Schema+Object+Model+%28DSOM%29+with+UML[UML
Diagram] is available showing these classes.
+
+=== Term Representation Regions
+
+Consider the representation of a term, which we call the term's
+_region_, as appearing between two other term regions in the data
+stream as illustrated below:
+
+[ditaa]
+....
+---+ +-----------------------------------------------+ +---
+ | | term | |
+ | | | |
+ | | +---------------+ +--------+ +--------------+ | |
+...| | | unique before | | shared | | shared after | | |...
+ | | +---------------+ +--------+ +--------------+ | |
+ --+ +-----------------------------------------------+ +--
+....
+In the above, lower position bits are to the left.
+Higher position bits to the right.
+In the diagram, we see that the term's region consists of 3
+sub-regions, named _unique before_, _shared_, and _shared after_. Each
+term appears in a context of possible additional terms before it and
+after it which are shown with ellipsis above.
+
+A term can be an element or a model-group (sequence or choice).
+By far the most common situation is that a term appears inside a model group.
+There are two exceptions. The _root_, and the model-group of a complex type.
+
+The core idea here is that we are separating the representation of a
+term into unique and shared parts.
+The unique region is affected by the surrounding model group context.
+The shared regions are, in many cases, sharable across instances of
+this term and is independent of the surrounding model-group context.
+So if the term was defined by way of a reusable global definition,
+then the goal is that there need be only one implementation of the
+shared part(s).
+
+There are some limits to this sharing.
+A single implementation is not always achievable, but generally one or
+a small number of implementations are possible.
+
+=== Limits to Sharing
+
+Consider if this term above was defined by way of a XSD group
+reference.
+The DFDL properties expressed on the group reference must
+be combined with those expressed on the group definition, to form the
+complete set of properties associated with the term.
+This combining creates situations where a given shared group definition can
have
+wildly different representations for different uses.
+Consider this:
+
+```xml
+<group name="g">
+ <sequence>
+ <element name="a" type="xs:int"/>
+ <element name="b" type="xs:int"/>
+ <element name="c" type="xs:int"/>
+ </sequence>
+</group>
+
+....
+
+<element name="e">
+ <complexType>
+ <sequence>
+ ... some stuff here ...
+ <group ref="tns:g" dfdl:separator="|"/> <!-- separated -->
+ ... more stuff here ...
+ <group ref="tns:g"/> <!-- not separated -->
+ ... yet more stuff here ...
+ </sequence>
+ </complexType>
+</element>
+```
+
+The two uses of the group definition for 'g' are wildly different, in
+that one has separators, the other does not.
+
+This is a rather extreme example, but DFDL implementations must allow
+for this.
+
+Experience with DFDL schemas to-date (2020-01-21) is that this is very
+rare and appears only in test cases designed to exercise it.
+However, less extreme versions of this are possible, such as different
+uses all being separated and delimited textual data, but where the
+distinct uses do not all use the same separator characters, so the
+separator to be used would be specified differently on each group
+reference. Another expected variation could be separated sequence
+groups where some uses are infix separator and others are prefix or
+postfix separator.
+
+Anecdotally, the vast majority of schemas seen to-date reuse groups
+entirely, that is specifying no properties at all on the group
+reference.
+
+==== Property Environment or PropEnv
+
+We define the _property environment_ or _PropEnv_ of a term, as the
+complete set of properties for that term, combining them from all the
+schema components that define the term. This can include element
+references, group references, local or global element declarations,
+global group definitions, and local and global type definitions.
+
+A key observation is that the PropEnv of a term is entirely defined by:
+
+* local properties (including any dfdl:ref property) on the schema component
for the term.
+** E.g., for an element reference, the local properties expressed on the
element reference itself.
+* default format object at top level of the schema where the term lexically
appears.
+* definition object being referenced.
+** E.g., for an element reference, the global element declaration being
referenced.
+
+So if two terms, say group references, have the same PropEnv, then we
+can share much about their definitions when implementing them.
+
+Hence, when constructing the Daffodil schema compiler's representation
+for a term, we can maintain a cache for each global definition, with
+the key of the cache being the PropEnv.
+When a given term uses a global definition, if the point of use has
+the same PropEnv as another, both can share the part of the term that
+is context independent, that is, the shared region.
+
+The actual components of the PropEnv of a term are slighly more
+complicated than described above.
+The table below gives the components of the PropEnv for the various
+kinds of terms. Note that the "local properties" below includes any
+dfdl:ref property if it appears, but equality of any property which
+has as its value a QName, like dfdl:ref, the property value is based
+on a resolved QName, not the QName string.
+
+[cols="2,6a"]
+.PropEnv Components
+|===
+|Term Definition (one PropEnv subtype per) | PropEnv Components
+
+|Local Element Decl with type reference
+|
+* Local properties expressed on the Local Element Decl (Set of property name +
value pairs)
+* Lexically enclosing default format (object ref)
+* Type definition (object ref) or primitive type (object ref)
+| Local Element Decl with anonymous Simple Type Definition
+|
+* Combined set of local properties expressed on the Local Element Decl and the
anoymous Simple Type definition.
+* Lexically enclosing default format (object ref)
+* Base simple type definition (object ref), or primitive type (object ref)
+|Local Element Decl with anonymous complex type definition
+|
+* Local properties expressed on the local element decl
+* Lexically enclosing default format (object ref)
+* Anonymous complex type (object ref)
+|Element Reference to Global Element Decl
+|
+* Local properties expressed on the element reference
+* Lexically enclosing default format (object ref)
+* Global Element Decl (object ref)
+|===
+
+Lookups by PropEnv compare sets by equality of members, and object
+references by pointer equality i.e, eq comparison.
+
+This definition of PropEnv can be improved by memoizing the
+construction of default format objects, so that equivalent default
+formats from different schema documents are represented by the same
+object.
+That is, so that multiple schema documents each containing:
+```xml
+ <dfdl:format ref="prefix:formatName"/>
+```
+are instantiated as the same object when the QName resolves to the same format
object.
+
+==== Details of Unique and Shared Regions
+
+The division of the reprsentation into the unique part which appears
+before the shared parts indicates where we can share implementation,
+and where we must have unique context-specific implementation for the
+term.
+
+To understand this better, the below breaks down the sub-regions of a term
further.
+First we look at the details of the unique-before region:
+
+//
+// DITAA max line length (for github) ruler
-------------------------------------------------------------------------|
+//
+
+[ditaa]
+....
+ +--+ +-----------------------------------------------+
+--+
+ | | term | |
+ | | | |
+ | | +---------------+ +--------+ +--------------+ | |
+ ...| | | unique before | | shared | | shared after | |
|...
+ | | +---------------+ +--------+ +--------------+ | |
+ +-+ +-----------------------------------------------+ +-+
+ | |
+ | |
++---------------------------+
+----------------------------------+
+|
|
+v
v
++------------------------------------------------------------------------------+
+| +-------------------+ unique before
|
+| | sequence before |
|
+| | +------+ +------+ | +-----+ +---------+ +------+ +------+ +------+ +-----+
|
+| | |prefix| |prefix| | |lSkip| |alignFill| |init'r| |init'r| |prefix| +value|
|
+| | |infix | |infix | | | | | | | MTA | | | |length| | MTA |
|
+| | |sep'r | |sep'r | | | | | | | | | | | | | |
|
+| | | MTA | | | | | | | | | | | | | | | |
|
+| | +------+ +------+ | +-----+ +---------+ +------+ +------+ +------+ +-----+
|
+| +-------------------+
|
++------------------------------------------------------------------------------+
+
+....
+The sub-regions within the unique before region are:
+
+* prefix infix sep'r MTA - mandatory text alignment for a separator in infix
or prefix position.
+When a separator is defined, this region is populated with up to 7 bits to
+obtain text alignment for the characters of the separator in the specified
charset encoding.
+Note that this encoding and that of the term's initator/terminator are not
necessarily the same encoding.
+* prefix infix sep'r - separator when defined and in prefix or infix position.
+This consists of text characters.
+
+* lSkip - leading skip region.
+This is fixed length (commonly 0)
+* alignFill - alignment fill region.
+This dynamically sized region depends on the bit position where it starts.
+* init'r MTA - mandatory text alignment for initiator.
+When an initiator is defined, this region is populated with up to 7 bits to
+obtain text alignment for the characters of the initiator in specified charset
encoding.
+* init'r - initiator.
+This consists of text characters.
+* prefix length - (element terms only) the prefix length.
+Sub-detail of this region is not shown here.
+The PrefixLengthQuasiElementDecl class is used to represent this subregion in
DSOM.
+* value MTA - (element terms only, simple type only) mandatory text alignment
for the value.
+When the value has text representation this insures those characters start
+on the right bit-boundary for characters of the specified charset encoding.
+
+The sizes of the alignment regions (alignFill, and the MTA regions) in
+the above all depend on where the start of this term is.
+That depends on where the prior term ends, if there is a prior term.
+That is, the sizes of the alignment regions depend on the enclosing
+model groups, and what can appear before this term in that nest.
+
+Now we look at an expansion of the shared regions:
+
+[ditaa]
+....
+
+
+ +--+ +-----------------------------------------------+ +--+
+ | | term | |
+ | | | |
+ | | +---------------+ +--------+ +--------------+ | |
+ ...| | | unique before | | shared | | shared after | | |...
+ | | +---------------+ +--------+ +--------------+ | |
+ +-+ +-----------------------------------------------+ +-+
+ | |
+ | |
+ +-------------------------------+
+-----------------+
+ |
|
+ |
|
+ v
v
+ +---------------------+
+---------------------------------------------------+
+ |shared | | shared after
+---------------------+ |
+ | | | | sequence
after | |
+ | +--------+ +------+ | | +------+ +------+ +-----+ | +-------+
+-------+ | |
+ | | content| |elt or| | | |term'r| |term'r| |tSkip| | |postfix|
|postfix| | |
+ | | or | |choice| | | | MTA | | | | | | | sep'r | |
sep'r | | |
+ | | value | |unused| | | | | | | | | | | MTA | |
| | |
+ | | | | | | | | | | | | | | | | |
| | |
+ | +--------+ +------+ | | +------+ +------+ +-----+ | +-------+
+-------+ | |
+ | | |
+---------------------+ |
+ +---------------------+
+---------------------------------------------------+
+
+ ^
+ |
+ |
+ +
+ Known Alignment Point
+
+....
+
+It is key that the shared region can assume the alignment
+fill and MTA regions have been sized per the requirements of this
+term.
+Hence, the shared region is context independent and its implementation - in
schema
+compiler objects and in runtime structure representation, need only be
represented once.
+
+.Corner Case: Interior Alignment
+[IMPORTANT]
+--
+The semantics here aren't identical to those of macro expansion, but the
difference is hard to
+detect.
+
+The
https://opendfdl.github.io/gwdrp-dfdl-v1.0.5_r11/gwdrp-dfdl-v1.0.5-r11-change-tracking.htm[DFDL
Spec (recent draft)]
+//
+// IN URL below, had to escape the underscore with a \ which is not part of
the actual URL.
+//
+includes a
link:https://opendfdl.github.io/gwdrp-dfdl-v1.0.5_r11/gwdrp-dfdl-v1.0.5-r11-change-tracking.htm#\_Toc27061169[section
(23.6 in the linked draft)] on
+this
+// Hmmm. just writing _interior alignment problem_ didn't work. This
passthrough to HTML will work for sure.
+pass:[<i>interior alignment problem</i>].
+
+The technique for approximating alignment info described here is slighly less
precise in its
+analysis because it resets its knowledge of the alignment at each known
alignment point.
+Whereas pure macro expansion could carry forward knowledge of alignment and
perhaps optimize out more
+alignment fill regions.
+
+Hence, it's possible that the technique here will leave some alignment fill
regions in place and thereby some
+formats will experience circular deadlocks when unparsing due to this
variable-length interior-alignment.
+
+The Daffodil test suite includes examples of this interior alignment that
cause unparser circular deadlocks.
+(see testOutputValueCalcVariableLengthThenAlignmentDeadlock)
+
+--
+
+The shared region contains a text or simple binary value, a nil
representation, or inductively, a sequence of terms.
+When the term is an element of complex type, an element unused region can
follow (depending on the length kind),
+and if the term is a choice with dfdl:choiceLengthKind='explicit' then a
choice unused region can follow.
+
+Then, after the shared part, the shared-after region has these subregions:
+
+* term'r MTA - mandatory text alignment for the terminator.
+* term'r - terminator
+* tSkip - trailing skip region. This is fixed length (commonly 0)
+* postfix sep'r MTA - - mandatory text alignment for a separator in postfix
position.
+When a separator is defined, this region is populated with up to 7 bits to
+obtain text alignment for the characters of the separator in the specified
charset encoding.
+Note that this encoding and that of the term's initator/terminator are not
necessarily the same encoding.
+* postfix sep'r - separator when defined and in postfix position.
+
+The terminator MTA region varies in size based on its starting
+alignment, and so it depends on the size of the shared region.
+
+The postfix separator MTA region similarly varies in size based on where the
+tSkip region ended, and so it depends on the size of the shared region, and
+the sizes of the terminator MTA region, terminator, and tSkip regions.
+
+NOTE: The shared-after region does not have any sub-regions the size
+of which depend on the context.
+However, we separate it from the central shared region in order to
+separate all concerns around alignment and fill into separate regions from the
+central shared region, which could perhaps be called the shared
+content region.
+The central shared region therefore does not deal with alignment at all.
+
+The central idea here is that all the regions to the left (before) the known
alignment point are context dependent.
+That is, whether these alignment fill and mandatory text alignment regions can
be statically determined
+in size is dependent on where the prior term ended.
+
+In contrast, the known alignment point is a fresh start for alignment. The
alignment as required by the alignment properties
+will have been achieved at that point. Hence, from there and to the right
(after), everything can be assessed relative to
+this new fresh anchor.
+
+Inductively, even the context dependent regions are not dependent on terms
very far back.
+Let's consider the regions in a straddle of two adjacent terms in a sequence:
+
+//
+// DITAA max line length (for github) ruler
-------------------------------------------------------------------------|
+//
+[ditaa]
+....
++----------------------------------------------+
+------------------------------------------------
+ Term 1 | | Term 2
+ +-------------------------+ | |
+-------------------------------------+
+ +---+ +---------+ |shared after | | | |unique before
| +---+
+ | | shared | | | | | |
| |shared
+ | | | | +---------+ | | | | +---------+
| |
+ | | +-+ +-+ | | +-+ +-+ +-+ |sequence | | | | | |sequence | +-+ +-+ +-+
+-+ +-+ +-+ | |
+ | | | | | | | | | | | | | | | after | | | | | | before | | | | | | |
| | | | | | | |
+ | | | | | | | | | | | | | | | +-+ +-+ | | | | | | +-+ +-+ | | | | | | |
| | | | | | | |
+ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
| | | | | | | |
+ ... | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
| | | | | | | | ...
+ | | +-+ +-+ | | +-+ +-+ +-+ | | | | | | | | | | | | | | | | +-+ +-+ +-+
+-+ +-+ +-+ | |
+ | | | | | +-+ +-+ | | | | | | +-+ +-+ |
| |
+ +---+ +---------+ | +---------+ | | | | +---------+
| +---+
+ ^ +-------------------------+ | |
+-------------------------------------+ ^
+ | | |
|
++----------------------------------------------+
+-------------------------------------------------
+ |
|
+ |
|
+ |
|
+ |
|
+ +
+
+ Known Alignment Point 1
Known Alignment Point 2
+
+....
+
+The Known Alignment Point 2 is only context dependent on the possible prior
terms back to Known Alignment Point 1.
+This is true certaintly when Term 1 is a scalar element or a model-group
itself.
+
+In reality the induction here is more complex because Term 1 might be an
optional element or an array, in which case the prior term before Term 1 may
also be involved
+in computing the region sizes for the Term 2 unique-before regions.
+But this never needs to consider anything further back than the prior scalar
term, or the start of the shared region of the enclosing model group, which
need
+only be considered if Term 2 can be first within its enclosing model group.
+
+CAUTION: Separators may be present even if terms are not, based on
dfdl:separatorSuppressionPolicy and a few other properties.
+The diagrams show sequence before region as if it was part of the term, but
really it may appear even if the term is absent.
+
+
+
+
+
+Given the above, for any term, we can compute its PropEnv, and create
+a distinct shared instance for that term for each unique PropEnv.
+
+Since the shared region contains all the children of any
+child-containing term, sharing this insures that the schema
+compilation space/speed is proportional to, roughly, the size of the
+schema without any non-constant multiplicative factor.
+This insures no combinatorial explosion of compilation time.
+
+=== Optimizing Alignment Regions
+
+A primary task of the schema compiler is to eliminate alignment fill
+(and mandatory text alignment aka MTA) regions that are unnecessary
+because the data can be proven to always be properly aligned.
+
+This is done by computing, for each region, a compile-time alignment
approximation.
+The AlignmentMultipleOf class represents this information.
+AlignmentMultipleOf objects form a mathematical lattice where perfect
+alignment is the bottom of the lattice, no alignment (or alignment
+multiple of 1 bit) is the top of the lattice, and various multiples
+(8, 32, etc.) live in between. For two lattice values a and b, we say
+a < b (a is 'weaker than' b) if a is a multiple of b.
+
+A key observation is this: Each time we create an alignment constraint
+for a shared region, this is not dependent on any other constraint.
+That is, the constraints on alignment do not propagate from term to term.
+A shared region can assume it starts at the alignment that is required
+for the term based on the dfdl:alignment property, the initiator MTA
+if there is an initiator, and the value MTA for text-representation
+simple values.
+
+Hence, terms that appear down-stream of any shared region do *not*
+depend on any constraints propagating from before the shared region.
+This greatly reduces the complexity and the distance across the schema
+that constraint-changes propagate.
+It also insures there is no need for an iterative contraint solver, since
nothing
+about alignment propagates from one term to the next.
+Terms are separated by the fact that the shared region of a term starts from a
+known alignment (specified by its alignment and alignment units
+property).
+
+
+=== Other Context-Dependent Computations
+
+By dividing up the representation of a term into the unique context-dependent
+and shared context-indepenent parts explicitly we provide a home for the
various
+calculations the schema compiler performs in order to do other optimizations.
+
+
+[plantuml, target="unshared-shared", format="png"]
+....
+hide empty members
+
+class UnsharedTerm {
+
+ isAlignmentFillNeeded: Boolean
+ isInitiatorMTANeeded: Boolean
+ isValueMTANeeded:Boolean
+
+ isPrefixInfixSeparatorMTANeeded: Boolean
+
+ isPotentiallyTrailing: Boolean
+
+ needsBitOrderChange: Boolean
+
+ isScanningForDelimiter: Boolean
+ isScanningForTerminator: Boolean
+
+ initiatedContentCheck: Unit
+
+}
+
+class SharedTerm {
+ isTerminatorMTANeeded: Boolean
+ isPostfixSeparatorMTANeeded: Boolean
+ // along with everything else
+
+}
+
+UnsharedTerm -right-> SharedTerm : shared
+
+....
+
+In the above class diagram, members that must be computed on the unshared term
object are
+shown in the class.
+This set of things should always be minimized.
+That is, as few things should be computed on unshared term objects as posible.
+It is assumed that everything else is computed on the shared term object, but
members
+specifically about regions discussed in sections above are called out.
+
+The UnsharedTerm object is effectively state of the enclosing model group DSOM
object.
+It exists in 1 to 1 correspondence (aka it is "owned" by the enclosing model
group object.)
+
+In the Runtime 1 backend, a processor (parser/unparser) is generated for the
shared term object,
+and that processor is referenced from the processor generated for the unshared
term object.
+
+== Transition Plan
+
+IMPORTANT: Once implemented this section can be deleted as it is describing
transition from an older Daffodil version to one that implements the shared
objects.
+
+We can examine several of the context-dependent calculations below.
+
+==== isAlignmentFillNeeded, isInitiatorMTANeeded, isTerminatorMTANeeded,
isPrefixInfixSeparatorMTANeeded, isPostfixSeparatorMTANeeded
+
+These are the output of the analysis of alignments, and they turn on/off the
presence of these regions.
+These are of particular importance to a streaming unparser since they can
require suspension of unparsing until alignment can be determined at runtime.
+
+==== isPotentiallyTrailing
+
+This is context dependent because a given term definition can be reused in
multiple contexts, some of which where it is potentially trailing, and others
of which where it is not.
+
+==== needsBitOrderChange
+
+This is used to optimize out suspensions in the unparser. It is context
dependent because whether the bit order is changing when the processor arrives
on a term depends on what preceded it in the processing.
+
+==== isScanningForDelimiter
+
+This is used to determine whether a delimiter stack combinator is used as part
of the grammar. It is context dependent because a separator or terminator can
be defined on an enclosing model group.
+
+TBD: It's possible we don't need to wrap this combinator around elements.
Manipulating the delimiter stack should only be necessary when delimiters go
into and out of scope. So that would mean once per model group, not per term
within the model group.
+
+==== isScanningForTerminator
+
+This determines whether the schema compiler will allow the entity "%ES;" as a
runtime value for the expression value of a dfdl:terminator property.
+Specifically if we are scanning for a terminator then the "%ES;" entity is
disallowed.
+
+This is context dependent because the terminator may be defined on an
enclosing model group, not on an element that is within that model group.
+The element may have a position in the model group such that it may be last,
and in that case if it has a length kind that does not use the terminator
(e.g., dfdl:lengthKind='pattern') then that terminator is not being scanned
for, and so it is allowable for it to be empty string at runtime.
+
+If this computation is done on the unshared term, it must inquire about many
characteristics of the enclosing model group.
+It may be that this computation is best done on the model-group object.
+
+==== initiatedContent Check
+
+When a model group has the dfdl:initiatedContent="yes" property, all
represented term children must have a defined initiator.
+
+This is context dependent because a global element declaration or a global
group definition may define an initiator, but the model group with initiated
content may refer to these via an element reference or group reference.
+That is, the relationship is not one of lexical enclosure.
+The check must be done as part of the unshared term, or on the enclosing model
group itself.
+
+=== Refactoring of Primary DSOM Classes
+
+Implementing this sharing requires splitting the DSOM Term clases each into a
unique and a shared part.
+
+This will be done by renaming each Term to SharedTerm, and introducing
UniqueTerm.
+
+For example ElementBase will be renamed to SharedElementBase, and new abstract
class UniqueElementBase will be created.
+
+This will occur uniformly across the DSOM Term subclasses/traits.
+
+Some of the mixins to these classes will go on only one or the other of the
shared or unique parts.
+
+Many of the mixins will have to split into a mixin for the shared part and
another mixin for the unique part.
+One example of this is the AlignedMixin. This currently deals with alignment
regions that are found in unique parts and shared parts, and those
+methods must go on different objects.
+
+The unique term classes will utilize a central memoizing factory to allocate
or find shared term instances based on having equivalent PropEnvs.
+
+
+
+
+=== Testing Notes
+
+
+==== Compiler Instrumentation
+
+Some standard compiler instrumentation that is dumped out based on options to
daffodil compiler, to include:
+
+* counters of number of distinct shared terms for each global def/decl.
+* counters of number of inbound references to each shared term.
+
+The point is to have metrics that can be compared and which will make it clear
if under maintenance the sharing is somehow lost and we're getting bad
compile-time behavior again.