Discussion topics around reducing the memory footprint of the Daffodil Schema
Compiler.
Issue: a combinatorial explosion of duplicated objects in the compiler, all of which
have to be compiled. It is a simple mistake in hindsight: because the Daffodil
schema compiler is effectively a big functional program, a copied object compiles
to the same result a shared one would. We just recompute the same thing over and
over.
Fix: stop doing that. Share objects. The maximum number of DSOM objects should be
one-to-one with the number of XML elements, types, and groups in the DFDL schema.
It can be fewer than that given optimizations.
This sharing requires undoing the unique backpointer that objects have to their
referencing context. The lexical parent is still unique, but for a shared object
there is a set of referencing contexts rather than a single containing reference.
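A minimal sketch of the shape of this change (class and field names here are illustrative, not Daffodil's actual DSOM classes): the shared declaration records a set of back-refs, while each reference keeps its unique lexical parent.

```python
class GlobalElementDecl:
    """A shared DSOM-like object: one instance per XML declaration."""
    def __init__(self, name):
        self.name = name
        self.referring_contexts = []  # a set of back-refs, not one unique backpointer

class ElementRef:
    """A referencing context; its lexical parent is still unique."""
    def __init__(self, decl, lexical_parent):
        self.decl = decl
        self.lexical_parent = lexical_parent
        decl.referring_contexts.append(self)

decl = GlobalElementDecl("record")           # created once, shared
ref1 = ElementRef(decl, lexical_parent="groupA")
ref2 = ElementRef(decl, lexical_parent="groupB")

assert ref1.decl is ref2.decl                # one shared object...
assert len(decl.referring_contexts) == 2     # ...with a set of back-refs
```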
Any calculations in the schema compiler that depend on traversing "the" unique
backpointer have to be removed. If you just delete the backpointer, then delete
all the code that breaks, then the code that that removal breaks, and so on, you
will eventually remove ALL the code, because it all boils down to properties,
and today even property resolution uses the backpointer. So property resolution
is job 1 to fix in the compiler.
Any runtime mechanisms that depend on such calculations must also be removed.
E.g., slots; alignment inference (are we pre-aligned based on what is before us,
or do we have to do alignment?); and change parsers/unparsers that set the
encoding into the I/O layer only when the encoding may have changed.
These kinds of analyses would have to be generalized to take all contexts into
account, since shared objects will have multiple back-refs to the contexts that
share them. For the most part this is not worth it.
A side benefit of fixing this sharing in the compiler is that recursive schemas
should work. This is a useful feature (though an extension to DFDL), enabling
description of document formats and other kinds of data with recursive
structure. We will still want a flag to turn this on/off and a check that
insists on non-recursive schemas.
Runtime Changes
* Slots - remove and use QNames
* Note the optimization possible by interning QNames - it would avoid string
comparisons of element names. Namespaces are already interned, so they can be
compared via EQ (identity) comparisons.
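A sketch of the interning idea (illustrative Python; Daffodil's actual QName classes are Scala): constructing an already-seen QName returns the existing instance, so identity comparison replaces string comparison.

```python
import sys

class QName:
    """Interned QName sketch: equal (namespace, local) pairs share one object."""
    _interned = {}

    def __new__(cls, namespace, local):
        key = (sys.intern(namespace), sys.intern(local))
        obj = cls._interned.get(key)
        if obj is None:
            obj = super().__new__(cls)
            obj.namespace, obj.local = key
            cls._interned[key] = obj
        return obj

a = QName("urn:example", "record")
b = QName("urn:example", "record")
assert a is b   # identity comparison suffices; no string compare at runtime
```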
* Unordered sequence combinator can keep state in PState/Ustate enabling
gathering of incoming elements into proper buckets.
* We considered whether QNames alone are enough, or whether a QName plus a
number (to make it unique within a parent element) is needed. We concluded that
just the QName is correct.
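A sketch of the unordered-sequence bucketing idea, with a stand-in for PState (names are illustrative, not Daffodil's runtime API): the per-parse bookkeeping lives in the mutable state object, keyed by QName, so the combinator itself stays shareable.

```python
from collections import defaultdict

class PState:
    """Stand-in for the mutable parse state carrying combinator bookkeeping."""
    def __init__(self):
        self.buckets = defaultdict(list)  # QName -> elements gathered so far

def parse_unordered(state, incoming):
    """Gather incoming elements of an unordered sequence into per-QName buckets."""
    for qname, value in incoming:
        state.buckets[qname].append(value)
    return state.buckets

state = PState()
buckets = parse_unordered(state, [("a", 1), ("b", 2), ("a", 3)])
assert buckets["a"] == [1, 3] and buckets["b"] == [2]
```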
* DPath expressions are effectively lists of upward moves and downward
moves specified by QNames.
* Taylor Wise has code for this almost done.
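A sketch of the up/down-move representation (illustrative only, not Daffodil's actual DPath runtime): the compiled form is just a list of moves, each downward move keyed by a name standing in for a QName, and is independent of where in the tree it is evaluated.

```python
def compile_path(path):
    """Compile a simple relative path into a list of up/down moves."""
    return [("up", None) if step == ".." else ("down", step)
            for step in path.split("/")]

def evaluate(steps, node):
    """Walk the moves from a starting node; 'parent' links stand in for context."""
    for kind, name in steps:
        node = node["parent"] if kind == "up" else node[name]
    return node

root = {"len": 4}
child = {"parent": root}
root["hdr"] = child

steps = compile_path("../len")
assert steps == [("up", None), ("down", "len")]
assert evaluate(steps, child) == 4
```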
* Change-parsers/unparsers - eliminate in favor of I/O layer callback to
PState/UState to obtain properties needed.
* bitOrderChangeUnparser - still needed, because it behaves like an
alignment unparser; but see below - we're getting rid of alignment
parsers/unparsers too.
* Mike Beckerle has this code working, except that bitOrderChangeUnparser
and the alignment unparsers are still present.
* Alignment
* SDE on outputValueCalc (OVC) elements without specified length - so much
complexity arises if the OVC element's length isn't known that it's just not
worth it. A restriction requiring specified length massively simplifies the
unparser implementation. The only suspensions are for the outputValueCalc
elements themselves, whose expressions may block until later infoset objects
arrive at the unparser so their values or lengths can be used in the OVC
calculation.
* If you do this, alignment operations cannot suspend (nor can
bitOrderChangeUnparser)
* Alignment operations can all move to Term and text (mandatory text
alignment) parsers/unparsers.
* Aligning is basically one line of code, so the overhead is negligible.
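The one-line alignment arithmetic, sketched (function name is illustrative): given a bit position and an alignment in bits, compute how many bits to skip to reach the next boundary.

```python
def skip_to_alignment(bit_position, alignment_bits):
    """Bits to skip so that bit_position lands on the next alignment boundary."""
    return (alignment_bits - bit_position % alignment_bits) % alignment_bits

assert skip_to_alignment(0, 8) == 0   # already aligned
assert skip_to_alignment(3, 8) == 5   # skip 5 bits to reach bit 8
assert skip_to_alignment(8, 8) == 0   # boundary itself needs no skip
```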
* Delimiters(?)
* Because delimiters are now pushed on stacks, they aren't incorporated
directly into the definition of a parser. That is, a parser doesn't have to
look outward through its containing separators and terminators, all the way
out, to get the complete set of things that might delimit it. This is done at
runtime by pushing them on a stack as scopes are entered.
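A sketch of the delimiter-stack idea (illustrative, not Daffodil's actual runtime classes): the set of delimiters in scope is just the stack's current contents, so no outward traversal of containing contexts is ever needed.

```python
class DelimiterStack:
    """Delimiters in scope are whatever has been pushed and not yet popped."""
    def __init__(self):
        self._frames = []

    def push(self, delims):
        self._frames.append(delims)   # entering a scope

    def pop(self):
        self._frames.pop()            # leaving a scope

    def in_scope(self):
        return [d for frame in self._frames for d in frame]

stack = DelimiterStack()
stack.push([";"])                     # outer sequence's separator
stack.push([","])                     # inner sequence's separator
assert stack.in_scope() == [";", ","]
stack.pop()                           # leave the inner sequence
assert stack.in_scope() == [";"]
```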
* Smaller things: there are no longer unique paths from the root to every
element in the ERD graph.
* So we can't use those paths as unique identifiers anymore, because ERDs
will now be shared.
Compiler Changes
* DSOM - refactor object model.
* See UML diagram sketch (Attached.)
* The part of this sketch around ModelGroups isn't right. It should be
analogous to the structure for Elements and ElementRefs.
* DSOM - Remove back pointers. Replace with shared objects having set of
back-refs.
* The algorithm to construct all objects and backpointer sets will be a
first pass over everything to create the linked tree with sharing.
* After that, no more DSOM objects will be created.
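A sketch of that single construction pass (data shapes are illustrative): each global declaration is built at most once, memoized by name, and every reference encountered during the walk is recorded in the shared object's back-ref set.

```python
def build(schema):
    """One pass over (ref_site, name) pairs; shared decls get back-ref sets."""
    decls = {}                                    # name -> shared object

    def get_decl(name):
        if name not in decls:                     # construct at most once
            decls[name] = {"name": name, "back_refs": []}
        return decls[name]

    for ref_site, name in schema:                 # single pass over all refs
        get_decl(name)["back_refs"].append(ref_site)
    return decls

decls = build([("ctx1", "record"), ("ctx2", "record"), ("ctx3", "header")])
assert len(decls) == 2                            # shared, not duplicated
assert decls["record"]["back_refs"] == ["ctx1", "ctx2"]
```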
* Property resolution w/o using backpointers
* Right now, elements reach over to the referencing element ref via the
backpointer (if there is one) to grab properties.
* This needs to change so that the ElementRef is the place where properties
are resolved.
* Mike Beckerle - already has this code working - has to separate it
from other unsuccessful changes in a branch.
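A sketch of resolving at the ElementRef (property names are illustrative, and DFDL's actual scoping rules are more involved than this): the ref combines its own properties with the shared declaration's, treating a property specified in both places as an error.

```python
def resolve_properties(ref_props, decl_props):
    """Combine properties at the ElementRef; duplicates are a schema error."""
    overlap = ref_props.keys() & decl_props.keys()
    if overlap:
        raise ValueError(f"property specified in both places: {sorted(overlap)}")
    return {**decl_props, **ref_props}

decl_props = {"encoding": "UTF-8"}
ref_props = {"byteOrder": "littleEndian"}
assert resolve_properties(ref_props, decl_props) == {
    "encoding": "UTF-8", "byteOrder": "littleEndian"}
```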
* Expressions with up-and-out relative paths must be type-checked for all
back-refs.
* This means all back-refs must be in place before such compilation begins.
* The actual compiled representation of the path is invariant with respect
to context location.
* Issue: same path names but different types, requiring different
casts/conversions - or do we say one path means one type signature? There is a
possibility that an up-and-out path would be used polymorphically for different
types, yet still be type-correct in each situation.
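A sketch of the per-back-ref type check (contexts and type names are illustrative): the compiled path is shared, but it must type-check in every referencing context; a polymorphic use is detected when the contexts disagree.

```python
def check_all_contexts(path_result_types, expected):
    """path_result_types: context -> static type the shared path yields there."""
    bad = [ctx for ctx, t in path_result_types.items() if t != expected]
    if bad:
        raise TypeError(f"path ill-typed in contexts: {bad}")
    return True

# The same "../len" path used from two contexts that share the object:
assert check_all_contexts({"ctxA": "xs:int", "ctxB": "xs:int"}, "xs:int")
try:
    check_all_contexts({"ctxA": "xs:int", "ctxB": "xs:string"}, "xs:int")
except TypeError:
    pass  # polymorphic use: needs per-context conversions, or is rejected
```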
* Delimiters (?)
* Hoisting of property-dependent code off of type objects, global element
decls, and global group defs.
* All property-dependent code goes onto Term (or TermCodeGen) class.
* Note: An element cannot ask its complex type for its children and then
generate code for them, as that would just reintroduce all the copying, only
now in the parser/unparser generation phase. Instead, an element must ask for
the shared parsers/unparsers of its type's children.
* This works because complex types and their model groups do not
combine properties with the element referencing them. So they can always be
shared.
* However, this lack of combining for complex types with their
elements is asymmetric with how simple types and elements work, and it has been
proposed to lift this restriction. In that case, one must cache compiled DSOM
objects (parsers, unparsers) keyed on the property environment they are in.
When an object is compiled and the property environment is the same, the
compiled object can be shared, because any differences in the compiler output
must come from the properties.
* Optimization: ElementRef/TypeRef/GroupRef consolidation
* If two elementRefs have the same ref, the same properties, and the same
default properties, then they can share a representation.
* The new schema compiler should not generate element-ref-location-specific
code.
* Similarly for Elements (aka TypeRefs), and GroupRefs.
* This can result in the number of DSOM objects actually being
substantially smaller than the number of XML elements expressed textually in
the DFDL schema.
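A sketch of the consolidation (data shapes are illustrative): refs with the same target and the same resolved properties collapse to one representation, which is how the DSOM object count can drop below the textual XML element count.

```python
def consolidate(refs):
    """refs: (target_name, property mapping) pairs; equal pairs share one rep."""
    unique = {}
    for target, props in refs:
        key = (target, tuple(sorted(props.items())))
        unique.setdefault(key, (target, props))
    return list(unique.values())

refs = [
    ("record", {"byteOrder": "bigEndian"}),
    ("record", {"byteOrder": "bigEndian"}),    # duplicate: consolidated away
    ("record", {"byteOrder": "littleEndian"}),
]
assert len(consolidate(refs)) == 2
```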
* Feature Needed: Compile all elements as roots in one compiled object for
the schema. There is no need to pre-identify the root element before compiling.
* Except elements that have upward-and-out expressions - those cannot be
roots.
* We need to throw specific exceptions that can be caught to disqualify
an element for root use while still allowing its use from element references.
* A Root really behaves like an element reference.
* In the new UML design, ElementRefs and LocalElementDecls can generate
parsers/unparsers, but GlobalElementDecls cannot, so one needs to create a Root
for each GlobalElementDecl to allow it to be used as a root.
* Paths - we can't trace backpointers anymore, so paths cannot be used as
unique identifiers of a context any more.