Discussion topics around reducing the memory footprint of the Daffodil Schema
Compiler.
Issue: a combinatorial explosion of duplicated objects in the compiler, all of which
have to be compiled. It is a simple mistake in hindsight: because the Daffodil
schema compiler is effectively a big functional program, a copied object compiles
to the same result a shared one would. We just recompute the same thing over and
over.
Fix: stop doing that. Share objects. The maximum number of DSOM objects should be
one-to-one with the number of XML elements, types, and groups in the DFDL schema.
It can be fewer than that given optimizations.
This sharing requires undoing the unique backpointer that objects have to their
referencing context. The lexical parent is still unique, but for a shared object
there is a set of referencing contexts rather than a single containing reference.
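A minimal sketch of the shape of this change (class and field names here are illustrative, not Daffodil's actual DSOM classes): the shared declaration records a set of back-refs, while each reference keeps its unique lexical parent.

```python
class GlobalElementDecl:
    """A shared DSOM-like object: one instance per XML declaration."""
    def __init__(self, name):
        self.name = name
        self.referring_contexts = []  # a set of back-refs, not one unique backpointer

class ElementRef:
    """A referencing context; its lexical parent is still unique."""
    def __init__(self, decl, lexical_parent):
        self.decl = decl
        self.lexical_parent = lexical_parent
        decl.referring_contexts.append(self)

decl = GlobalElementDecl("record")           # created once, shared
ref1 = ElementRef(decl, lexical_parent="groupA")
ref2 = ElementRef(decl, lexical_parent="groupB")

assert ref1.decl is ref2.decl                # one shared object...
assert len(decl.referring_contexts) == 2     # ...with a set of back-refs
```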
Any calculations in the schema compiler that depend on traversing "the" unique
backpointer have to be removed. If you just delete the backpointer, then delete
all the code that breaks, then the code that that removal breaks, and so on, you
will eventually remove ALL the code, because it all boils down to properties,
and today even property resolution uses the backpointer. So property resolution
is job 1 to fix in the compiler.
Any runtime mechanisms that depend on such calculations must also be removed.
E.g., slots; alignment inference (are we pre-aligned based on what is before us,
or do we have to do alignment?); and change parsers/unparsers that set the
encoding into the I/O layer only when the encoding may have changed.
These kinds of analyses would have to be generalized to take all contexts into
account, since shared objects will have multiple back-refs to the contexts that
share them. For the most part this is not worth it.
A side benefit of fixing this sharing in the compiler is that recursive schemas
should work. This is a useful feature (though an extension to DFDL), enabling
description of document formats and other kinds of data with recursive
structure. We will still want a flag to turn this on/off and a check that
insists on non-recursive schemas.
Runtime Changes
* Slots - remove and use QNames
* Note the optimization possible by interning QNames - it would avoid string
comparisons of element names. Namespaces are already interned, so they can be
compared via EQ (identity) comparisons.
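A sketch of the interning idea (illustrative Python; Daffodil's actual QName classes are Scala): constructing an already-seen QName returns the existing instance, so identity comparison replaces string comparison.

```python
import sys

class QName:
    """Interned QName sketch: equal (namespace, local) pairs share one object."""
    _interned = {}

    def __new__(cls, namespace, local):
        key = (sys.intern(namespace), sys.intern(local))
        obj = cls._interned.get(key)
        if obj is None:
            obj = super().__new__(cls)
            obj.namespace, obj.local = key
            cls._interned[key] = obj
        return obj

a = QName("urn:example", "record")
b = QName("urn:example", "record")
assert a is b   # identity comparison suffices; no string compare at runtime
```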
* Unordered sequence combinator can keep state in PState/Ustate enabling
gathering of incoming elements into proper buckets.
* We considered whether QNames alone are enough, or whether a QName plus a
number (to make it unique within a parent element) is needed. We concluded that
just the QName is correct.
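A sketch of the unordered-sequence bucketing idea, with a stand-in for PState (names are illustrative, not Daffodil's runtime API): the per-parse bookkeeping lives in the mutable state object, keyed by QName, so the combinator itself stays shareable.

```python
from collections import defaultdict

class PState:
    """Stand-in for the mutable parse state carrying combinator bookkeeping."""
    def __init__(self):
        self.buckets = defaultdict(list)  # QName -> elements gathered so far

def parse_unordered(state, incoming):
    """Gather incoming elements of an unordered sequence into per-QName buckets."""
    for qname, value in incoming:
        state.buckets[qname].append(value)
    return state.buckets

state = PState()
buckets = parse_unordered(state, [("a", 1), ("b", 2), ("a", 3)])
assert buckets["a"] == [1, 3] and buckets["b"] == [2]
```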
* DPath expressions are effectively lists of upward moves and downward
moves specified by QNames.
* Taylor Wise has code for this almost done.
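A sketch of the up/down-move representation (illustrative only, not Daffodil's actual DPath runtime): the compiled form is just a list of moves, each downward move keyed by a name standing in for a QName, and is independent of where in the tree it is evaluated.

```python
def compile_path(path):
    """Compile a simple relative path into a list of up/down moves."""
    return [("up", None) if step == ".." else ("down", step)
            for step in path.split("/")]

def evaluate(steps, node):
    """Walk the moves from a starting node; 'parent' links stand in for context."""
    for kind, name in steps:
        node = node["parent"] if kind == "up" else node[name]
    return node

root = {"len": 4}
child = {"parent": root}
root["hdr"] = child

steps = compile_path("../len")
assert steps == [("up", None), ("down", "len")]
assert evaluate(steps, child) == 4
```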
* Change-parsers/unparsers - eliminate in favor of I/O layer callback to
PState/UState to obtain properties needed.
* bitOrderChangeUnparser - still needed, because it behaves like an
alignment unparser; but see below - we're getting rid of alignment
parsers/unparsers too.
* Mike Beckerle has this code working, except that bitOrderChangeUnparser
and the alignment unparsers are still present.
* Alignment
* SDE on outputValueCalc (OVC) elements without specified length - so much
complexity arises if the OVC element's length isn't known that it's just not
worth it. A restriction requiring specified length massively simplifies the
unparser implementation. The only suspensions are for the outputValueCalc
elements themselves, whose expressions may block until later infoset objects
arrive at the unparser so their values or lengths can be used in the OVC
calculation.
* If you do this, alignment operations cannot suspend (nor can
bitOrderChangeUnparser)
* Alignment operations can all move to Term and text (mandatory text
alignment) parsers/unparsers.
* Aligning is basically one line of code, so the overhead is negligible.
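The one-line alignment arithmetic, sketched (function name is illustrative): given a bit position and an alignment in bits, compute how many bits to skip to reach the next boundary.

```python
def skip_to_alignment(bit_position, alignment_bits):
    """Bits to skip so that bit_position lands on the next alignment boundary."""
    return (alignment_bits - bit_position % alignment_bits) % alignment_bits

assert skip_to_alignment(0, 8) == 0   # already aligned
assert skip_to_alignment(3, 8) == 5   # skip 5 bits to reach bit 8
assert skip_to_alignment(8, 8) == 0   # boundary itself needs no skip
```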
* Delimiters(?)
* Because delimiters are now pushed on stacks, they aren't incorporated
directly into the definition of a parser. That is, a parser doesn't have to
look outward through its containing separators and terminators, all the way
out, to get the complete set of things that might delimit it. This is done at
runtime by pushing them on a stack as scopes are entered.
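A sketch of the delimiter-stack idea (illustrative, not Daffodil's actual runtime classes): the set of delimiters in scope is just the stack's current contents, so no outward traversal of containing contexts is ever needed.

```python
class DelimiterStack:
    """Delimiters in scope are whatever has been pushed and not yet popped."""
    def __init__(self):
        self._frames = []

    def push(self, delims):
        self._frames.append(delims)   # entering a scope

    def pop(self):
        self._frames.pop()            # leaving a scope

    def in_scope(self):
        return [d for frame in self._frames for d in frame]

stack = DelimiterStack()
stack.push([";"])                     # outer sequence's separator
stack.push([","])                     # inner sequence's separator
assert stack.in_scope() == [";", ","]
stack.pop()                           # leave the inner sequence
assert stack.in_scope() == [";"]
```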
* Smaller things: there are no longer unique paths from the root to every
element in the ERD graph.
* So we can't use those paths as unique identifiers anymore, because ERDs
will now be shared.
Compiler Changes
* DSOM - refactor object model.
* See UML diagram sketch (Attached.)
* The part of this sketch around ModelGroups isn't right. It should be
analogous to the structure for Elements and ElementRefs.
* DSOM - Remove back pointers. Replace with shared objects having set of
back-refs.
* The algorithm to construct all objects and backpointer sets will be a
first pass over everything to create the linked tree with sharing.
* After that, no more DSOM objects will be created.
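A sketch of that single construction pass (data shapes are illustrative): each global declaration is built at most once, memoized by name, and every reference encountered during the walk is recorded in the shared object's back-ref set.

```python
def build(schema):
    """One pass over (ref_site, name) pairs; shared decls get back-ref sets."""
    decls = {}                                    # name -> shared object

    def get_decl(name):
        if name not in decls:                     # construct at most once
            decls[name] = {"name": name, "back_refs": []}
        return decls[name]

    for ref_site, name in schema:                 # single pass over all refs
        get_decl(name)["back_refs"].append(ref_site)
    return decls

decls = build([("ctx1", "record"), ("ctx2", "record"), ("ctx3", "header")])
assert len(decls) == 2                            # shared, not duplicated
assert decls["record"]["back_refs"] == ["ctx1", "ctx2"]
```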
* Property resolution w/o using backpointers
* Right now, elements reach over to the referencing element ref via the
backpointer (if there is one) to grab properties.
* This needs to change so that the ElementRef is the place where properties
are resolved.
* Mike Beckerle - already has this code working - has to separate it
from other unsuccessful changes in a branch.
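A sketch of resolving at the ElementRef (property names are illustrative, and DFDL's actual scoping rules are more involved than this): the ref combines its own properties with the shared declaration's, treating a property specified in both places as an error.

```python
def resolve_properties(ref_props, decl_props):
    """Combine properties at the ElementRef; duplicates are a schema error."""
    overlap = ref_props.keys() & decl_props.keys()
    if overlap:
        raise ValueError(f"property specified in both places: {sorted(overlap)}")
    return {**decl_props, **ref_props}

decl_props = {"encoding": "UTF-8"}
ref_props = {"byteOrder": "littleEndian"}
assert resolve_properties(ref_props, decl_props) == {
    "encoding": "UTF-8", "byteOrder": "littleEndian"}
```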
* Expressions with up-and-out relative paths must be type-checked for all
back-refs.
* This means all back-refs must be in place before such compilation begins.
* The actual compiled representation of the path is invariant with respect
to context location.
* Issue: same path names but different types, requiring different
casts/conversions - or do we say one path means one type signature? There is a
possibility that an up-and-out path would be used polymorphically for different
types, yet still be type-correct in each situation.
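A sketch of the per-back-ref type check (contexts and type names are illustrative): the compiled path is shared, but it must type-check in every referencing context; a polymorphic use is detected when the contexts disagree.

```python
def check_all_contexts(path_result_types, expected):
    """path_result_types: context -> static type the shared path yields there."""
    bad = [ctx for ctx, t in path_result_types.items() if t != expected]
    if bad:
        raise TypeError(f"path ill-typed in contexts: {bad}")
    return True

# The same "../len" path used from two contexts that share the object:
assert check_all_contexts({"ctxA": "xs:int", "ctxB": "xs:int"}, "xs:int")
try:
    check_all_contexts({"ctxA": "xs:int", "ctxB": "xs:string"}, "xs:int")
except TypeError:
    pass  # polymorphic use: needs per-context conversions, or is rejected
```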
* Delimiters (?)
* Hoisting of property-dependent code off of type objects, global element
decls, and global group defs.
* All property-dependent code goes onto Term (or TermCodeGen) class.
* Note: An element cannot ask its complex type for its children and then
generate code for them, as that would just reintroduce all the copying, only
now in the parser/unparser generation phase. Instead, an element must ask for
the shared parsers/unparsers of its type's children.
* This works because complex types and their model groups do not
combine properties with the element referencing them. So they can always be
shared.
* However, this lack of combining for complex types with their
elements is asymmetric with how simple types and elements work, and it has been
proposed to lift this restriction. In that case, one must cache compiled DSOM
objects (parsers, unparsers) keyed on the property environment they are in.
When an object is compiled and the property environment is the same, the
compiled object can be shared, because any differences in the compiler output
must come from the properties.
* Optimization: ElementRef/TypeRef/GroupRef consolidation
* If two elementRefs have the same ref, the same properties, and the same
default properties, then they can share a representation.
* The new schema compiler should not generate element-ref-location-specific
code.
* Similarly for Elements (aka TypeRefs), and GroupRefs.
* This can result in the number of DSOM objects actually being
substantially smaller than the number of XML elements expressed textually in
the DFDL schema.
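A sketch of the consolidation (data shapes are illustrative): refs with the same target and the same resolved properties collapse to one representation, which is how the DSOM object count can drop below the textual XML element count.

```python
def consolidate(refs):
    """refs: (target_name, property mapping) pairs; equal pairs share one rep."""
    unique = {}
    for target, props in refs:
        key = (target, tuple(sorted(props.items())))
        unique.setdefault(key, (target, props))
    return list(unique.values())

refs = [
    ("record", {"byteOrder": "bigEndian"}),
    ("record", {"byteOrder": "bigEndian"}),    # duplicate: consolidated away
    ("record", {"byteOrder": "littleEndian"}),
]
assert len(consolidate(refs)) == 2
```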
* Feature Needed: Compile all elements as roots in one compiled object for
the schema. There is no need to pre-identify the root element before compiling.
* Except elements that have upward-and-out expressions - those cannot be
roots.
* We need to throw specific exceptions that can be caught to disqualify
an element for root use while still allowing its use from element references.
* A Root really behaves like an element reference.
* In the new UML design, ElementRefs and LocalElementDecls can generate
parsers/unparsers, but GlobalElementDecls cannot, so one needs to create a Root
for each GlobalElementDecl to allow it to be used as a root.
* Paths - we can't trace backpointers anymore, so paths cannot be used as
unique identifiers of a context any more.