mbeckerle commented on a change in pull request #17: Design notes on schema compiler space/speed issue
URL: https://github.com/apache/incubator-daffodil-site/pull/17#discussion_r376431267
########## File path: site/dev/design-notes/namespace-binding-minimization.adoc ########## @@ -0,0 +1,341 @@ +:page-layout: page +:keywords: schema-compiler performance alignment optimization +// /////////////////////////////////////////////////////////////////////////// +// +// This file is written in AsciiDoc. +// +// If you can read this comment, your browser is not rendering asciidoc automatically. +// +// You need to install the asciidoc plugin to Chrome or Firefox +// so that this page will be properly rendered for your viewing pleasure. +// +// You can get the plugins by searching the web for 'asciidoc plugin' +// +// You will want to change plugin settings to enable diagrams (they're off by default.) +// +// You need to view this page with Chrome or Firefox. +// +// /////////////////////////////////////////////////////////////////////////// +// +// When editing, please start each sentence on a new line. +// See https://asciidoctor.org/docs/asciidoc-recommended-practices/#one-sentence-per-line[one sentence-per-line writing technique.] +// This makes textual diffs of this file useful in a similar way to the way they work for code. +// +// ////////////////////////////////////////////////////////////////////////// + +== Namespace Binding Minimization + +=== Introduction + +DFDL schemas are XML schemas and so DFDL inherits the namespace system of XML and XML Schema for composing large schemas from smaller ones, for reusing schema files, and for managing name conflicts. + +A DFDL Infoset isn't necessarily represented as XML however. +Some representations won't have any ability to deal with namespaces (JSON for example), and so Daffodil will sometimes issue warnings when compiling a schema if the namespace usage will not allow unambiguous representation without namespaces. + +Most representations of DFDL Infosets will, like XML, use some representation of the namespaces of elements, and in textual forms this will most commonly be by way of namespace prefixes. 
+XML is not the only representation that uses namespaces, however, so this should not be taken as an entirely XML-specific discussion. + +There are these goals for namespace-binding minimization. + +. Clarity: Infosets that have redundant namespace bindings are very hard to read and understand, and require namespace-binding-aware tooling to compare them, or clumsy post-processing to remove the excess bindings. + +. Performance: Attaching an element to the infoset at runtime should take constant time. + +. Consistency: The prefix-to-namespace bindings used should be drawn from those expressed on the DFDL schema by the schema author, and the prefixes used and bindings introduced when an element is attached to the infoset should be consistent with the set of namespace prefix definitions in place at the point where the element's declaration lexically appears in the DFDL schema. + +These goals are in some tension. +Consider 4 elements named A, B, C, and Q. +Suppose element A contains element B, which contains element Q. +Suppose elsewhere in the same infoset element A contains element C which contains element Q. +From the perspective of element Q, the set of namespace bindings surrounding it are those from (A, B) or those from (A, C). +Suppose element Q requires, and introduces, a namespace with prefix "qns" bound to namespace "urn:Q_Namespace". +Suppose element C also introduces this same namespace binding. +Then when element Q appears inside element B, its namespace binding for "qns" is needed. +But when element Q appears inside element C, its namespace binding for "qns" is redundant with one already provided by element C. + +The conclusion is that the minmal set of namespace bindings introduced by an element depends on the nesting of elements. + +The basic technique is to store, on the runtime element data structure (DPathElementCompileInfo), the complete set of lexical namespace bindings present for the element declaration in the DFDL schema document. 
+==== Namespace Bindings come from the Element Declarations, not Element References + +Consider two schema files: + +```xml +<!-- file foo.dfdl.xsd --> +<schema + xmlns:pre1="namespace1" + xmlns:ns1="differentNamespace"> + <import namespace="namespace1" schemaFileLocation="bar.dfdl.xsd"/> + ... + <element name="root"> + <complexType> + <sequence> + <element ref="pre1:foo" maxOccurs="unbounded"/> + </sequence> + </complexType> + </element> + +</schema> + +<!-- file bar.dfdl.xsd --> +<schema targetNamespace="namespace1" + xmlns:ns1="namespace1" + xmlns:pre1="someOtherNamespace"> + + <element name="foo" ..../> +</schema> +``` +In the above we have a conflict over the use of the prefix "pre1". +Now consider an XML document corresponding to this with element 'foo' inside the 'root' element: + +```xml +<root xmlns:pre1="namespace1" + xmlns:ns1="differentNamespace"> + ... + <ns1:foo + xmlns:ns1="namespace1" + xmlns:pre1="someOtherNamespace"> + ... + </ns1:foo> + ... +</root> +``` + +Notice that element 'foo' appears inside 'root' using the "ns1" prefix but it also introduces a binding for that prefix which supercedes that of the enclosing environment. +The prefix "pre1" cannot be used for element 'foo' because in the namespace bindings of the bar.dfdl.xsd schema document, the "pre1" prefix is bound to "someOtherNamespace". + +This example illustrates that each element must use, and introduce, the lexically defined prefixes from the point where the element is declared. +Not from the point of element reference. + +Since element 'foo' is recurring, it's unfortunate, but every single instance will, textualized, carry these namespace bindings. +E.g., + +```xml +<root xmlns:pre1="namespace1" + xmlns:ns1="differentNamespace"> + ... + <ns1:foo + xmlns:ns1="namespace1" + xmlns:pre1="someOtherNamespace"> + ... + </ns1:foo> + <ns1:foo + xmlns:ns1="namespace1" + xmlns:pre1="someOtherNamespace"> + ... 
+ </ns1:foo> + <ns1:foo + xmlns:ns1="namespace1" + xmlns:pre1="someOtherNamespace"> + ... + </ns1:foo> + + ... +</root> +``` + +This problem is not one Daffodil strives to solve. +The schema author can simply avoid these sorts of name clashes and this problem will not occur. +Automatic renaming of prefixes to avoid this problem is considered unwarranted, as it will confuse users. + + + +=== Namespace Minimization + +==== Only Element Namespace Prefix Bindings + +Only namespace definitions associated with element declarations need to ever be considered for the infoset. +Namespace definitions that define prefixes used for type, group, format, or escapeScheme references are not included +in the namespace definitions carried on infoset elements. + +==== Avoid Prefix "tns" (or Other Common Ambiguous Names) When Possible + +Many DFDL schemas will define prefix "tns" to be that schema document's target namespace. + +This same problem could occur for other prefixes. The "tns" convention is just a common one. + +If the prefix "tns" is ambiguous across the schema set (also used by other schema documents, but for different namespaces), +then its use is undesirable. + +If a schema document defines both "tns" and other prefixes for the target namespace, then another prefix is preferred for +use as the prefix of elements created from declarations in that schema document. + +This situation arises commonly for the default namespace (no prefix, defined by `xmlns="namespaceURI"`). If +this is ambiguous across the schema set (highly likely), then an available alternative prefix (from that schema document) +is preferred. +There is actually no difference between using "tns" and the default namespace. Both are just commonly used, and frequently ambiguous across the schema set. + +(This all generalizes to any prefix which is ambiguous across the schema set.) 
+ +==== Corner Cases + +There are numerous ways schema authors can use and reuse namespace prefixes that can lead to cluttered infosets. + +Other than minor heuristics to choose among alternative available prefix definitions, Daffodil does not try to improve on the +namespace prefix problem on behalf of schema authors. + +===== No Prefix At All +A legal schema document can define a target namespace and define no prefix for it at all. + +In this case, the only way elements of that schema document can be used is some other schema document must provide a prefix definition. +Daffodil chooses a prefix from those available in the schema set (deterministically - e.g., shortest prefix, with ties resolved by alphanumeric order, avoiding ambiguous prefixes like "tns"). +

Review comment:
The above won't work if the schema defines targetNamespace="target1" and no prefix for the target namespace, but does provide a namespace binding xmlns:foo="someNamespace".
Then, outside, another schema imports this one but defines xmlns:foo="target1".
We can't use the namespace prefix "foo:" for the imported schema's elements because it conflicts with the binding of "foo" inside that schema.
This isn't likely, but it could happen.
So we have to check for this case and either fail, generate our own prefix, or just SDE on this whole class of issue, i.e., just say no to schemas with target namespaces that have no prefix.
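
To make the conflict concrete, here is a minimal sketch of the two schema documents described in the review comment.
The file names inner.dfdl.xsd and outer.dfdl.xsd and the element name "bar" are hypothetical, chosen only for illustration, and the default-namespace binding to the XML Schema namespace is added just so the fragments are well-formed.
The namespaces "target1" and "someNamespace" and the prefix "foo" are the ones from the comment.

```xml
<!-- file inner.dfdl.xsd (hypothetical name) -->
<!-- Declares a target namespace but binds no prefix to it. -->
<!-- The only prefix it defines, "foo", is bound to a different namespace. -->
<schema xmlns="http://www.w3.org/2001/XMLSchema"
  targetNamespace="target1"
  xmlns:foo="someNamespace">

  <element name="bar" type="string"/>
</schema>

<!-- file outer.dfdl.xsd (hypothetical name) -->
<!-- Imports inner.dfdl.xsd and binds "foo" to the inner target namespace. -->
<schema xmlns="http://www.w3.org/2001/XMLSchema"
  xmlns:foo="target1">
  <import namespace="target1" schemaLocation="inner.dfdl.xsd"/>

  <element name="root">
    <complexType>
      <sequence>
        <element ref="foo:bar"/>
      </sequence>
    </complexType>
  </element>
</schema>
```

In this sketch the only prefix for "target1" available anywhere in the schema set is "foo", taken from outer.dfdl.xsd.
Using it for element 'bar' would mean emitting xmlns:foo="target1" on an element whose declaration lexically sees "foo" bound to "someNamespace", which is the inconsistency the check would need to catch.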
