tuxji commented on code in PR #121:
URL: https://github.com/apache/daffodil-site/pull/121#discussion_r1419719312

##########
site/dev/design-notes/Proposed-DFDL-Standard-Profile.md:
##########
@@ -0,0 +1,214 @@
+# Proposal: DFDL Standard Profile
+
+#### Version 0.2 2023-10-18

Review Comment:
   Should you update the date since you changed the file today?

##########
site/dev/design-notes/Proposed-DFDL-Standard-Profile.md:
##########
@@ -0,0 +1,214 @@
+# Proposal: DFDL Standard Profile
+
+#### Version 0.2 2023-10-18
+
+## Introduction
+
+In attempting to integrate Apache Daffodil with other data processing software, the need to make
+DFDL schemas interoperate properly in conjunction with other data models has arisen.
+
+Other tools such as Apache NiFi, Apache Drill, Apache Spark, etc. have data models which are
+powerful, but not as expressive as DFDL.
+
+DFDL's data model is a simplification of XML Schemas's PSVI; however, even this causes problems.

Review Comment:
   I had to google to find out that PSVI means "post schema validation infoset". I suggest you use these words, add PSVI in parentheses following it, and hyperlink PSVI to https://www.w3.org/XML/2002/05/psvi-use-cases which gives an idea of what PSVI means.

##########
site/dev/design-notes/Proposed-DFDL-Standard-Profile.md:
##########
@@ -0,0 +1,214 @@
+# Proposal: DFDL Standard Profile
+
+#### Version 0.2 2023-10-18
+
+## Introduction
+
+In attempting to integrate Apache Daffodil with other data processing software, the need to make
+DFDL schemas interoperate properly in conjunction with other data models has arisen.
+
+Other tools such as Apache NiFi, Apache Drill, Apache Spark, etc. have data models which are
+powerful, but not as expressive as DFDL.
+
+DFDL's data model is a simplification of XML Schemas's PSVI; however, even this causes problems.
+Most other data processing systems were not designed with markup languages in mind, but rather for
+structured data.
+
+The following things are allowed in DFDL v1.0, but are difficult to map into most data models:
+
+- anonymous choices
+- duplicate element child names
+- namespaces that are different, but where the prefixes are not unique
+- global names for element children
+
+A more restrictive subset of DFDL, a _standard profile_, is needed which can be enforced (on
+request) to ensure that DFDL schemas will be usable with a variety of data processing systems.
+Creating DFDL schemas that adhere to this standard profile ensures maximal interoperability,
+including the ability to convert into JSON without name/namespace collisions.
+
+This is a proposal for a switch/option to be added to Daffodil which turns on enforcement of this
+standard profile, aka subset of DFDL.
+
+## Standard Profile Restrictions
+
+### No Anonymous Choices
+
+Choices must be the model groups of complex type definitions and are not allowed in any other
+context.
+
+Each choice branch must begin with a different element. (This is already a XML Schema requirement -
+Unique Particle Attribution.)
+
+### Group References Cannot Carry DFDL Properties
+
+Group references are allowed, but DFDL properties cannot be expressed on group references; hence,
+combining those properties with those of the group definition is not required.
+
+While most data structure systems do not have this notion of reusable groups, when restricted as
+described, reusable groups are something users could implement by way of a simple macro
+pre-processor, so having this in the standard profile really does not create any particular
+challenge when mapping from DFDL standard profile schemas into any data structure system.
+Groups and group references are used heavily in DFDL schemas to push down complexity like
+discriminators that are reused in many places.

Review Comment:
   I became concerned when I saw discriminators mentioned since you had just said DFDL properties cannot be expressed on group references. You may want to add a clarification that discriminators are DFDL statements and therefore allowed in groups but not allowed in group references, or whatever it is you meant when you mentioned discriminators.

##########
site/dev/design-notes/Proposed-DFDL-Standard-Profile.md:
##########
@@ -0,0 +1,214 @@
[...]
+Groups and group references are used heavily in DFDL schemas to push down complexity like
+discriminators that are reused in many places.
+Allowing groups and group references reduces the difficulty of converting many large DFDL
+schemas to conform to the standard profile.
+
+### No Element References
+
+There is no corresponding form of sharing in most data structure systems.
+
+### No Namespace-Qualified Names
+
+Only elementFormDefault 'unqualified' is allowed.
+Note that this is the default for XML Schema and DFDL.
+
+### Unique Namespace Prefixes
+
+All namespace prefixes must be unique in the entire schema.
+
+This enables one to create unique identifiers by concatenating prefix_local to create global names.
+
+### All Element Children Have Unique Names
+
+All children element declarations must have unique names within their enclosing parent element.
+
+#### Discussion
+Note that this causes issues in a number of large DFDL schemas (e.g, VMF) which attempt to implement a
+single DFDL schema that is capable of handling multiple versions of the data format.
+
+In this case, the schema uses a construct like:
+```xml
+<choice>

Review Comment:
   Does this snippet omit any attributes which tell Daffodil which choice branch to take, or is the schema relying on Daffodil getting a parse error in hdr_version_C_type, backtracking to the choice, and parsing hdr_version_D_type instead? You may want to clarify which is the case in the sentence above introducing the snippet.

##########
site/dev/design-notes/Proposed-DFDL-Standard-Profile.md:
##########
@@ -0,0 +1,214 @@
[...]
+#### Discussion
+Note that this causes issues in a number of large DFDL schemas (e.g, VMF) which attempt to implement a
+single DFDL schema that is capable of handling multiple versions of the data format.
+
+In this case, the schema uses a construct like:
+```xml
+<choice>
+  <sequence>
+    <element name="C" type="zString"/>
+    <element name="hdr" type="hdr_version_C_type"/>
+  </sequence>
+  <sequence>
+    <element name="D" type="zString"/>
+    <element name="hdr" type="hdr_version_D_type"/>
+  </sequence>
+</choice>
+```
+In the above, you can see that there are two separate element declarations named hdr, of different types.
+This allows common sub-structure that is the same in versions C and D to be addressed by path expressions that are
+polymorphic. They do not have a path step component that identifies the version.
+
+However, if we require all children to have unique names, then this would have to be elements with distinct names on each
+branch such as hdrC and hdrD, and then paths, even those reaching sub-fields that are common to both versions,
+have path steps that are specifically requesting a particular version.
+
+This is a bit painful particularly if there are many expressions that need to reference into the common fields, because
+all such expressions would need to be duplicated for version C and version D.
+
+An alternative solution is that this could be overcome by a way of creating path expressions with wildcards in
+them eg., ".../hdr*/...".
+An extension of this kind in DFDL has already been proposed/discussed some time ago by the DFDL workgroup, but has
+not yet turned into a formal proposal. (The DFDL4Space implementation by the ESA has a kind of wildcard feature like
+this as a DFDL extension.)
+
+Another way of addressing this is to put the version distinction at precisely each point of difference between the
+schema versions.
+This is, however, not the way some large schemas were created, as these schemas are machine generated from the
+individual format specifications. The generator is not aware of the individual differences between the versions,
+but only that there are *some* differences between them.
+It requires a more sophisticated schema generator to compute these fine-level diffs between the two schemas.
+
+
+### Nillable Simple Types Only (TBD: May not be necessary)
+
+Nillable is allowed only for simple type elements.
+
+### Element Name/Identifier Restrictions
+
+Element names must consist of all non-whitespace characters from the
+Unicode basic multilingual plane (no surrogate pairs in element names).
+
+They may not contain any control characters (Uncode class Cc) may not contain various punctuation
+characters (Uncode class Ps, Pe, Pd, Pc, Pf, Pi, Po, nor $).
+
+This is a lowest-common denominator of identifier rules intended to allow
+DFDL schema identifiers to be mapped into ANY programming language or
+structure declaration language, while at the same time allowing use of
+Unicode characters.
+
+Element names may not begin with a digit.
+
+Users are encourated to use "A-Za-Z0-9" only as some systems may not allow use of unicode in
+identifiers.
+
+Element names may not begin with any prefix defined as part of the schema
+followed by an "_" as this could be ambiguous with names being made globally
+unique by appending prefix, "_" and local name.
+
+### String Content Restrictions
+
+Schemas may only be written in UTF-8 encoding.
+
+The DFDL property dfdl:utf16Width must be 'fixed'.
+
+### Import `schemaLocation`
+
+Imported files - a single unique file must be used when importing a namespace.
+A single schema may not contain different import statements for the same
+namespace but which specify different files.
+
+This is a practical requirement in Apache Daffodil today, but should be made explicit.
+
+The `schemaLocation` - if it begins with a "/" it is interpreted as an absolute path, otherwise a

Review Comment:
   *If the `schemaLocation` begins with...
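
It might also help readers if the Import `schemaLocation` section showed the one-file-per-namespace rule with a small example. A minimal sketch of what I have in mind, assuming a made-up namespace `urn:example:common`, made-up file names, and an `xs` prefix bound to the XML Schema namespace (none of these come from the PR):

```xml
<!-- Hypothetical fragment: the namespace URI and file names are invented for illustration. -->

<!-- As I read the proposed rule, not allowed: the same namespace imported from two different files. -->
<xs:import namespace="urn:example:common" schemaLocation="/com/example/common-v1.dfdl.xsd"/>
<xs:import namespace="urn:example:common" schemaLocation="/com/example/common-v2.dfdl.xsd"/>

<!-- Allowed: every import of the namespace names the same single file. -->
<xs:import namespace="urn:example:common" schemaLocation="/com/example/common.dfdl.xsd"/>
```

Feel free to ignore this if you would rather keep the section prose-only.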
##########
site/dev/design-notes/Proposed-DFDL-Standard-Profile.md:
##########
@@ -0,0 +1,214 @@
[...]
+### Import `schemaLocation`
+
+Imported files - a single unique file must be used when importing a namespace.
+A single schema may not contain different import statements for the same
+namespace but which specify different files.

Review Comment:
   Omit "but"

##########
site/dev/design-notes/Proposed-DFDL-Standard-Profile.md:
##########
@@ -0,0 +1,214 @@
[...]
+This is a proposal for a switch/option to be added to Daffodil which turns on enforcement of this
+standard profile, aka subset of DFDL.

Review Comment:
   Better worded as: standard profile as a subset of DFDL.

##########
site/dev/design-notes/Proposed-DFDL-Standard-Profile.md:
##########
@@ -0,0 +1,214 @@
[...]
+### All Element Children Have Unique Names
+
+All children element declarations must have unique names within their enclosing parent element.
+
+#### Discussion

Review Comment:
   Insert a blank line after this line to avoid a markdownlint warning.

##########
site/dev/design-notes/Proposed-DFDL-Standard-Profile.md:
##########
@@ -0,0 +1,214 @@
[...]
+#### Discussion
+Note that this causes issues in a number of large DFDL schemas (e.g, VMF) which attempt to implement a

Review Comment:
   Do you want to spell out VMF and/or hyperlink VMF to its github repo so people can look at it?

##########
site/dev/design-notes/Proposed-DFDL-Standard-Profile.md:
##########
@@ -0,0 +1,214 @@
[...]
+Another way of addressing this is to put the version distinction at precisely each point of difference between the
+schema versions.
+This is, however, not the way some large schemas were created, as these schemas are machine generated from the
+individual format specifications. The generator is not aware of the individual differences between the versions,
+but only that there are *some* differences between them.
+It requires a more sophisticated schema generator to compute these fine-level diffs between the two schemas. + + +### Nillable Simple Types Only (TBD: May not be necessary) + +Nillable is allowed only for simple type elements. + +### Element Name/Identifier Restrictions + +Element names must consist of all non-whitespace characters from the +Unicode basic multilingual plane (no surrogate pairs in element names). + +They may not contain any control characters (Uncode class Cc) may not contain various punctuation Review Comment: *[and] may not.... ########## site/dev/design-notes/Proposed-DFDL-Standard-Profile.md: ########## @@ -0,0 +1,214 @@ +# Proposal: DFDL Standard Profile + +#### Version 0.2 2023-10-18 + +## Introduction + +In attempting to integrate Apache Daffodil with other data processing software, the need to make +DFDL schemas interoperate properly in conjunction with other data models has arisen. + +Other tools such as Apache NiFi, Apache Drill, Apache Spark, etc. have data models which are +powerful, but not as expressive as DFDL. + +DFDL's data model is a simplification of XML Schemas's PSVI; however, even this causes problems. +Most other data processing systems were not designed with markup languages in mind, but rather for +structured data. + +The following things are allowed in DFDL v1.0, but are difficult to map into most data models: + +- anonymous choices +- duplicate element child names +- namespaces that are different, but where the prefixes are not unique +- global names for element children + +A more restrictive subset of DFDL, a _standard profile_, is needed which can be enforced (on +request) to ensure that DFDL schemas will be usable with a variety of data processing systems. +Creating DFDL schemas that adhere to this standard profile ensures maximal interoperability, +including the ability to convert into JSON without name/namespace collisions. 
+ +This is a proposal for a switch/option to be added to Daffodil which turns on enforcement of this +standard profile, aka subset of DFDL. + +## Standard Profile Restrictions + +### No Anonymous Choices + +Choices must be the model groups of complex type definitions and are not allowed in any other +context. + +Each choice branch must begin with a different element. (This is already a XML Schema requirement - +Unique Particle Attribution.) + +### Group References Cannot Carry DFDL Properties + +Group references are allowed, but DFDL properties cannot be expressed on group references; hence, +combining those properties with those of the group definition is not required. + +While most data structure systems do not have this notion of reusable groups, when restricted as +described, reusable groups are something users could implement by way of a simple macro +pre-processor, so having this in the standard profile really does not create any particular +challenge when mapping from DFDL standard profile schemas into any data structure system. +Groups and group references are used heavily in DFDL schemas to push down complexity like +discriminators that are reused in many places. +Allowing groups and group references reduces the difficulty of converting many large DFDL +schemas to conform to the standard profile. + +### No Element References + +There is no corresponding form of sharing in most data structure systems. + +### No Namespace-Qualified Names + +Only elementFormDefault 'unqualified' is allowed. +Note that this is the default for XML Schema and DFDL. + +### Unique Namespace Prefixes + +All namespace prefixes must be unique in the entire schema. + +This enables one to create unique identifiers by concatenating prefix_local to create global names. + +### All Element Children Have Unique Names + +All children element declarations must have unique names within their enclosing parent element. 
+ +#### Discussion +Note that this causes issues in a number of large DFDL schemas (e.g, VMF) which attempt to implement a +single DFDL schema that is capable of handling multiple versions of the data format. + +In this case, the schema uses a construct like: +```xml +<choice> + <sequence> + <element name="C" type="zString"/> + <element name="hdr" type="hdr_version_C_type"/> + </sequence> + <sequence> + <element name="D" type="zString"/> + <element name="hdr" type="hdr_version_D_type"/> + </sequence> +</choice> +``` +In the above, you can see that there are two separate element declarations named hdr, of different types. +This allows common sub-structure that is the same in versions C and D to be addressed by path expressions that are +polymorphic. They do not have a path step component that identifies the version. + +However, if we require all children to have unique names, then this would have to be elements with distinct names on each +branch such as hdrC and hdrD, and then paths, even those reaching sub-fields that are common to both versions, +have path steps that are specifically requesting a particular version. + +This is a bit painful particularly if there are many expressions that need to reference into the common fields, because +all such expressions would need to be duplicated for version C and version D. + +An alternative solution is that this could be overcome by a way of creating path expressions with wildcards in +them eg., ".../hdr*/...". +An extension of this kind in DFDL has already been proposed/discussed some time ago by the DFDL workgroup, but has +not yet turned into a formal proposal. (The DFDL4Space implementation by the ESA has a kind of wildcard feature like +this as a DFDL extension.) + +Another way of addressing this is to put the version distinction at precisely each point of difference between the +schema versions. 
+This is, however, not the way some large schemas were created, as these schemas are machine generated from the +individual format specifications. The generator is not aware of the individual differences between the versions, +but only that there are *some* differences between them. +It requires a more sophisticated schema generator to compute these fine-level diffs between the two schemas. + Review Comment: Yet another way might be possible if you use choice dispatch keys, which simplifies knowing which choice branch to take in every place without relying on backtracking. You could increase the number of choice elements and make them finer-grained so that the elements `C` and `D` are in one choice, the common fields in `hdr_version_*_type` are not in any choice, and then any different fields in `hdr_version_*_type` are in their own choices as well. If you need to reuse common definitions in multiple places, you can put them in groups and reference them with group references as well. ########## site/dev/design-notes/Proposed-DFDL-Standard-Profile.md: ########## @@ -0,0 +1,214 @@ +# Proposal: DFDL Standard Profile + +#### Version 0.2 2023-10-18 + +## Introduction + +In attempting to integrate Apache Daffodil with other data processing software, the need to make +DFDL schemas interoperate properly in conjunction with other data models has arisen. + +Other tools such as Apache NiFi, Apache Drill, Apache Spark, etc. have data models which are +powerful, but not as expressive as DFDL. + +DFDL's data model is a simplification of XML Schemas's PSVI; however, even this causes problems. +Most other data processing systems were not designed with markup languages in mind, but rather for +structured data. 
+ +The following things are allowed in DFDL v1.0, but are difficult to map into most data models: + +- anonymous choices +- duplicate element child names +- namespaces that are different, but where the prefixes are not unique +- global names for element children + +A more restrictive subset of DFDL, a _standard profile_, is needed which can be enforced (on +request) to ensure that DFDL schemas will be usable with a variety of data processing systems. +Creating DFDL schemas that adhere to this standard profile ensures maximal interoperability, +including the ability to convert into JSON without name/namespace collisions. + +This is a proposal for a switch/option to be added to Daffodil which turns on enforcement of this +standard profile, aka subset of DFDL. + +## Standard Profile Restrictions + +### No Anonymous Choices + +Choices must be the model groups of complex type definitions and are not allowed in any other +context. + +Each choice branch must begin with a different element. (This is already a XML Schema requirement - +Unique Particle Attribution.) + +### Group References Cannot Carry DFDL Properties + +Group references are allowed, but DFDL properties cannot be expressed on group references; hence, +combining those properties with those of the group definition is not required. + +While most data structure systems do not have this notion of reusable groups, when restricted as +described, reusable groups are something users could implement by way of a simple macro +pre-processor, so having this in the standard profile really does not create any particular +challenge when mapping from DFDL standard profile schemas into any data structure system. +Groups and group references are used heavily in DFDL schemas to push down complexity like +discriminators that are reused in many places. +Allowing groups and group references reduces the difficulty of converting many large DFDL +schemas to conform to the standard profile. 
+ +### No Element References + +There is no corresponding form of sharing in most data structure systems. + +### No Namespace-Qualified Names + +Only elementFormDefault 'unqualified' is allowed. +Note that this is the default for XML Schema and DFDL. + +### Unique Namespace Prefixes + +All namespace prefixes must be unique in the entire schema. + +This enables one to create unique identifiers by concatenating prefix_local to create global names. + +### All Element Children Have Unique Names + +All children element declarations must have unique names within their enclosing parent element. + +#### Discussion +Note that this causes issues in a number of large DFDL schemas (e.g, VMF) which attempt to implement a +single DFDL schema that is capable of handling multiple versions of the data format. + +In this case, the schema uses a construct like: +```xml +<choice> + <sequence> + <element name="C" type="zString"/> + <element name="hdr" type="hdr_version_C_type"/> + </sequence> + <sequence> + <element name="D" type="zString"/> + <element name="hdr" type="hdr_version_D_type"/> + </sequence> +</choice> +``` +In the above, you can see that there are two separate element declarations named hdr, of different types. +This allows common sub-structure that is the same in versions C and D to be addressed by path expressions that are +polymorphic. They do not have a path step component that identifies the version. + +However, if we require all children to have unique names, then this would have to be elements with distinct names on each +branch such as hdrC and hdrD, and then paths, even those reaching sub-fields that are common to both versions, +have path steps that are specifically requesting a particular version. + +This is a bit painful particularly if there are many expressions that need to reference into the common fields, because +all such expressions would need to be duplicated for version C and version D. 
+ +An alternative solution is that this could be overcome by a way of creating path expressions with wildcards in +them eg., ".../hdr*/...". +An extension of this kind in DFDL has already been proposed/discussed some time ago by the DFDL workgroup, but has +not yet turned into a formal proposal. (The DFDL4Space implementation by the ESA has a kind of wildcard feature like +this as a DFDL extension.) + +Another way of addressing this is to put the version distinction at precisely each point of difference between the +schema versions. +This is, however, not the way some large schemas were created, as these schemas are machine generated from the +individual format specifications. The generator is not aware of the individual differences between the versions, +but only that there are *some* differences between them. +It requires a more sophisticated schema generator to compute these fine-level diffs between the two schemas. + + +### Nillable Simple Types Only (TBD: May not be necessary) + +Nillable is allowed only for simple type elements. + +### Element Name/Identifier Restrictions + +Element names must consist of all non-whitespace characters from the +Unicode basic multilingual plane (no surrogate pairs in element names). + +They may not contain any control characters (Uncode class Cc) may not contain various punctuation +characters (Uncode class Ps, Pe, Pd, Pc, Pf, Pi, Po, nor $). + +This is a lowest-common denominator of identifier rules intended to allow +DFDL schema identifiers to be mapped into ANY programming language or +structure declaration language, while at the same time allowing use of +Unicode characters. + +Element names may not begin with a digit. 
+ +Users are encourated to use "A-Za-Z0-9" only as some systems may not allow use of unicode in Review Comment: *Unicode characters ########## site/dev/design-notes/Proposed-DFDL-Standard-Profile.md: ########## @@ -0,0 +1,214 @@ +# Proposal: DFDL Standard Profile + +#### Version 0.2 2023-10-18 + +## Introduction + +In attempting to integrate Apache Daffodil with other data processing software, the need to make +DFDL schemas interoperate properly in conjunction with other data models has arisen. + +Other tools such as Apache NiFi, Apache Drill, Apache Spark, etc. have data models which are +powerful, but not as expressive as DFDL. + +DFDL's data model is a simplification of XML Schemas's PSVI; however, even this causes problems. +Most other data processing systems were not designed with markup languages in mind, but rather for +structured data. + +The following things are allowed in DFDL v1.0, but are difficult to map into most data models: + +- anonymous choices +- duplicate element child names +- namespaces that are different, but where the prefixes are not unique +- global names for element children + +A more restrictive subset of DFDL, a _standard profile_, is needed which can be enforced (on +request) to ensure that DFDL schemas will be usable with a variety of data processing systems. +Creating DFDL schemas that adhere to this standard profile ensures maximal interoperability, +including the ability to convert into JSON without name/namespace collisions. + +This is a proposal for a switch/option to be added to Daffodil which turns on enforcement of this +standard profile, aka subset of DFDL. + +## Standard Profile Restrictions + +### No Anonymous Choices + +Choices must be the model groups of complex type definitions and are not allowed in any other +context. + +Each choice branch must begin with a different element. (This is already a XML Schema requirement - +Unique Particle Attribution.) 
+ +### Group References Cannot Carry DFDL Properties + +Group references are allowed, but DFDL properties cannot be expressed on group references; hence, +combining those properties with those of the group definition is not required. + +While most data structure systems do not have this notion of reusable groups, when restricted as +described, reusable groups are something users could implement by way of a simple macro +pre-processor, so having this in the standard profile really does not create any particular +challenge when mapping from DFDL standard profile schemas into any data structure system. +Groups and group references are used heavily in DFDL schemas to push down complexity like +discriminators that are reused in many places. +Allowing groups and group references reduces the difficulty of converting many large DFDL +schemas to conform to the standard profile. + +### No Element References + +There is no corresponding form of sharing in most data structure systems. + +### No Namespace-Qualified Names + +Only elementFormDefault 'unqualified' is allowed. +Note that this is the default for XML Schema and DFDL. + +### Unique Namespace Prefixes + +All namespace prefixes must be unique in the entire schema. + +This enables one to create unique identifiers by concatenating prefix_local to create global names. + +### All Element Children Have Unique Names + +All children element declarations must have unique names within their enclosing parent element. + +#### Discussion +Note that this causes issues in a number of large DFDL schemas (e.g, VMF) which attempt to implement a +single DFDL schema that is capable of handling multiple versions of the data format. 
+ +In this case, the schema uses a construct like: +```xml +<choice> + <sequence> + <element name="C" type="zString"/> + <element name="hdr" type="hdr_version_C_type"/> + </sequence> + <sequence> + <element name="D" type="zString"/> + <element name="hdr" type="hdr_version_D_type"/> + </sequence> +</choice> +``` +In the above, you can see that there are two separate element declarations named hdr, of different types. +This allows common sub-structure that is the same in versions C and D to be addressed by path expressions that are +polymorphic. They do not have a path step component that identifies the version. + +However, if we require all children to have unique names, then this would have to be elements with distinct names on each +branch such as hdrC and hdrD, and then paths, even those reaching sub-fields that are common to both versions, +have path steps that are specifically requesting a particular version. + +This is a bit painful particularly if there are many expressions that need to reference into the common fields, because +all such expressions would need to be duplicated for version C and version D. + +An alternative solution is that this could be overcome by a way of creating path expressions with wildcards in +them eg., ".../hdr*/...". +An extension of this kind in DFDL has already been proposed/discussed some time ago by the DFDL workgroup, but has +not yet turned into a formal proposal. (The DFDL4Space implementation by the ESA has a kind of wildcard feature like +this as a DFDL extension.) + +Another way of addressing this is to put the version distinction at precisely each point of difference between the +schema versions. +This is, however, not the way some large schemas were created, as these schemas are machine generated from the +individual format specifications. The generator is not aware of the individual differences between the versions, +but only that there are *some* differences between them. 
+It requires a more sophisticated schema generator to compute these fine-level diffs between the two schemas. + + +### Nillable Simple Types Only (TBD: May not be necessary) + +Nillable is allowed only for simple type elements. + +### Element Name/Identifier Restrictions + +Element names must consist of all non-whitespace characters from the +Unicode basic multilingual plane (no surrogate pairs in element names). + +They may not contain any control characters (Uncode class Cc) may not contain various punctuation +characters (Uncode class Ps, Pe, Pd, Pc, Pf, Pi, Po, nor $). + +This is a lowest-common denominator of identifier rules intended to allow +DFDL schema identifiers to be mapped into ANY programming language or +structure declaration language, while at the same time allowing use of +Unicode characters. + +Element names may not begin with a digit. + +Users are encourated to use "A-Za-Z0-9" only as some systems may not allow use of unicode in +identifiers. + +Element names may not begin with any prefix defined as part of the schema +followed by an "_" as this could be ambiguous with names being made globally +unique by appending prefix, "_" and local name. + +### String Content Restrictions + +Schemas may only be written in UTF-8 encoding. + +The DFDL property dfdl:utf16Width must be 'fixed'. + +### Import `schemaLocation` + +Imported files - a single unique file must be used when importing a namespace. Review Comment: *A namespace must be imported from a single unique imported file. ########## site/dev/design-notes/Proposed-DFDL-Standard-Profile.md: ########## @@ -0,0 +1,214 @@ +# Proposal: DFDL Standard Profile + +#### Version 0.2 2023-10-18 + +## Introduction + +In attempting to integrate Apache Daffodil with other data processing software, the need to make +DFDL schemas interoperate properly in conjunction with other data models has arisen. + +Other tools such as Apache NiFi, Apache Drill, Apache Spark, etc. 
have data models which are +powerful, but not as expressive as DFDL. + +DFDL's data model is a simplification of XML Schemas's PSVI; however, even this causes problems. +Most other data processing systems were not designed with markup languages in mind, but rather for +structured data. + +The following things are allowed in DFDL v1.0, but are difficult to map into most data models: + +- anonymous choices +- duplicate element child names +- namespaces that are different, but where the prefixes are not unique +- global names for element children + +A more restrictive subset of DFDL, a _standard profile_, is needed which can be enforced (on +request) to ensure that DFDL schemas will be usable with a variety of data processing systems. +Creating DFDL schemas that adhere to this standard profile ensures maximal interoperability, +including the ability to convert into JSON without name/namespace collisions. + +This is a proposal for a switch/option to be added to Daffodil which turns on enforcement of this +standard profile, aka subset of DFDL. + +## Standard Profile Restrictions + +### No Anonymous Choices + +Choices must be the model groups of complex type definitions and are not allowed in any other +context. + +Each choice branch must begin with a different element. (This is already a XML Schema requirement - +Unique Particle Attribution.) + +### Group References Cannot Carry DFDL Properties + +Group references are allowed, but DFDL properties cannot be expressed on group references; hence, +combining those properties with those of the group definition is not required. + +While most data structure systems do not have this notion of reusable groups, when restricted as +described, reusable groups are something users could implement by way of a simple macro +pre-processor, so having this in the standard profile really does not create any particular +challenge when mapping from DFDL standard profile schemas into any data structure system. 
+Groups and group references are used heavily in DFDL schemas to push down complexity like +discriminators that are reused in many places. +Allowing groups and group references reduces the difficulty of converting many large DFDL +schemas to conform to the standard profile. + +### No Element References + +There is no corresponding form of sharing in most data structure systems. + +### No Namespace-Qualified Names + +Only elementFormDefault 'unqualified' is allowed. +Note that this is the default for XML Schema and DFDL. + +### Unique Namespace Prefixes + +All namespace prefixes must be unique in the entire schema. + +This enables one to create unique identifiers by concatenating prefix_local to create global names. + +### All Element Children Have Unique Names + +All children element declarations must have unique names within their enclosing parent element. + +#### Discussion +Note that this causes issues in a number of large DFDL schemas (e.g, VMF) which attempt to implement a +single DFDL schema that is capable of handling multiple versions of the data format. + +In this case, the schema uses a construct like: +```xml +<choice> + <sequence> + <element name="C" type="zString"/> + <element name="hdr" type="hdr_version_C_type"/> + </sequence> + <sequence> + <element name="D" type="zString"/> + <element name="hdr" type="hdr_version_D_type"/> + </sequence> +</choice> +``` +In the above, you can see that there are two separate element declarations named hdr, of different types. +This allows common sub-structure that is the same in versions C and D to be addressed by path expressions that are +polymorphic. They do not have a path step component that identifies the version. 
+ +However, if we require all children to have unique names, then this would have to be elements with distinct names on each +branch such as hdrC and hdrD, and then paths, even those reaching sub-fields that are common to both versions, +have path steps that are specifically requesting a particular version. + +This is a bit painful particularly if there are many expressions that need to reference into the common fields, because +all such expressions would need to be duplicated for version C and version D. + +An alternative solution is that this could be overcome by a way of creating path expressions with wildcards in +them eg., ".../hdr*/...". +An extension of this kind in DFDL has already been proposed/discussed some time ago by the DFDL workgroup, but has +not yet turned into a formal proposal. (The DFDL4Space implementation by the ESA has a kind of wildcard feature like +this as a DFDL extension.) + +Another way of addressing this is to put the version distinction at precisely each point of difference between the +schema versions. +This is, however, not the way some large schemas were created, as these schemas are machine generated from the +individual format specifications. The generator is not aware of the individual differences between the versions, +but only that there are *some* differences between them. +It requires a more sophisticated schema generator to compute these fine-level diffs between the two schemas. + + +### Nillable Simple Types Only (TBD: May not be necessary) + +Nillable is allowed only for simple type elements. Review Comment: Please clarify which simple type elements are nillable. I think numbers by themselves can never be nillable unless you're talking about optional simple type elements at certain places in an enclosing complex type? 
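To make the question concrete, here is a rough sketch of the two cases I am trying to tell apart. The element names (`record`, `comment`, `count`) and the choice of `dfdl:nilValue="%ES;"` are my own assumptions for illustration, not something taken from the proposal, and the `dfdl:format` properties a real schema needs are left out.

```xml
<!-- Hypothetical names, for illustration only; not taken from the proposal. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">

  <xs:element name="record">
    <xs:complexType>
      <xs:sequence>
        <!-- Case 1: a simple type (string) element declared nillable.
             An empty representation is assumed to mean nil here. -->
        <xs:element name="comment" type="xs:string" nillable="true"
                    dfdl:nilKind="literalValue" dfdl:nilValue="%ES;"/>

        <!-- Case 2: a simple type (number) element that is optional
             rather than nillable. -->
        <xs:element name="count" type="xs:int" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

If the section means only case 1, it would help to say which simple types and which `dfdl:nilKind` values are in scope.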
########## site/dev/design-notes/Proposed-DFDL-Standard-Profile.md: ########## @@ -0,0 +1,214 @@ +# Proposal: DFDL Standard Profile + +#### Version 0.2 2023-10-18 + +## Introduction + +In attempting to integrate Apache Daffodil with other data processing software, the need to make +DFDL schemas interoperate properly in conjunction with other data models has arisen. + +Other tools such as Apache NiFi, Apache Drill, Apache Spark, etc. have data models which are +powerful, but not as expressive as DFDL. + +DFDL's data model is a simplification of XML Schemas's PSVI; however, even this causes problems. +Most other data processing systems were not designed with markup languages in mind, but rather for +structured data. + +The following things are allowed in DFDL v1.0, but are difficult to map into most data models: + +- anonymous choices +- duplicate element child names +- namespaces that are different, but where the prefixes are not unique +- global names for element children + +A more restrictive subset of DFDL, a _standard profile_, is needed which can be enforced (on +request) to ensure that DFDL schemas will be usable with a variety of data processing systems. +Creating DFDL schemas that adhere to this standard profile ensures maximal interoperability, +including the ability to convert into JSON without name/namespace collisions. + +This is a proposal for a switch/option to be added to Daffodil which turns on enforcement of this +standard profile, aka subset of DFDL. + +## Standard Profile Restrictions + +### No Anonymous Choices + +Choices must be the model groups of complex type definitions and are not allowed in any other +context. + +Each choice branch must begin with a different element. (This is already a XML Schema requirement - +Unique Particle Attribution.) 
+ +### Group References Cannot Carry DFDL Properties + +Group references are allowed, but DFDL properties cannot be expressed on group references; hence, +combining those properties with those of the group definition is not required. + +While most data structure systems do not have this notion of reusable groups, when restricted as +described, reusable groups are something users could implement by way of a simple macro +pre-processor, so having this in the standard profile really does not create any particular +challenge when mapping from DFDL standard profile schemas into any data structure system. +Groups and group references are used heavily in DFDL schemas to push down complexity like +discriminators that are reused in many places. +Allowing groups and group references reduces the difficulty of converting many large DFDL +schemas to conform to the standard profile. + +### No Element References + +There is no corresponding form of sharing in most data structure systems. + +### No Namespace-Qualified Names + +Only elementFormDefault 'unqualified' is allowed. +Note that this is the default for XML Schema and DFDL. + +### Unique Namespace Prefixes + +All namespace prefixes must be unique in the entire schema. + +This enables one to create unique identifiers by concatenating prefix_local to create global names. + +### All Element Children Have Unique Names + +All children element declarations must have unique names within their enclosing parent element. + +#### Discussion +Note that this causes issues in a number of large DFDL schemas (e.g, VMF) which attempt to implement a +single DFDL schema that is capable of handling multiple versions of the data format. 
+
+In this case, the schema uses a construct like:
+```xml
+<choice>
+  <sequence>
+    <element name="C" type="zString"/>
+    <element name="hdr" type="hdr_version_C_type"/>
+  </sequence>
+  <sequence>
+    <element name="D" type="zString"/>
+    <element name="hdr" type="hdr_version_D_type"/>
+  </sequence>
+</choice>
+```
+In the above, you can see that there are two separate element declarations named hdr, of different types.
+This allows common sub-structure that is the same in versions C and D to be addressed by path expressions that are
+polymorphic. They do not have a path step component that identifies the version.
+
+However, if we require all children to have unique names, then this would have to be elements with distinct names on each
+branch such as hdrC and hdrD, and then paths, even those reaching sub-fields that are common to both versions,
+have path steps that are specifically requesting a particular version.
+
+This is a bit painful particularly if there are many expressions that need to reference into the common fields, because
+all such expressions would need to be duplicated for version C and version D.
+
+An alternative solution is that this could be overcome by a way of creating path expressions with wildcards in
+them, e.g., ".../hdr*/...".
+An extension of this kind in DFDL has already been proposed/discussed some time ago by the DFDL workgroup, but has
+not yet turned into a formal proposal. (The DFDL4Space implementation by the ESA has a kind of wildcard feature like
+this as a DFDL extension.)
+
+Another way of addressing this is to put the version distinction at precisely each point of difference between the
+schema versions.
+This is, however, not the way some large schemas were created, as these schemas are machine generated from the
+individual format specifications. The generator is not aware of the individual differences between the versions,
+but only that there are *some* differences between them.
+It requires a more sophisticated schema generator to compute these fine-level diffs between the two schemas.
+
+
+### Nillable Simple Types Only (TBD: May not be necessary)
+
+Nillable is allowed only for simple type elements.
+
+### Element Name/Identifier Restrictions
+
+Element names must consist entirely of non-whitespace characters from the
+Unicode basic multilingual plane (no surrogate pairs in element names).
+
+They may not contain any control characters (Unicode class Cc) and may not contain various punctuation
+characters (Unicode class Ps, Pe, Pd, Pc, Pf, Pi, Po, nor $).
+
+This is a lowest-common denominator of identifier rules intended to allow
+DFDL schema identifiers to be mapped into ANY programming language or
+structure declaration language, while at the same time allowing use of
+Unicode characters.
+
+Element names may not begin with a digit.
+
+Users are encouraged to use "A-Za-z0-9" only as some systems may not allow use of Unicode in
+identifiers.
+
+Element names may not begin with any prefix defined as part of the schema
+followed by an "_" as this could be ambiguous with names being made globally
+unique by appending prefix, "_" and local name.
+
+### String Content Restrictions
+
+Schemas may only be written in UTF-8 encoding.
+
+The DFDL property dfdl:utf16Width must be 'fixed'.
+
+### Import `schemaLocation`
+
+Imported files - a single unique file must be used when importing a namespace.
+A single schema may not contain different import statements for the same
+namespace but which specify different files.
+
+This is a practical requirement in Apache Daffodil today, but should be made explicit.
+
+The `schemaLocation` - if it begins with a "/" it is interpreted as an absolute path, otherwise a
+relative path. Both may be interpreted relative to a classpath.
+
+## Existing DFDL Restrictions
+
+Just as a reminder, the above standard-profile restrictions go on top of DFDL's
+existing limitations on XML Schema such as:
+
+- arrays/optional - only for elements
+- no mixed content
+- no complex type derivation
+- no attributes
+- limited set of simple types
+- pattern facets only for xs:string elements
+- other facet restrictions by type
+
+## Possible additional restrictions
+
+### Troublesome Placement of dfdl:assert and dfdl:discriminator on Sequences & Choices
+
+The DFDL v1.0 rules about sequences/choices and statement annotations on them are confusing.
+In particular, a dfdl:assert or dfdl:discriminator with testKind 'expression' appears lexically at
+the top of the sequence/choice, but is executed after the sequence/choice content has been parsed.
+
+This is sufficiently error-prone that the standard profile should disallow
+it, requiring instead that an inner sequence carrying the assertion or
+discriminator, with NO child content, be inserted in the sequence at the
+point where the evaluation is required to occur.
+
+# Enabling the Standard Profile
+
+The following ways should be available for a schema author to tell Daffodil they want enforcement of
+the standard profile (or not).

Review Comment:
I see so many ways below that I think we need use cases justifying the need for implementing each way. Can we drop any of these ways without losing any use cases?

##########
site/dev/design-notes/Proposed-DFDL-Standard-Profile.md:
##########
@@ -0,0 +1,214 @@
+# Proposal: DFDL Standard Profile
+
+#### Version 0.2 2023-10-18
+
+## Introduction
+
+In attempting to integrate Apache Daffodil with other data processing software, the need to make
+DFDL schemas interoperate properly in conjunction with other data models has arisen.
+
+Other tools such as Apache NiFi, Apache Drill, Apache Spark, etc. have data models which are
+powerful, but not as expressive as DFDL.
+
+DFDL's data model is a simplification of XML Schema's PSVI; however, even this causes problems.
+Most other data processing systems were not designed with markup languages in mind, but rather for
+structured data.
+
+The following things are allowed in DFDL v1.0, but are difficult to map into most data models:
+
+- anonymous choices
+- duplicate element child names
+- namespaces that are different, but where the prefixes are not unique
+- global names for element children
+
+A more restrictive subset of DFDL, a _standard profile_, is needed which can be enforced (on
+request) to ensure that DFDL schemas will be usable with a variety of data processing systems.
+Creating DFDL schemas that adhere to this standard profile ensures maximal interoperability,
+including the ability to convert into JSON without name/namespace collisions.
+
+This is a proposal for a switch/option to be added to Daffodil which turns on enforcement of this
+standard profile, aka subset of DFDL.
+
+## Standard Profile Restrictions
+
+### No Anonymous Choices
+
+Choices must be the model groups of complex type definitions and are not allowed in any other
+context.
+
+Each choice branch must begin with a different element. (This is already an XML Schema requirement -
+Unique Particle Attribution.)
+
+### Group References Cannot Carry DFDL Properties
+
+Group references are allowed, but DFDL properties cannot be expressed on group references; hence,
+combining those properties with those of the group definition is not required.
+
+While most data structure systems do not have this notion of reusable groups, when restricted as
+described, reusable groups are something users could implement by way of a simple macro
+pre-processor, so having this in the standard profile really does not create any particular
+challenge when mapping from DFDL standard profile schemas into any data structure system.
+Groups and group references are used heavily in DFDL schemas to push down complexity like
+discriminators that are reused in many places.
+Allowing groups and group references reduces the difficulty of converting many large DFDL
+schemas to conform to the standard profile.
+
+### No Element References
+
+There is no corresponding form of sharing in most data structure systems.
+
+### No Namespace-Qualified Names
+
+Only elementFormDefault 'unqualified' is allowed.
+Note that this is the default for XML Schema and DFDL.
+
+### Unique Namespace Prefixes
+
+All namespace prefixes must be unique in the entire schema.
+
+This enables one to create unique identifiers by concatenating prefix_local to create global names.
+
+### All Element Children Have Unique Names
+
+All child element declarations must have unique names within their enclosing parent element.
+
+#### Discussion
+Note that this causes issues in a number of large DFDL schemas (e.g., VMF) which attempt to implement a
+single DFDL schema that is capable of handling multiple versions of the data format.
+
+In this case, the schema uses a construct like:
+```xml
+<choice>
+  <sequence>
+    <element name="C" type="zString"/>
+    <element name="hdr" type="hdr_version_C_type"/>
+  </sequence>
+  <sequence>
+    <element name="D" type="zString"/>
+    <element name="hdr" type="hdr_version_D_type"/>
+  </sequence>
+</choice>
+```
+In the above, you can see that there are two separate element declarations named hdr, of different types.
+This allows common sub-structure that is the same in versions C and D to be addressed by path expressions that are
+polymorphic. They do not have a path step component that identifies the version.
+
+However, if we require all children to have unique names, then this would have to be elements with distinct names on each
+branch such as hdrC and hdrD, and then paths, even those reaching sub-fields that are common to both versions,
+have path steps that are specifically requesting a particular version.
+
+This is a bit painful particularly if there are many expressions that need to reference into the common fields, because
+all such expressions would need to be duplicated for version C and version D.
+
+An alternative solution is that this could be overcome by a way of creating path expressions with wildcards in
+them, e.g., ".../hdr*/...".
+An extension of this kind in DFDL has already been proposed/discussed some time ago by the DFDL workgroup, but has
+not yet turned into a formal proposal. (The DFDL4Space implementation by the ESA has a kind of wildcard feature like
+this as a DFDL extension.)
+
+Another way of addressing this is to put the version distinction at precisely each point of difference between the
+schema versions.
+This is, however, not the way some large schemas were created, as these schemas are machine generated from the
+individual format specifications. The generator is not aware of the individual differences between the versions,
+but only that there are *some* differences between them.
+It requires a more sophisticated schema generator to compute these fine-level diffs between the two schemas.
+
+
+### Nillable Simple Types Only (TBD: May not be necessary)
+
+Nillable is allowed only for simple type elements.
+
+### Element Name/Identifier Restrictions
+
+Element names must consist entirely of non-whitespace characters from the
+Unicode basic multilingual plane (no surrogate pairs in element names).
+
+They may not contain any control characters (Unicode class Cc) and may not contain various punctuation
+characters (Unicode class Ps, Pe, Pd, Pc, Pf, Pi, Po, nor $).
+
+This is a lowest-common denominator of identifier rules intended to allow
+DFDL schema identifiers to be mapped into ANY programming language or
+structure declaration language, while at the same time allowing use of
+Unicode characters.
+
+Element names may not begin with a digit.
+
+Users are encouraged to use "A-Za-z0-9" only as some systems may not allow use of Unicode in
+identifiers.
+
+Element names may not begin with any prefix defined as part of the schema
+followed by an "_" as this could be ambiguous with names being made globally
+unique by appending prefix, "_" and local name.
+
+### String Content Restrictions
+
+Schemas may only be written in UTF-8 encoding.
+
+The DFDL property dfdl:utf16Width must be 'fixed'.

Review Comment:
What is `dfdl:utf16Width` used for, and why must it be 'fixed'?
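
For illustration only, the "Element Name/Identifier Restrictions" quoted above could be checked mechanically along the following lines. This is a minimal Python sketch under stated assumptions, not Daffodil code: the function `name_violations` and its `prefixes` parameter are hypothetical names, and reading "digit" as Unicode class Nd is an assumption. One thing the sketch makes visible is that Unicode class Pc contains the underscore, so a literal reading of the rule reports names such as `hdr_C`.

```python
import unicodedata

# Classes the quoted text excludes: control characters (Cc) and the listed
# punctuation classes. "$" is class Sc, so it is checked separately.
FORBIDDEN_CATEGORIES = {"Cc", "Ps", "Pe", "Pd", "Pc", "Pf", "Pi", "Po"}


def name_violations(name, prefixes=()):
    """Return the reasons `name` breaks the quoted rules (empty list if none).

    Hypothetical helper: `prefixes` stands for the namespace prefixes
    declared in the schema.
    """
    if not name:
        return ["name is empty"]
    problems = []
    # Assumption: "digit" means Unicode decimal digit (class Nd).
    if unicodedata.category(name[0]) == "Nd":
        problems.append("begins with a digit")
    for ch in name:
        cat = unicodedata.category(ch)
        if ord(ch) > 0xFFFF:
            problems.append(f"U+{ord(ch):04X} is outside the basic multilingual plane")
        elif ch.isspace():
            problems.append(f"U+{ord(ch):04X} is whitespace")
        elif ch == "$" or cat in FORBIDDEN_CATEGORIES:
            problems.append(f"{ch!r} is in a disallowed class ({cat})")
    for prefix in prefixes:
        # A name like "ex_hdr" could collide with a global name built as prefix + "_" + local.
        if name.startswith(prefix + "_"):
            problems.append(f"begins with schema prefix {prefix!r} followed by '_'")
    return problems
```

Calling `name_violations("hdrC")` returns an empty list, while `name_violations("ex_hdr", prefixes=["ex"])` reports both the underscore (class Pc) and the prefix collision. Whether the underscore is really meant to be excluded is a question for the proposal's author; the sketch only follows the text as written.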
