On 11.03.2015 21:53, travis+ml-lang...@subspacefield.org wrote:
> I think the closest term in natural language parsing is polysemy;
> one XML document, different meanings.

There is some research on the expressiveness of XML Schema (XSD) that includes a notion of ambiguity [1]. An XSD is basically a regular tree grammar for XML Infosets, where nonterminals are complex types and terminals are simple types for content. Processing then means assigning types to Infoset entities and calling the type-associated functions.

An unrestricted regular tree grammar allows ambiguity in the sense that there can be multiple valid type assignments for one XML Infoset. When you process the XML Infoset in document order, there can also be nondeterminism in type assignment that only resolves at a later location in the Infoset. Ambiguity always implies nondeterminism, and nondeterminism makes stream processing really hard. The XSD standard therefore enforces two syntactic rules, Unique Particle Attribution (UPA) and Element Declarations Consistent (EDC), to guarantee deterministic schemas, so every parsable XML Infoset has a unique type assignment for its Infoset entities and therefore a unique interpretation.

When only the fundamental structures of XML and its schema languages are considered, you can resort to neat formalisms [2,3] where decision problems are well-researched, e.g., unranked regular tree automata or, in my case, visibly pushdown automata. Text content between two tags comes from an infinite domain and can be dealt with either by more complex formalisms, e.g., data tree automata, or by reducing the infinite domain to a finite number of datatypes, e.g., the XSD datatypes for simple types.

However, the trouble starts with XML features that affect language properties. An IDREF is basically a self-reference, and self-references are an indicator of context-sensitivity. So no regular formalism can decide validity when IDREFs are involved, and in practice the XML parser or the business logic (= shotgun parsing) does the checking. A lot of things can go wrong here; the best example is the XML Signature Wrapping attack [4].

A more fundamental problem is, in my opinion, how "extensibility" is achieved in XML and XSD.
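
A brief aside on the IDREF point before getting to extensibility: the check is not local to a subtree, because a reference may point at an ID that appears anywhere in the document, including later in document order. The following is only a minimal sketch in Python, assuming (hypothetically) that attributes named `id` and `ref` are the ones declared as ID and IDREF; in real XML that typing comes from a DTD or schema, and the function name `dangling_idrefs` is made up for illustration.

```python
import xml.etree.ElementTree as ET

def dangling_idrefs(xml_text, id_attr="id", ref_attr="ref"):
    """Return the IDREF values that match no ID anywhere in the document.

    Assumption of this sketch: id_attr / ref_attr are the attributes typed
    as ID / IDREF. A real parser learns this from the DTD or schema.
    """
    root = ET.fromstring(xml_text)
    # Pass 1: collect every ID value in the whole document.
    ids = {el.get(id_attr) for el in root.iter() if el.get(id_attr) is not None}
    # Pass 2: a reference is valid only if its target exists somewhere,
    # possibly later in document order, so the check needs global state.
    return [el.get(ref_attr) for el in root.iter()
            if el.get(ref_attr) is not None and el.get(ref_attr) not in ids]

doc = '<doc><item id="a"/><link ref="a"/><link ref="b"/></doc>'
forward = '<doc><link ref="z"/><item id="z"/></doc>'
print(dangling_idrefs(doc))      # the reference to "b" has no target
print(dangling_idrefs(forward))  # a forward reference is still valid
```

The set of seen IDs is exactly the kind of unbounded side state that a finite tree automaton does not have, which is why this check tends to end up in the parser or in application code.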
Basically all web services standards (and their XSDs), including XMPP, have been designed to be modular. How does an XML parser know which "module" should be used for type assignment? A so-called schema extension point looks like this in XSD:

    <xs:any namespace="##other" minOccurs="0" maxOccurs="unbounded" processContents="lax"/>

xs:any is the element wildcard (it matches any subtree at this location in an Infoset). namespace="##other" tells the parser that any well-formed XML from a namespace other than the target namespace of the type being defined is allowed. processContents="lax" tells it to validate the entity according to its namespace, but if the schema cannot be obtained, the subtree is skipped and no error is raised (!!!).

A schema-validating parser therefore tries to fetch an XSD into its cache when an Infoset entity with an unknown namespace appears at a schema extension point during validation. If retrieval from the namespace URL fails, validation of the subtree is skipped without error. That is why XML Signature Wrapping works even when schema validation is turned on: the originally signed subtree is hidden in a wrapper with a bogus namespace, and the wrapper is skipped.

But I think there is much more possible if a validating XML parser can be guided to fetch a remote XSD, aka "schema injection". I haven't seen anything in this direction; at least DoS should be possible, but maybe more fundamental stuff too.

Cheers,
Harald

[1] W. Martens, F. Neven, T. Schwentick, and G. J. Bex, "Expressiveness and Complexity of XML Schema," ACM Trans. Database Syst., vol. 31, no. 3, pp. 770–813, Sep. 2006.
[2] T. Schwentick, "Automata for XML - A Survey," J. Comput. Syst. Sci., vol. 73, no. 3, pp. 289–315, May 2007.
[3] F. Neven, "Automata Theory for XML Researchers," ACM SIGMOD Rec., vol. 31, no. 3, p. 39, Sep. 2002.
[4] J. Somorovsky, A. Mayer, J. Schwenk, M. Kampmann, and M. Jensen, "On Breaking SAML: Be Whoever You Want to Be," USENIX Security '12, 2012.
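
P.S. The lax-wildcard behaviour described above can be mimicked with a toy model. This is only a sketch in Python and not a real XSD validator: the namespace URIs, the `known_validators` table (standing in for the schemas a parser managed to obtain), and the function names are all hypothetical. It also rejects unqualified children at the wildcard, which is how I read ##other in XSD 1.0.

```python
import xml.etree.ElementTree as ET

TARGET_NS = "urn:example:target"  # hypothetical target namespace

def namespace_of(element):
    """Extract the namespace URI from ElementTree's '{uri}local' tag form."""
    return element.tag[1:].split("}")[0] if element.tag.startswith("{") else ""

def lax_validate(parent, known_validators):
    """Toy model of a namespace="##other", processContents="lax" wildcard.

    known_validators maps namespace URI -> callable(element) -> bool.
    Returns (accepted, skipped_tags).
    """
    skipped = []
    for child in parent:
        ns = namespace_of(child)
        if ns in (TARGET_NS, ""):
            return False, skipped      # excluded by ##other in this sketch
        validator = known_validators.get(ns)
        if validator is None:
            skipped.append(child.tag)  # no schema available: skipped, no error
        elif not validator(child):
            return False, skipped      # schema known and subtree invalid
    return True, skipped

doc = ET.fromstring(
    '<t:envelope xmlns:t="urn:example:target" xmlns:k="urn:example:known"'
    ' xmlns:x="urn:example:bogus">'
    '<k:header/><x:wrapper><t:signed-payload/></x:wrapper></t:envelope>')
validators = {"urn:example:known": lambda el: len(el) == 0}
accepted, skipped = lax_validate(doc, validators)
print(accepted, skipped)
```

In this model the wrapper in the bogus namespace is accepted without anyone looking inside it, even though it carries an element from the target namespace. That is the property the wrapping attack relies on.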