On 11.03.2015 21:53, travis+ml-lang...@subspacefield.org wrote:
> I think the closest term in natural language parsing is polysemy;
> one XML document, different meanings.
> 

There is some research on expressiveness of XML Schema (XSD) that
includes a notion of ambiguity [1]. An XSD is basically a regular tree
grammar for XML Infosets, where nonterminals are complex types and
terminals are simple types for content. Processing then means
assigning types to Infoset entities and calling the type-associated
functions. An unrestricted regular tree grammar allows ambiguity in
the sense that there are multiple valid type assignments for an XML
Infoset. When you process the XML Infoset in document order, there
could also be nondeterminism in type assignment that only resolves at
a later location in the Infoset. Ambiguity always implies
nondeterminism, and nondeterminism makes stream processing really
hard. The XSD standard therefore enforces two syntactic rules, Unique
Particle Attribution (UPA) and Element Declarations Consistent (EDC),
to guarantee deterministic schemas, so every parsable XML Infoset has
a unique type assignment for its Infoset entities and therefore a
unique interpretation.
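
To make the ambiguity notion concrete, here is a minimal Python sketch
that counts the valid type assignments of a tiny tree under a toy
regular tree grammar. The grammar, the type names (Doc, Price, Label)
and the function count_assignments are made up for illustration; real
XSD content models are regular expressions, not fixed sequences.

```python
from math import prod

# Hypothetical grammar: type name -> (element name, alternative child-type sequences).
# Two different types (Price, Label) share the element name "item" inside the
# same content model, which is the kind of thing EDC forbids in XSD.
GRAMMAR = {
    "Doc":   ("doc",  [["Price"], ["Label"]]),
    "Price": ("item", [[]]),   # think: decimal content
    "Label": ("item", [[]]),   # think: string content
}

def count_assignments(tree, type_name, grammar=GRAMMAR):
    """Count valid type assignments for `tree` when its root gets `type_name`.

    A tree is a pair (element_name, list_of_child_trees). The count is exact
    as long as the alternatives of each type are pairwise distinct.
    """
    name, children = tree
    element, alternatives = grammar[type_name]
    if name != element:
        return 0
    total = 0
    for child_types in alternatives:
        if len(child_types) == len(children):
            # Multiply the number of choices for each child subtree.
            total += prod(
                count_assignments(child, child_type, grammar)
                for child, child_type in zip(children, child_types)
            )
    return total

tree = ("doc", [("item", [])])
print(count_assignments(tree, "Doc"))  # more than 1 means ambiguous
```

Dropping one of the two alternatives of Doc makes the count 1 again,
which is what a deterministic schema guarantees for every valid
document.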

When only the fundamental structures of XML and its schema languages
are considered, you can resort to neat formalisms [2,3], where
decision problems are well-researched, e.g., unranked regular tree
automata or in my case visibly pushdown automata.

Text contents between two tags are from an infinite domain and can be
dealt with using more complex formalisms, e.g., data tree automata, or
by reducing the infinite domain to a finite number of datatypes, e.g.,
the XSD datatypes for simple types.

However, the trouble starts with XML features that affect language
properties. An IDREF is basically a self-reference, and
self-references are an indicator for context-sensitivity. So, no
regular formalism can decide validity when IDREFs are involved, and in
practice, the XML parser or business logic (= shotgun parsing) does
the checking. A lot of things can go wrong here; the best example is
the XML Signature Wrapping attack [4].
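
As a rough illustration of why this check ends up outside the
grammar, here is a Python sketch of the usual two-step approach:
collect all IDs of the whole document first, then look for references
that point nowhere. The attribute names "id" and "ref" and the
function dangling_idrefs are assumptions for this example; in real
XML the DTD or XSD declares which attributes are of type ID and
IDREF.

```python
import xml.etree.ElementTree as ET

def dangling_idrefs(root, id_attr="id", ref_attr="ref"):
    """Return the reference values that match no ID anywhere in the document."""
    # Step 1: a global pass to collect every declared ID.
    ids = {e.get(id_attr) for e in root.iter() if e.get(id_attr) is not None}
    # Step 2: a second pass to compare each reference against that set.
    return [
        e.get(ref_attr)
        for e in root.iter()
        if e.get(ref_attr) is not None and e.get(ref_attr) not in ids
    ]

doc = ET.fromstring('<doc><item id="a"/><link ref="a"/><link ref="b"/></doc>')
print(dangling_idrefs(doc))
```

The set of IDs is unbounded document-wide state, which is exactly
what a finite tree automaton cannot carry along.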

A more fundamental problem is, in my opinion, how "extensibility" is
achieved in XML and XSD. Basically all web services standards (and
their XSDs), including XMPP, have been designed to be modular. How
does an XML parser know which "module" should be used for type
assignment?

A so-called schema extension point looks like this in XSD:
<xs:any namespace="##other" minOccurs="0" maxOccurs="unbounded"
processContents="lax"/>

xs:any is a wildcard (it matches any subtree at this location in an
Infoset), namespace="##other" instructs the parser that any
well-formed XML that is from a namespace other than the target
namespace of the type being defined is allowed, and
processContents="lax" instructs it to validate the entity according to
its namespace, but if the schema cannot be obtained, to skip the
subtree without raising any error (!!!).
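
The lax behaviour can be mimicked in a few lines of Python. This is
only a sketch of the semantics, not a real XSD validator: the
namespaces, the validators table (namespace -> check function) and
the function lax_wildcard are hypothetical, and a real validator
looks up element declarations rather than a per-namespace callback.

```python
import xml.etree.ElementTree as ET

TARGET_NS = "urn:example:target"  # hypothetical target namespace

def namespace_of(elem):
    """Extract the namespace from ElementTree's '{ns}local' tag form."""
    tag = elem.tag
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""

def lax_wildcard(parent, validators):
    """Mimic <xs:any namespace="##other" processContents="lax"/> on the children."""
    report = []
    for child in parent:
        ns = namespace_of(child)
        if ns == TARGET_NS or ns == "":
            # ##other excludes the target namespace (and, in XSD 1.0,
            # unqualified elements).
            report.append((child.tag, "rejected"))
        elif ns in validators:
            ok = validators[ns](child)
            report.append((child.tag, "valid" if ok else "invalid"))
        else:
            # No schema available for this namespace: skip silently.
            report.append((child.tag, "skipped"))
    return report

doc = ET.fromstring(
    '<t:envelope xmlns:t="urn:example:target" xmlns:k="urn:example:known"'
    ' xmlns:x="urn:example:bogus">'
    '<k:note>hi</k:note>'
    '<x:wrapper><t:signed>old payload</t:signed></x:wrapper>'
    '</t:envelope>'
)
validators = {"urn:example:known": lambda e: len(e) == 0}
report = lax_wildcard(doc, validators)
for tag, status in report:
    print(tag, status)
```

Note that the t:signed element inside x:wrapper is never looked at,
although it is from the target namespace: the whole wrapper subtree
is skipped because nothing is known about its namespace.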

A schema-validating parser therefore tries to fetch an XSD into its
cache when an Infoset entity with an unknown namespace appears at a
schema extension point during validation. If retrieval via the
namespace URL fails, validation of the subtree is skipped without
error. That is why XML Signature Wrapping works even when schema
validation is turned on: the originally signed subtree is hidden in a
wrapper with a bogus namespace (skipped). But I think much more is
possible if a validating XML parser can be guided to fetch a remote
XSD, a.k.a. "schema injection". I haven't seen anything in this
direction; at least DoS should be possible, but maybe more fundamental
stuff too.

Cheers,
Harald




[1] W. Martens, F. Neven, T. Schwentick, and G. J. Bex,
“Expressiveness and Complexity of XML Schema,” ACM Trans. Database
Syst., vol. 31, no. 3, pp. 770–813, Sep. 2006.
[2] T. Schwentick, “Automata for XML—A Survey,” J. Comput. Syst. Sci.,
vol. 73, no. 3, pp. 289–315, May 2007.
[3] F. Neven, “Automata Theory for XML Researchers,” ACM SIGMOD Rec.,
vol. 31, no. 3, p. 39, Sep. 2002.
[4] J. Somorovsky, A. Mayer, J. Schwenk, M. Kampmann, and M. Jensen,
“On Breaking SAML: Be Whoever You Want to Be,” USENIX Secur. ’12,
2012.
_______________________________________________
langsec-discuss mailing list
langsec-discuss@mail.langsec.org
https://mail.langsec.org/cgi-bin/mailman/listinfo/langsec-discuss
