On Tue, Feb 13, 2018 at 09:18:12PM +0000, Patrick Schluter via Digitalmars-d-announce wrote:
> On Tuesday, 13 February 2018 at 20:10:59 UTC, Jonathan M Davis wrote:
[...]
> > If it's 100% sure that entity references can be treated as just text
> > and that you can't end up with stuff like start tags or end tags
> > being inserted and messing with the parsing such that they all have
> > to be replaced for the XML to be correctly parsed, then I have no
> > problem passing entity references along, and a higher level parser
> > could try to do something with them, but it's not clear to me at all
> > that an XML document with entity references is correct enough to be
> > parsed while not replacing the entity references with whatever XML
> > markup they contain. I had originally passed them along with the
> > idea that a higher level parser could do something with them, but I
> > decided that I couldn't do that if you could do something like drop
> > a start tag in there and change the meaning of the stuff that needs
> > to be parsed that isn't directly in the entity reference.

This made me go to the W3C spec (https://www.w3.org/TR/xml/) to figure
out what exactly is/isn't defined. I discovered to my chagrin that XML
entities are a huge rabbit hole with extremely pathological behaviour
that makes them almost impossible to implement in any way that's even
remotely efficient. Here's a page with examples of how nasty it can
get:

    http://www.floriankaeferboeck.at/XML/Comparison.html

Here's an example given in the W3C spec itself:

    <?xml version='1.0'?>
    <!DOCTYPE test [
    <!ELEMENT test (#PCDATA) >
    <!ENTITY % xx '&#37;zz;'>
    <!ENTITY % zz '&#60;!ENTITY tricky "error-prone" >' >
    %xx;
    ]>
    <test>This sample shows a &tricky; method.</test>

A correct XML parser is supposed to produce the following text as the
body of the <test>...</test> element (the grammatical error is
intentional):

    This sample shows a error-prone method.

Fortunately, there's a glimmer of hope on the horizon: in section 4.3.2
of the spec (https://www.w3.org/TR/xml/#wf-entities), it is explicitly
stated:

    A consequence of well-formedness in general entities is that the
    logical and physical structures in an XML document are properly
    nested; no start-tag, end-tag, empty-element tag, element,
    comment, processing instruction, character reference, or entity
    reference can begin in one entity and end in another.

Meaning, if I understand it correctly, that you can't have a start tag
in &entity1; and its corresponding end tag in &entity2;, and then have
your document contain "&entity1; &entity2;". This is because the body
of the entity can only contain text or entire elements (the production
"content" in the spec); an entity that contains an open tag without an
end tag (or vice versa) does not match this rule and is thus illegal.

So this means that we *can* use dxml as a backend to drive a
DTD-supporting XML parser implementation.
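
A quick way to convince yourself of that nesting rule is the following
minimal Python sketch. It is Python only because its standard library
ships an XML parser; it has nothing to do with dxml's API, and the
helper name is_balanced_content and the dummy wrapper element <w> are
made up for illustration. It wraps a candidate entity body in a dummy
element and checks whether the result parses, which roughly
approximates matching the spec's "content" production for bodies that
contain no further entity references:

```python
import xml.etree.ElementTree as ET


def is_balanced_content(replacement_text: str) -> bool:
    """Rough check that an entity body holds only text and whole elements.

    Assumption: the body contains no references to other (undeclared)
    entities; the stdlib parser would reject those as undefined.
    """
    try:
        # Wrapping in a dummy root turns "balanced content" into
        # "well-formed document", which the parser can verify for us.
        ET.fromstring("<w>" + replacement_text + "</w>")
    except ET.ParseError:
        return False
    return True


print(is_balanced_content("plain <b>bold</b> text"))  # whole element
print(is_balanced_content("<b>unterminated"))         # start tag only
print(is_balanced_content("</b>"))                    # end tag only
```

Only the first call reports True. An entity body of the second or third
kind is exactly what the spec rules out, so a wrapper never has to
splice half an element into the token stream that dxml hands it.
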
The wrapper / higher-level parser would scan the slices returned by
dxml for entity references, and substitute them accordingly, which may
involve handing the body of the entity to another instance of dxml to
parse any tags that may be nested in there.

The nastiness involving partially-formed entity references (as seen in
the above examples) apparently only applies inside the DOCTYPE
declaration, so AIUI this can be handled by the higher-level parser as
part of replacing inline entities with their replacement text.

(The higher-level parser has a pretty tall order to fill, though,
because entities can refer to remote resources via URI, meaning that an
innocuous-looking 5-line XML file can potentially expand to terabytes
of XML tags downloaded from who knows how many external resources
recursively. Not to mention a bunch of security issues like those
described below.)

> There's also the issue that entity references open a whole can of
> worms concerning security. It's quite possible to have an exponential
> growing entity replacement that can take down any parser.
>
> <!DOCTYPE root [
> <!ELEMENT root ANY>
> <!ENTITY LOL "LOL">
> <!ENTITY LOL1 "&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;">
> <!ENTITY LOL2
> "&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;">
> <!ENTITY LOL3
> "&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;">
> <!ENTITY LOL4
> "&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;">
> <!ENTITY LOL5
> "&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;">
> <!ENTITY LOL6
> "&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;">
> <!ENTITY LOL7
> "&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;">
> <!ENTITY LOL8
> "&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;">
> <!ENTITY LOL9
> "&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;">
> ]>
> <root>&LOL9;</root>
>
> Hope you have enough memory (this expands to a 3 000 000 000 LOL's)
[...]

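
To put a number on the quoted example without actually running out of
memory, here is a rough Python sketch that computes the expanded size
of an entity without ever building the string. It is not a real DTD
processor: the function name expanded_length, the simplified
entity-reference regex, and the dict standing in for parsed <!ENTITY>
declarations are all my own assumptions, and predefined entities,
character references and parameter entities are ignored.

```python
import re

# Simplified pattern for a general entity reference such as "&LOL3;".
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.-]*);")


def expanded_length(name, entities):
    """Character count of entity `name` after full expansion.

    Sizes are memoized per entity, so the work is linear in the size of
    the declarations rather than in the size of the expansion.
    """
    cache = {}
    in_progress = set()

    def length(n):
        if n in cache:
            return cache[n]
        if n in in_progress:
            raise ValueError("entity %r refers to itself" % n)
        in_progress.add(n)
        text = entities[n]
        total, pos = 0, 0
        for m in ENTITY_REF.finditer(text):
            total += m.start() - pos      # literal text before the reference
            total += length(m.group(1))   # size of the referenced entity
            pos = m.end()
        total += len(text) - pos          # literal text after the last reference
        in_progress.discard(n)
        cache[n] = total
        return total

    return length(name)


# Rebuild the quoted declarations: LOL, then LOL1..LOL9 with ten references each.
entities = {"LOL": "LOL"}
previous = "LOL"
for i in range(1, 10):
    entities["LOL%d" % i] = ("&%s;" % previous) * 10
    previous = "LOL%d" % i

print(expanded_length("LOL9", entities))  # 3000000000
```

So &LOL9; comes out as one billion copies of "LOL", i.e. 3 000 000 000
characters, from under a kilobyte of declarations. A wrapper could run
a count like this before substituting anything and refuse to expand
past some budget.
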
Yeah, after reading through relevant portions of the spec, I have to
say that full DTD support is a HUGE can of worms. I tip my hat in
advance to the brave soul (or poor fool :-P) who would attempt to
implement the spec in full. :-D

There are ways to deal with exponential entity growth, e.g., if the
expansion is carried out lazily. But it's still a DoS vulnerability if
the software then spins practically forever trying to traverse the huge
range of stuff being churned out.

Not to mention that having embedded external references is itself a
security issue, particularly since the partial entity formation thing
can be used to obfuscate the real URI of a referenced entity, so you
could potentially trick a remote XML parser into downloading stuff from
questionable sources. It could be used as a covert surveillance method,
for example, or a malware delivery vector, if combined with an
exploitable bug in the parser code. Or it could be used to read
sensitive files (e.g., if an entity references file:///etc/passwd or
some such system file). Ick.

Ironically, the general advice I found online w.r.t. XML
vulnerabilities is "don't allow DTDs", "don't expand entities", "don't
resolve externals", etc. There also aren't many XML parsers out there
that fully support all the features called for in the spec. IOW, this
basically amounts to "just use dxml and forget about everything else".
:-D

Now of course, there *are* valid use cases for DTDs... but a naïve
implementation of the spec is only going to end in tears. My current
inclination is, just merge dxml into Phobos, then whoever dares
implement DTD support can do so on top of dxml, and shoulder their own
responsibility for vulnerabilities or whatever.

(I mean, seriously, just for the sake of being able to say "my XML is
validated" we have to implement network access, local filesystem
access, a security framework, and what amounts to a sandbox to control
pathological behaviour like exponentially recursive entities?
And all of this, just to handle rare corner cases? That's completely
ridiculous. It's an obvious design smell to me. The only thing missing
from this poisonous mix is Turing completeness, which would have made
XML a hackers' heaven. Oh wait, on further googling, I see that XSLT
*is* Turing complete. Great, just great. Now I know why I've always had
this gut feeling that *something* is off about the whole XML mania.)


T

--
English is useful because it is a mess. Since English is a mess, it
maps well onto the problem space, which is also a mess, which we call
reality. Similarly, Perl was designed to be a mess, though in the
nicest of all possible ways. -- Larry Wall
