On Tue, Feb 13, 2018 at 03:00:59PM -0700, Jonathan M Davis via Digitalmars-d-announce wrote:
[...]
> The big problem is how the entity references affect the parsing. If
> start tags can be dropped in and affect the parsing (and it's still
> not clear to me from the spec whether that's legal - there is a
> section talking about being nested properly which might indicate that
> that's not legal, but it's not very specific or clear), and if it's
> legal to do something like use an entity reference for a tag name -
> e.g. <&foo;>, then that's a serious problem. And problems like that
> are the main reason why I completely dropped any attempt to do
> anything with the DTD section.

AFAICT, section 4.3.2 in the spec (probably the one you're referring to) seems to be saying that you can't do that:

    A consequence of well-formedness in general entities is that the
    logical and physical structures in an XML document are properly
    nested; no start-tag, end-tag, empty-element tag, element, comment,
    processing instruction, character reference, or entity reference
    can begin in one entity and end in another.

> If entity references are only legal in the text between start and end
> tags and between the quotes of attribute values, and whatever they're
> replaced with cannot actually affect anything else in the XML document
> (i.e. it can't just be a start or end tag or anything like that - it
> has to be fully parseable on its own and not affect the parsing of
> the document itself), then passing them along should be fine.

That's the approach I'm thinking of.

[...]
> Regardless, there's no risk of dxml's parser ever being changed to
> actually replace entity references. That doesn't work with returning
> slices of the original input, and it really doesn't work with a parser
> that's just supposed to take a range of characters and parse it. To
> fully handle all of the DTD stuff means actually reading files from
> disk or from the internet - which of course is where the security
> problems come in, but it also means that you're not just dealing with
> a parser anymore. In principle, dxml's parser should be pure (though
> some implementation details make it so that it isn't right now),
> whereas an XML parser that fully handles the DTD section could never
> be pure.
[...]

Given the insane complexities of DTD that I'm only slowly beginning to grasp from actually reading the spec, I'm quickly adopting the opinion that dxml should remain as-is, and any DTD implementation should be layered on top.

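As a rough illustration of that layering (in Python for brevity; every name here is hypothetical and none of it is dxml API): a layer on top takes text in which the parser has left entity references untouched, checks that each DTD-defined replacement text is self-contained in the sense of section 4.3.2, and only then expands it. The wrapper-element trick and the depth cap are assumptions of this sketch, not something the spec or dxml prescribes.

```python
import re
import xml.etree.ElementTree as ET
from typing import Dict

# Hypothetical helper names; this is a sketch, not dxml's API.
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.\-]*);")
PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}


class EntityError(ValueError):
    """Raised when a replacement text is unbalanced or nests too deeply."""


def is_self_contained(replacement: str) -> bool:
    """Return True if all markup in `replacement` begins and ends inside it.

    The text is wrapped in a dummy root element and handed to a stock
    parser. References to non-predefined entities are blanked out first,
    because the stock parser would reject them as undefined.
    """
    masked = ENTITY_REF.sub(
        lambda m: m.group(0) if m.group(1) in PREDEFINED else "", replacement
    )
    try:
        ET.fromstring("<wrapper>" + masked + "</wrapper>")
    except ET.ParseError:
        return False
    return True


def expand(text: str, entities: Dict[str, str], max_depth: int = 8) -> str:
    """Expand DTD-defined entity references in `text`.

    Predefined and unknown references are passed through untouched.
    """

    def replace(match: "re.Match[str]", depth: int) -> str:
        name = match.group(1)
        if name in PREDEFINED or name not in entities:
            return match.group(0)
        if depth >= max_depth:
            raise EntityError("entity nesting too deep at &%s;" % name)
        body = entities[name]
        if not is_self_contained(body):
            raise EntityError("replacement text of &%s; is not properly nested" % name)
        # Expand references that appear inside the replacement text itself.
        return ENTITY_REF.sub(lambda m: replace(m, depth + 1), body)

    return ENTITY_REF.sub(lambda m: replace(m, 0), text)


entities = {"who": "<b>&name;</b>", "name": "T"}
print(expand("Hi &who; &amp; &unknown;", entities))
# prints: Hi <b>T</b> &amp; &unknown;
```

A replacement text such as `<b>open` is rejected here, since the tag would have to end in another entity. A self-referencing entity trips the depth cap. The depth cap alone does not bound the size of the output, so a real layer would presumably cap the total expanded length as well.
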
The only potential changes that might be needed are:

- provide a way to parse XML snippets that don't have a <?xml ...> declaration, so that a DTD implementation could, for example, hand an entity body over to dxml to extract any tags that may be nested in there (and if my reading of section 4.3.2 is correct, all such tags must always be closed inside the entity body, so there should be no errors produced).

- provide some way of hooking into non-default entities so that DTD-defined entities can be expanded by the DTD implementation. This could be as simple as leaving such entities untouched in the returned range, or inventing a special EntityType representing such entities (with a slice of the input containing the entity name) so that the DTD implementation can insert the replacement text.

Everything else should be handled by the DTD layer, e.g., parsing the DOCTYPE section (which is itself pretty pathological, given the actual examples in the W3C spec to this effect), expanding entities, looking up external entities, limiting recursive entity expansion, implementing a security model, etc.


T

-- 
Why do conspiracy theories always come from the same people??