On Monday, February 12, 2018 07:59:24 H. S. Teoh via Digitalmars-d-announce wrote:
> On Mon, Feb 12, 2018 at 07:04:38AM -0700, Jonathan M Davis via
> Digitalmars-d-announce wrote:
> [...]
> > However, if folks as a whole think that Phobos' xml parser needs to
> > support the DTD section to be acceptable, then dxml won't replace
> > std.xml, because dxml is not going to implement DTD support. DTD
> > support fundamentally does not fit in with dxml's design.
>
> Actually, thinking about this, I'm wondering if a combination of
> preprocessing and/or postprocessing might make it possible to implement
> DTD support without needing to rewrite the guts of dxml. AIUI, dxml does
> parse the DTD section correctly, i.e., as an XML directive, but only
> doesn't look into its internal details. So one way to implement DTD
> support might be:
>
> - Write an auxiliary parser that's basically a wrapper around dxml,
>   forwarding XML events to the caller, except:
>   - If a DTD event is encountered, eagerly parse it, store DTD
>     declarations internally for future reference.
>   - If there's a DTD that has been seen, perform on-the-fly validation as
>     XML events are forwarded.
>   - In PCDATA sections, if there are entity references to the DTD, expand
>     them, possibly inserting more XML events into the stream based on
>     what's defined in the DTD. (This may need to reuse some dxml internals
>     to parse XML snippets that might be contained in an entity definition,
>     for example.)

The core problem is that entity references get replaced with more XML that needs to be parsed. So, they can't simply be passed on for post-processing. As I understand it, they have to be replaced while the parsing is going on. And that means that you can't do something like return slices of the original input that don't bother with the entity references and then have a separate parser take that and process it further to deal with the entity references. The first parser has to deal with them, and that means not returning slices of the original input unless you're dealing purely with strings and are willing to allocate new strings in the cases where the data needs to be mutated because of an entity reference.

If we were going to stick to strings and only strings, it would be quite possible to define the API in a way that it may or may not do DTD processing, but that doesn't work with arbitrary ranges of characters, not unless you give up on returning slices of the original input, and that means harming the performance and usability for the common case in order to support DTDs.

Also, anything that has the concept of "events" would be drastically different from what dxml does. dxml is completely range-based. It has no callbacks or anything of the sort, and having anything like that would complicate it considerably. There are lots of interesting things that could be done to try and deal with the DTD section, but they fundamentally don't work with returning slices of the original input unless you're only using strings.

In any case, I refuse to change dxml so that it has DTD support, and I refuse to change it so that it doesn't return slices of the original input. If I were to do so, it would make the parser worse for any use case I care about and require a lot of time and effort on my part that I'm not willing to spend. So, if that makes it so that dxml is never included in Phobos, then so be it.
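
As a rough illustration of the slicing point above, here is a minimal sketch in Python rather than D. It is not how dxml is implemented, and `expand`, `ENTITY_REF`, and the entity table are hypothetical names made up for the example. Text without entity references can be handed back as a piece of the original input, but text with a reference has to be rebuilt as a new string, and the result can contain markup that still needs to be parsed.

```python
import re

# Hypothetical pattern for a general entity reference such as "&name;".
ENTITY_REF = re.compile(r"&([A-Za-z_][A-Za-z0-9_.-]*);")


def expand(text, entities):
    """Return (result, allocated) for one run of character data.

    If there is no entity reference, the input is returned untouched,
    which is the case where a parser could hand back a slice of its
    input. Otherwise a new string has to be built.
    """
    if "&" not in text:
        return text, False
    # Unknown names are left as-is; this sketch does no validation.
    result = ENTITY_REF.sub(
        lambda m: entities.get(m.group(1), m.group(0)), text
    )
    return result, True


# Stand-in for declarations that would come from a DTD section.
entities = {"greeting": "<b>hello</b>", "who": "world"}
doc = "<p>plain text</p><p>&greeting; &who;</p>"

plain, plain_allocated = expand("plain text", entities)
mixed, mixed_allocated = expand("&greeting; &who;", entities)

print(plain in doc)    # the plain text is still a piece of the input
print(mixed)           # the expanded text is a newly built string
print(mixed in doc)    # and it no longer appears in the input
print("<b>" in mixed)  # it also contains markup that must be parsed
```

With strings, the second case just costs an allocation. With an arbitrary range of characters, there is no slice of the input that could represent the expanded text at all.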

Folks are free to decide to support dxml for inclusion when the time comes and free to vote it as unacceptable. Personally, I think that dxml's approach is ideal for XML that doesn't use entity references, and I'd much rather use that kind of parser regardless of whether it's in the standard library or not. I think that the D community would be far better off with std.xml being replaced by dxml, but whatever happens happens. I'd be just as fine with a decision to remove std.xml and not include dxml. I'm less fine with std.xml being left in Phobos and dxml being rejected, because std.xml has been recognized as bad, and it sure doesn't look like anyone else is going to write a replacement any time soon.

I also think that dxml's approach is better for the common case than anything that supported DTDs would be, so I think that having dxml's solution in Phobos would be better for the community even if Phobos also had a solution that supported DTDs, but at this point, it looks like the options are going to be

1. std.xml stays and continues to suck.
2. std.xml gets ripped out and dxml replaces it.
3. std.xml gets ripped out and we have no XML solution in Phobos.

But as it stands, it doesn't seem likely that an XML solution that supports DTDs is going to end up in Phobos any time soon, if ever, because AFAIK, only three people have put in any real effort towards replacing std.xml since 2010 or whenever it was that we decided it needed to be replaced. The first two people both disappeared into oblivion without ever finishing, and here I am with a working StAX parser (now with DOM support) and an XML writer in the works - and given how involved I am with D, I think that it's pretty unlikely that I'm disappearing anywhere short of getting hit by a bus or whatnot.

So, at least I've actually put in the time and effort towards a solution and made it available, and it will almost certainly be an essentially complete solution by the time that dconf rolls around if not well before. So, I do expect that the question of Phobos inclusion will ultimately be a question of whether std.xml _ever_ gets replaced, but regardless, at least there is a solution, and it will continue to be available as a 3rd party library even if it never makes it into Phobos.

- Jonathan M Davis
