Re: dxml 0.2.0 released

2018-09-13 Thread H. S. Teoh via Digitalmars-d-announce
On Thu, Aug 30, 2018 at 07:26:28PM +0000, nkm1 via Digitalmars-d-announce wrote:
> On Monday, 12 February 2018 at 16:50:16 UTC, Jonathan M Davis wrote:
> > Folks are free to decide to support dxml for inclusion when the time
> > comes and free to vote it as unacceptable. Personally, I think that
> > dxml's approach is ideal for XML that doesn't use entity references,
> > and I'd much rather use that kind of parser regardless of whether
> > it's in the standard library or not. I think that the D community
> > would be far better off with std.xml being replaced by dxml, but
> > whatever happens happens.

+1.  I vote for adding dxml to Phobos.


[...]
> I'm using dxml now, and it's a very good library. So I thought "it
> should be in Phobos instead of std.xml" and searched the newsgroup.
> Sorry for necroposting. Anyway, what I wanted to say is just take an
> example from Perl and call it std.xml.simple. Then people would know
> what to expect from it and would use it (because everyone likes
> simple). That would also leave a way to include std.xml.full (or some
> such) at some indefinite point in the future. Which is, in practice,
> probably never - and that's fine, because who needs DTD? screw it...
[...]

That's a good idea, actually.  That will stop people who expect full
DTD support from complaining that it's not supported by the standard
library.

I vote for adding dxml to Phobos as std.xml.simple.  We can either leave
std.xml as-is, or deprecate it and work on std.xml.full (or
std.xml.complex, or whatever).  The current state of std.xml gives a
poor impression to anyone coming to D for the first time and wanting to work
with XML, and having std.xml.simple would be a big plus.


T

-- 
This is not a sentence.


Re: dxml 0.2.0 released

2018-08-30 Thread nkm1 via Digitalmars-d-announce
On Monday, 12 February 2018 at 16:50:16 UTC, Jonathan M Davis wrote:
> Folks are free to decide to support dxml for inclusion when the
> time comes and free to vote it as unacceptable. Personally, I
> think that dxml's approach is ideal for XML that doesn't use
> entity references, and I'd much rather use that kind of parser
> regardless of whether it's in the standard library or not. I
> think that the D community would be far better off with std.xml
> being replaced by dxml, but whatever happens happens.

Bump!
I'm using dxml now, and it's a very good library. So I thought 
"it should be in Phobos instead of std.xml" and searched the 
newsgroup. Sorry for necroposting. Anyway, what I wanted to say 
is just take an example from Perl and call it std.xml.simple. 
Then people would know what to expect from it and would use it 
(because everyone likes simple). That would also leave a way to 
include std.xml.full (or some such) at some indefinite point in 
the future. Which is, in practice, probably never - and that's 
fine, because who needs DTD? screw it...

Anyway, thanks for the library, Jonathan.


Re: dxml 0.2.0 released

2018-02-23 Thread Jesse Phillips via Digitalmars-d-announce
On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis wrote:
> dxml 0.2.0 has now been released.
> Documentation: http://jmdavisprog.com/docs/dxml/0.2.0/
> Github: https://github.com/jmdavis/dxml/tree/v0.2.0
> Dub: http://code.dlang.org/packages/dxml
>
> - Jonathan M Davis


This is absolutely awesome. It is a little low level (compared to
SAX) so there is more to deal with, but having it provide a flat
range makes the ordering of elements so much clearer. If I need to
handle nesting then I can build that out, but if I don't I can just
fly by the seat of my pants and grab the elements I want.


This will definitely be my go-to for XML parsing.


Re: dxml 0.2.0 released

2018-02-15 Thread jmh530 via Digitalmars-d-announce
On Thursday, 15 February 2018 at 02:40:03 UTC, Jonathan M Davis wrote:
> LOL. That's actually part of what makes writing range-based
> libraries so much harder to get right than simply using ranges
> in your program. [snip]


That sounds like an interesting topic for a blog post.


Re: dxml 0.2.0 released

2018-02-14 Thread Jonathan M Davis via Digitalmars-d-announce
On Thursday, February 15, 2018 01:55:28 rikki cattermole via Digitalmars-d-
announce wrote:
> On 14/02/2018 5:13 PM, Jonathan M Davis wrote:
> > On Wednesday, February 14, 2018 14:09:21 rikki cattermole via
> > Digitalmars-d-announce wrote:
> >> On 14/02/2018 2:02 PM, Adrian Matoga wrote:
> >>> On Wednesday, 14 February 2018 at 10:57:26 UTC, rikki cattermole wrote:
> >>>> See lines:
> >>>> - Input!IR temp = input;
> >>>> - input = temp;
> >>>>
> >>>> bool commentLine() {
> >>>>     Input!IR temp = input;
> >>>>
> >>>>     (...)
> >>>>     if (!temp.empty) {
> >>>>         (...)
> >>>>         input = temp;
> >>>>         return true;
> >>>>     } else
> >>>>         return false;
> >>>> }
> >>>
> >>> `temp = input.save` is exactly what you want here, which means forward
> >>> range is required. Your example won't work for range objects with
> >>> reference semantics.
> >>
> >> Ah I must be thinking of ranges that support indexing.
> >
> > Random access ranges are also forward ranges and would require a call to
> > save here.
> >
> > - Jonathan M Davis
>
> Luckily in my code I can forget that ;)

LOL. That's actually part of what makes writing range-based libraries so
much harder to get right than simply using ranges in your program. When a
piece of code is used with only a few types of ranges (or even only one type
of range, as is often the case), then it's generally not very hard to write
code that works just fine, but as soon as you have to worry about arbitrary
ranges, you get all kinds of nonsense that you have to worry about in order
to make sure that the code works correctly for any range that's passed to
it. save is the classic example of something that a lot of range-based code
gets wrong, because for most ranges, it really doesn't matter, but for those
ranges where it does, a single missed call to save results in code that
doesn't work properly. To get it right, you basically have to call save
every time you pass a range to a range-based function that is not supposed
to consume the range, and folks rarely get that right. Certainly, pretty
much any range-based code that doesn't have unit tests which include
reference-type ranges is going to be wrong for reference-type ranges. Even
Phobos has had quite a few issues with that historically.

- Jonathan M Davis
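
The point about missed calls to save can be made concrete with a small sketch. Everything named here (Counter, countElements, countElementsBuggy) is hypothetical and written only for illustration; it assumes nothing beyond std.range.primitives. The buggy version looks fine when tested with arrays and only misbehaves once a range with reference semantics is passed in.

```d
import std.range.primitives : isForwardRange, empty, front, popFront, save;

/// Hypothetical forward range with reference semantics (a class),
/// yielding the integers cur .. end.
final class Counter
{
    private int cur, end;
    this(int cur, int end) { this.cur = cur; this.end = end; }
    @property bool empty() const { return cur >= end; }
    @property int front() const { return cur; }
    void popFront() { ++cur; }
    /// save must return an independent copy of the iteration state.
    @property Counter save() { return new Counter(cur, end); }
}

/// Counts elements on a saved copy, so the caller's range is untouched.
size_t countElements(R)(R range) if (isForwardRange!R)
{
    size_t n;
    for (auto copy = range.save; !copy.empty; copy.popFront())
        ++n;
    return n;
}

/// Same logic without save: correct for arrays, wrong for class ranges.
size_t countElementsBuggy(R)(R range) if (isForwardRange!R)
{
    size_t n;
    for (; !range.empty; range.popFront())
        ++n;
    return n;
}

void main()
{
    int[] arr = [1, 2, 3];
    assert(countElementsBuggy(arr) == 3);
    assert(arr.length == 3);   // the slice was copied, so the bug stays hidden

    auto r = new Counter(0, 5);
    assert(countElements(r) == 5);
    assert(!r.empty);          // still usable afterwards
    assert(countElementsBuggy(r) == 5);
    assert(r.empty);           // silently consumed: the missing save shows up
}
```

This is why unit tests for range-based code are usually worth running against at least one reference-type range in addition to arrays and strings.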



Re: dxml 0.2.0 released

2018-02-14 Thread rikki cattermole via Digitalmars-d-announce

On 14/02/2018 5:13 PM, Jonathan M Davis wrote:
> On Wednesday, February 14, 2018 14:09:21 rikki cattermole via
> Digitalmars-d-announce wrote:
>> On 14/02/2018 2:02 PM, Adrian Matoga wrote:
>>> On Wednesday, 14 February 2018 at 10:57:26 UTC, rikki cattermole wrote:
>>>> See lines:
>>>> - Input!IR temp = input;
>>>> - input = temp;
>>>>
>>>> bool commentLine() {
>>>>     Input!IR temp = input;
>>>>
>>>>     (...)
>>>>     if (!temp.empty) {
>>>>         (...)
>>>>         input = temp;
>>>>         return true;
>>>>     } else
>>>>         return false;
>>>> }
>>>
>>> `temp = input.save` is exactly what you want here, which means forward
>>> range is required. Your example won't work for range objects with
>>> reference semantics.
>>
>> Ah I must be thinking of ranges that support indexing.
>
> Random access ranges are also forward ranges and would require a call to
> save here.
>
> - Jonathan M Davis

Luckily in my code I can forget that ;)


Re: dxml 0.2.0 released

2018-02-14 Thread Jonathan M Davis via Digitalmars-d-announce
On Wednesday, February 14, 2018 14:09:21 rikki cattermole via Digitalmars-d-
announce wrote:
> On 14/02/2018 2:02 PM, Adrian Matoga wrote:
> > On Wednesday, 14 February 2018 at 10:57:26 UTC, rikki cattermole wrote:
> >> See lines:
> >> - Input!IR temp = input;
> >> - input = temp;
> >>
> >> bool commentLine() {
> >>     Input!IR temp = input;
> >>
> >>     (...)
> >>     if (!temp.empty) {
> >>         (...)
> >>         input = temp;
> >>         return true;
> >>     } else
> >>         return false;
> >> }
> >
> > `temp = input.save` is exactly what you want here, which means forward
> > range is required. Your example won't work for range objects with
> > reference semantics.
>
> Ah I must be thinking of ranges that support indexing.

Random access ranges are also forward ranges and would require a call to
save here.

- Jonathan M Davis



Re: dxml 0.2.0 released

2018-02-14 Thread rikki cattermole via Digitalmars-d-announce

On 14/02/2018 2:02 PM, Adrian Matoga wrote:
> On Wednesday, 14 February 2018 at 10:57:26 UTC, rikki cattermole wrote:
>> See lines:
>> - Input!IR temp = input;
>> - input = temp;
>>
>> bool commentLine() {
>>     Input!IR temp = input;
>>
>>     (...)
>>     if (!temp.empty) {
>>         (...)
>>         input = temp;
>>         return true;
>>     } else
>>         return false;
>> }
>
> `temp = input.save` is exactly what you want here, which means forward
> range is required. Your example won't work for range objects with
> reference semantics.

Ah I must be thinking of ranges that support indexing.


Re: dxml 0.2.0 released

2018-02-14 Thread Adrian Matoga via Digitalmars-d-announce
On Wednesday, 14 February 2018 at 10:57:26 UTC, rikki cattermole wrote:
> See lines:
> - Input!IR temp = input;
> - input = temp;
>
> bool commentLine() {
>     Input!IR temp = input;
>
>     (...)
>     if (!temp.empty) {
>         (...)
>         input = temp;
>         return true;
>     } else
>         return false;
> }


`temp = input.save` is exactly what you want here, which means 
forward range is required. Your example won't work for range 
objects with reference semantics.
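
A minimal sketch of the save-then-commit pattern being suggested, assuming a plain character range rather than the Input!IR type from the quoted code (skipLineComment is a made-up name, not part of any library). The function works on a saved copy and assigns it back only on success, so a failed match leaves the caller's range where it was even for ranges with reference semantics.

```d
import std.range.primitives : isForwardRange, empty, front, popFront, save;

/// Tries to skip a `//` line comment at the start of `input`.
/// On success, `input` is advanced to the end of the line and true is
/// returned; on failure, `input` is left exactly as it was.
bool skipLineComment(R)(ref R input) if (isForwardRange!R)
{
    auto temp = input.save;        // independent copy, not an alias

    foreach (_; 0 .. 2)            // the comment must start with two slashes
    {
        if (temp.empty || temp.front != '/')
            return false;          // nothing committed, input untouched
        temp.popFront();
    }

    while (!temp.empty && temp.front != '\n')
        temp.popFront();

    input = temp;                  // commit only after a full match
    return true;
}

void main()
{
    string src = "// a comment\nint x;";
    assert(skipLineComment(src));
    assert(src == "\nint x;");

    string notComment = "/ 2";
    assert(!skipLineComment(notComment));
    assert(notComment == "/ 2");
}
```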


Re: dxml 0.2.0 released

2018-02-14 Thread rikki cattermole via Digitalmars-d-announce

On 14/02/2018 10:32 AM, Jonathan M Davis wrote:
> On Wednesday, February 14, 2018 10:14:44 Kagamin via Digitalmars-d-announce
> wrote:
>> It looks like EntityRange requires forward range, is it ok for a
>> parser?
>
> It's very difficult in general to write a parser that isn't at least a
> forward range, because without that, you're stuck at only one character of
> look ahead unless you play a lot of games with putting data from the input
> range in a buffer so that you can keep it around to look at it again after
> you've looked farther ahead.
>
> Honestly, pure input ranges are borderline useless for a _lot_ of cases.
> It's generally only the cases where you only care about operating on each
> element individually irrespective of what's going on with other elements in
> the range that pure input ranges are really useable, and parsing definitely
> doesn't fall into that camp.
>
> - Jonathan M Davis


See lines:
- Input!IR temp = input;
- input = temp;

    bool commentLine() {
        Input!IR temp = input;

        if (!temp.empty && temp.front.c == '/') {
            temp.popFront;
            if (!temp.empty && temp.front.c == '/')
                temp.popFront;
            else
                return false;
        } else
            return false;

        if (!temp.empty) {
            size_t endOffset = temp.front.location.fileOffset;

            while(temp.front.location.lineOffset != 0) {
                endOffset = temp.front.location.fileOffset;
                temp.popFront;

                if (temp.empty) {
                    endOffset++;
                    break;
                }
            }

            current.type = Token.Type.Comment_Line;
            current.location = input.front.location;
            current.location.length = endOffset -
                input.front.location.fileOffset;

            input = temp;
            return true;
        } else
            return false;
    }


Re: dxml 0.2.0 released

2018-02-14 Thread Jonathan M Davis via Digitalmars-d-announce
On Wednesday, February 14, 2018 10:14:44 Kagamin via Digitalmars-d-announce 
wrote:
> It looks like EntityRange requires forward range, is it ok for a
> parser?

It's very difficult in general to write a parser that isn't at least a
forward range, because without that, you're stuck at only one character of
look ahead unless you play a lot of games with putting data from the input
range in a buffer so that you can keep it around to look at it again after
you've looked farther ahead.

Honestly, pure input ranges are borderline useless for a _lot_ of cases.
It's generally only the cases where you only care about operating on each
element individually irrespective of what's going on with other elements in
the range that pure input ranges are really useable, and parsing definitely
doesn't fall into that camp.

- Jonathan M Davis
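
As a rough illustration of why more than one character of lookahead calls for a forward range, here is a sketch of a peek helper. The name peekStartsWith is hypothetical and this is not how dxml is implemented; it only shows that looking several characters ahead without consuming them needs an independent copy of the range, which is exactly what save provides and what a pure input range cannot offer without buffering.

```d
import std.range.primitives : isForwardRange, empty, front, popFront, save;

/// Returns true if `input` begins with the ASCII text `prefix`.
/// All popping happens on a saved copy, so `input` is never consumed.
bool peekStartsWith(R)(R input, string prefix) if (isForwardRange!R)
{
    auto look = input.save;
    foreach (char c; prefix)       // iterate code units; prefix is assumed ASCII
    {
        if (look.empty || look.front != c)
            return false;
        look.popFront();
    }
    return true;
}

void main()
{
    string doc = "<!-- note --><root/>";

    // Four characters of lookahead are needed to tell a comment from
    // other markup that also starts with '<'.
    assert(peekStartsWith(doc, "<!--"));
    assert(!peekStartsWith(doc, "<?xml"));
    assert(doc.front == '<');      // nothing was consumed
}
```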



Re: dxml 0.2.0 released

2018-02-14 Thread Chris via Digitalmars-d-announce

On Tuesday, 13 February 2018 at 22:13:36 UTC, H. S. Teoh wrote:
> Ironically, the general advice I found online w.r.t XML
> vulnerabilities is "don't allow DTDs", "don't expand entities",
> "don't resolve externals", etc..  There also aren't many XML
> parsers out there that fully support all the features called
> for in the spec.  IOW, this basically amounts to "just use dxml
> and forget about everything else". :-D
>
> Now of course, there *are* valid use cases for DTDs... but a
> naïve implementation of the spec is only going to end in tears.
> My current inclination is, just merge dxml into Phobos, then
> whoever dares implement DTD support can do so on top of dxml,
> and shoulder their own responsibility for vulnerabilities or
> whatever.  (I mean, seriously, just for the sake of being able
> to say "my XML is validated" we have to implement network
> access, local filesystem access, a security framework, and what
> amounts to a sandbox to control pathological behaviour like
> exponentially recursive entities?  And all of this, just to
> handle rare corner cases?  That's completely ridiculous.  It's
> an obvious design smell to me.  The only thing missing from
> this poisonous mix is Turing completeness, which would have
> made XML hackers' heaven.  Oh wait, on further googling, I see
> that XSLT *is* Turing complete.  Great, just great.   Now I
> know why I've always had this gut feeling that *something* is
> off about the whole XML mania.)
>
> T


Thanks for the analysis. I'd say you're right. It makes no sense
to keep dxml from becoming std.xml's successor only because it
doesn't support DTDs. Also, as I said before, if we had DTD
support in std.xml, people would complain about the lack of
efficiency, and the discussions about interpreting the specs
correctly and implementing them 100%, along with complaints about
the lack of security, would just never end.


Re: dxml 0.2.0 released

2018-02-14 Thread Jonathan M Davis via Digitalmars-d-announce
On Wednesday, February 14, 2018 10:03:45 Patrick Schluter via Digitalmars-d-
announce wrote:
> On Tuesday, 13 February 2018 at 22:00:59 UTC, Jonathan M Davis wrote:
> > On Tuesday, February 13, 2018 21:18:12 Patrick Schluter via
> > Digitalmars-d-announce wrote:
> >> [...]
> >
> > Well, if dxml just passes the entity references along unparsed
> > beyond validating that the entity reference itself contains
> > valid characters (e.g. it's not something like &.; or & by
> > itself), then dxml would still not be replacing the entity
> > references with anything. Any security or performance problems
> > associated with entity references would be left up to whatever
> > parser parsed the DTD section and then used dxml to parse the
> > rest of the XML and replaced the entity references in dxml's
> > parsing results with whatever they were.
> >
> > The big problem is how the entity references affect the
> > parsing. If start tags can be dropped in and affect the parsing
> > (and it's still not clear to me from the spec whether that's
> > legal - there is a section talking about being nested properly
> > which might indicate that that's not legal, but it's not very
> > specific or clear), and if it's legal to do something like use
> > an entity reference for a tag name - e.g. <&foo;>, then that's
> > a serious problem. And problems like that are the main reason
> > why I completely dropped any attempt to do anything with the
> > DTD section.
>
> Yikes! In any case, even if I had to implement a parser I would
> tend to not implement this "feature" as it sounds quite
> unreasonable. Only if a real need (i.e. one in the real world,
> not one that could be contrived out of the specs) arises would I
> then potentially implement the real deal.

Well, since folks other than me are going to use this parser, and it's even
potentially going to end up in D's standard library, it needs to at least be
good enough to not let through invalid XML or incorrectly interpret any XML.
It can potentially not support portions of the spec as long as it does so in
a clear and clean manner, but it's going to have to correctly handle
anything that it does handle.

For better or worse, I'm the sort of person who prefers to completely
implement a spec when I'm implementing one, but in this case, it wasn't
really reasonable. Fortunately however, from the perspective of implementing
something that's useful for me personally, the DTD section is completely
unnecessary. From that perspective, processing instructions and CDATA
sections are also unnecessary, since I'd never do anything with them, but I
don't think that it would be reasonable to skip those, so they're
implemented. And it's not like they're hard to implement support for, unlike
the DTD section.

- Jonathan M Davis



Re: dxml 0.2.0 released

2018-02-14 Thread Kagamin via Digitalmars-d-announce

On Tuesday, 13 February 2018 at 22:29:27 UTC, H. S. Teoh wrote:
> - provide some way of hooking into non-default entities so that
>   DTD-defined entities can be expanded by the DTD implementation.

The parser now returns raw text, so entity replacement can be done
by a DTD processor without any modification of the API. So it's good
for experimental if there's an incentive to maintain it, but it's
purely a PR problem: there's nothing wrong with having XML support in
the dub registry and std.xml in Phobos; if Phobos is OK with it, it
can stay as is.

It looks like EntityRange requires forward range, is it ok for a
parser?


Re: dxml 0.2.0 released

2018-02-14 Thread Patrick Schluter via Digitalmars-d-announce
On Tuesday, 13 February 2018 at 22:00:59 UTC, Jonathan M Davis wrote:
> On Tuesday, February 13, 2018 21:18:12 Patrick Schluter via
> Digitalmars-d-announce wrote:
>> [...]
>
> Well, if dxml just passes the entity references along unparsed
> beyond validating that the entity reference itself contains
> valid characters (e.g. it's not something like &.; or & by
> itself), then dxml would still not be replacing the entity
> references with anything. Any security or performance problems
> associated with entity references would be left up to whatever
> parser parsed the DTD section and then used dxml to parse the
> rest of the XML and replaced the entity references in dxml's
> parsing results with whatever they were.
>
> The big problem is how the entity references affect the
> parsing. If start tags can be dropped in and affect the parsing
> (and it's still not clear to me from the spec whether that's
> legal - there is a section talking about being nested properly
> which might indicate that that's not legal, but it's not very
> specific or clear), and if it's legal to do something like use
> an entity reference for a tag name - e.g. <&foo;>, then that's
> a serious problem. And problems like that are the main reason
> why I completely dropped any attempt to do anything with the
> DTD section.

Yikes! In any case, even if I had to implement a parser I would
tend to not implement this "feature" as it sounds quite
unreasonable. Only if a real need (i.e. one in the real world,
not one that could be contrived out of the specs) arises would I
then potentially implement the real deal.


Re: dxml 0.2.0 released

2018-02-13 Thread Jonathan M Davis via Digitalmars-d-announce
On Tuesday, February 13, 2018 14:29:27 H. S. Teoh via Digitalmars-d-announce 
wrote:
> Given the insane complexities of DTD that I'm only slowly beginning to
> grasp from actually reading the spec, I'm quickly adopting the opinion
> that dxml should remain as-is, and any DTD implementation should be
> layered on top.  The only potential changes that might be needed are:
>
> - provide a way to parse XML snippets that don't have a <?xml...?>
>   declaration, so that a DTD implementation could, for example, hand an
>   entity body over to dxml to extract any tags that may be nested in
>   there (and if my reading of section 4.3.2 is correct, all such tags
>   must always be closed inside the entity body, so there should be no
>   errors produced).

XML 1.0 does not require the <?xml...?> section - which is the main reason
why dxml implements XML 1.0 and not 1.1. When working on one of my projects
with std_experimental_xml, I had to keep adding the <?xml...?> declaration
to the start of XML snippets in all of my tests which had to deal with
sections of an XML document, and it was _really_ annoying.

dxml does require that what it's given be a valid XML 1.0 document, which
means that you have to have exactly one root element in what it's passed,
which does limit which kind of XML snippets you pass it, but it will work
for a lot of XML snippets as-is.

> - provide some way of hooking into non-default entities so that
>   DTD-defined entities can be expanded by the DTD implementation.  This
>   could be as simple as leaving such entities untouched in the returned
>   range, or invent a special EntityType representing such entities (with
>   a slice of the input containing the entity name) so that the DTD
>   implementation can insert the replacement text.

After having actually implemented full parsing for the entire DTD section
before figuring out that references could be inserted in it just about
anywhere and that the grammar in the spec is only the grammar _after_ all of
the replacements were made (when I figured that out was when I gave up on
DTD support), I would strongly argue in favor of simply passing along entity
references as-is and leaving any and all such processing to a DTD-enabled
parser. Originally, the Config had options like SkipDTD and SkipProlog, and
I even provided a way to get at the information in the <?xml...?>
declaration if you wanted it, but all that just wasn't worth the extra
complexity.

- Jonathan M Davis
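
The layering described above, where the parser passes entity references through untouched and a separate DTD-aware layer substitutes them afterwards, could look roughly like the sketch below. It is an assumption-laden illustration: expandEntities and the replacement table are hypothetical and not part of dxml, and the pass is deliberately single-level (replacement text is not rescanned), so it cannot blow up exponentially.

```d
import std.array : appender;
import std.string : indexOf;

/// Replaces each `&name;` found in `text` with `table[name]` when the
/// name is known and leaves every other reference exactly as written.
/// Replacement text is NOT scanned again, so expansion is one level deep.
string expandEntities(string text, const string[string] table)
{
    auto result = appender!string();
    while (true)
    {
        immutable amp = text.indexOf('&');
        if (amp < 0)
            break;
        immutable semi = text.indexOf(';', cast(size_t) amp);
        if (semi < 0)
            break;                                   // stray '&': pass it along

        result.put(text[0 .. amp]);
        auto name = text[amp + 1 .. semi];
        if (auto replacement = name in table)
            result.put(*replacement);                // known DTD-defined entity
        else
            result.put(text[amp .. semi + 1]);       // unknown: leave untouched
        text = text[semi + 1 .. $];
    }
    result.put(text);                                // remaining tail
    return result.data;
}

void main()
{
    // Stand-in for whatever a DTD-aware layer would have collected.
    string[string] table = ["company": "ACME"];

    assert(expandEntities("a &company; &amp; b", table) == "a ACME &amp; b");
    assert(expandEntities("no references here", table) == "no references here");
}
```

Because the function returns a newly allocated string whenever it is called, it also shows the cost mentioned in the thread: once references are replaced, the result can no longer be a slice of the original input.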



Re: dxml 0.2.0 released

2018-02-13 Thread Jonathan M Davis via Digitalmars-d-announce
On Tuesday, February 13, 2018 14:13:36 H. S. Teoh via Digitalmars-d-announce 
wrote:
> Great, just
> great.   Now I know why I've always had this gut feeling that
> *something* is off about the whole XML mania.)

Well, there are plenty of folks who talk like XML is a pile of steaming muck
that should never be used (and then usually talk about how great JSON is). I
think that basic XML is actually pretty okay - basically the subset that
dxml supports, though if I were designing XML I'd take it a bit further.

Personally, I'd make XML documents completely recursive - meaning that the
top level is the same as any deeper level, so you could have as many element
tags at the top level as you want and as much text as you want, whereas XML
requires a root element and only allows stuff like processing instructions,
comments, and the DOCTYPE stuff outside of the root element.

I'd get rid of the <?xml...?> and <!DOCTYPE...> declarations as well as
processing instructions, and I'd probably get rid of the CDATA section in
favor of escaping characters with backslashes like you typically do in
strings (or in JSON), and related to that, I'd get rid of the predefined
entity references, making stuff like & legal. I also might get rid of empty
element tags because they're annoying to deal with when parsing, but they do
reduce the verbosity of the document such that they might be worth keeping.
It's also tempting to get rid of the tag name on end tags, which would
actually make parsing much easier, but having them helps the legibility of
XML documents, and it's a bit like semicolons in D in the sense that they
can help ensure that error messages refer to the right thing rather than
something later in the document, so I don't know. I'd also allow all Unicode
characters instead of disallowing a number of them, since it won't really
matter for most documents, and then the parser doesn't need to care about
them when validating.

So, basically, you end up with start tags, end tags, and comments, with
start tags optionally having attributes. backslashes would then be used for
escaping stuff, and you end up with something pretty dead simple.

However, as you're finding out when reading through the XML spec, the folks
who created XML didn't think like that at all, and were clearly coming from
a _very_ different point of view as to what an XML document was for and
should contain. But as you might imagine, given my take on what XML should
have been, finding out in detail what XML actually _is_ was pretty
horrifying.

I started dxml with the intention of fully implementing all aspects of the
spec but ultimately decided that it simply wasn't worth it.

- Jonathan M Davis



Re: dxml 0.2.0 released

2018-02-13 Thread Jonathan M Davis via Digitalmars-d-announce
On Tuesday, February 13, 2018 21:18:12 Patrick Schluter via Digitalmars-d-
announce wrote:
> There's also the issue that entity references open a whole can of
> worms concerning security. It is quite possible to have an
> exponentially growing entity replacement that can take down any
> parser.

Well, if dxml just passes the entity references along unparsed beyond
validating that the entity reference itself contains valid characters (e.g.
it's not something like &.; or & by itself), then dxml would still not be
replacing the entity references with anything. Any security or performance
problems associated with entity references would be left up to whatever
parser parsed the DTD section and then used dxml to parse the rest of the
XML and replaced the entity references in dxml's parsing results with
whatever they were.

The big problem is how the entity references affect the parsing. If start
tags can be dropped in and affect the parsing (and it's still not clear to
me from the spec whether that's legal - there is a section talking about
being nested properly which might indicate that that's not legal, but it's
not very specific or clear), and if it's legal to do something like use an
entity reference for a tag name - e.g. <&foo;>, then that's a serious
problem. And problems like that are the main reason why I completely dropped
any attempt to do anything with the DTD section.

If entity references are only legal in the text between start and end tags
and between the quotes of attribute values, and whatever they're replaced
with cannot actually affect anything else in the XML document (i.e. it can't
just be a start or end tag or anything like that - it has to be fully
parseable on its own and not affect the parsing of the document itself),
then passing them along should be fine.

Basically, if I can change dxml so that in the places where it currently
allows one of the standard entity references to be, it then also allows
other entity references but passes them along without replacing them instead
of throwing an XMLParsingException, and that works without having documents
be screwed up due to missing start tags or something, then passing them
along should be fine. But if entity references allow arbitrary enough chunks
of XML, that doesn't work. It also doesn't work if entity references are
allowed in places other than the text between start and end tags or within
attribute values. And it's not clear to me at all what is legal in an entity
reference or where exactly they're legal. The spec talks about the grammar
being the grammar _after_ all of the references have been replaced, which
makes the grammar rather untrustworthy, and I find the spec very hard to
understand in general.

Regardless, there's no risk of dxml's parser ever being changed to actually
replace entity references. That doesn't work with returning slices of the
original input, and it really doesn't work with a parser that's just
supposed to take a range of characters and parse it. To fully handle all of
the DTD stuff means actually reading files from disk or from the internet -
which of course is where the security problems come in, but it also means
that you're not just dealing with a parser anymore. In principle, dxml's
parser should be pure (though some implementation details make it so that it isn't
right now), whereas an XML parser that fully handles the DTD section could
never be pure.

- Jonathan M Davis
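
The "validate the reference but pass it along unparsed" idea from this message can be sketched as follows. This is a simplified, ASCII-only approximation of the XML Name rule written for illustration; isWellFormedEntityRef is a hypothetical helper, not dxml's actual check, and character references such as &#65; are deliberately out of scope.

```d
import std.ascii : isAlpha, isAlphaNum;

/// Returns true if `s` has the shape `&Name;`, using an ASCII-only
/// simplification of the XML Name production. The reference is only
/// checked for well-formedness; nothing is looked up or replaced.
bool isWellFormedEntityRef(const(char)[] s)
{
    if (s.length < 3 || s[0] != '&' || s[$ - 1] != ';')
        return false;

    auto name = s[1 .. $ - 1];

    // First character: letter, underscore or colon.
    if (!(isAlpha(name[0]) || name[0] == '_' || name[0] == ':'))
        return false;

    // Remaining characters additionally allow digits, '-' and '.'.
    foreach (c; name[1 .. $])
    {
        if (!(isAlphaNum(c) || c == '_' || c == ':' || c == '-' || c == '.'))
            return false;
    }
    return true;
}

void main()
{
    assert(isWellFormedEntityRef("&foo;"));
    assert(isWellFormedEntityRef("&my-entity.v2;"));
    assert(!isWellFormedEntityRef("&.;"));   // invalid first character
    assert(!isWellFormedEntityRef("&"));     // '&' by itself
    assert(!isWellFormedEntityRef("&;"));    // empty name
}
```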



Re: dxml 0.2.0 released

2018-02-13 Thread Patrick Schluter via Digitalmars-d-announce
On Tuesday, 13 February 2018 at 20:10:59 UTC, Jonathan M Davis wrote:
> On Tuesday, February 13, 2018 15:22:32 Kagamin via
> Digitalmars-d-announce wrote:
>> On Monday, 12 February 2018 at 16:50:16 UTC, Jonathan M Davis wrote:
>>> The core problem is that entity references get replaced with
>>> more XML that needs to be parsed. So, they can't simply be
>>> passed on for post-processing. As I understand it, they have
>>> to be replaced while the parsing is going on. And that means
>>> that you can't do something like return slices of the
>>> original input that don't bother with the entity references
>>> and then have a separate parser take that and process it
>>> further to deal with the entity references. The first parser
>>> has to deal with them, and that means not returning slices
>>> of the original input unless you're dealing purely with
>>> strings and are willing to allocate new strings in the cases
>>> where the data needs to be mutated because of an entity
>>> reference.
>>
>> Standard entities like &lt; have the same problem, so the
>> same solution should work too.
>
> That depends on what exactly an entity reference can contain.
> If it can do something like put a start tag in there, and then
> it has to be terminated by the document putting an end tag in
> there or another entity reference containing an end tag, then
> it can't be handled after the fact like &lt; can be, since
> &lt; is just replaced by text. If an entity reference can't
> contain a start tag without a matching end tag, then sure. But
> I find the XML spec to be surprisingly hard to understand with
> regards to entity references. It's not clear to me where it's
> even legal to put them or not, let alone what you're allowed to
> put in them exactly. And I can't even really trust the XML
> grammar as long as entity references are involved, because the
> grammar in the spec is the grammar _after_ entity references
> have all been replaced, which I was quite dismayed to figure
> out.
>
> If it's 100% sure that entity references can be treated as just
> text and that you can't end up with stuff like start tags or
> end tags being inserted and messing with the parsing such that
> they all have to be replaced for the XML to be correctly
> parsed, then I have no problem passing entity references along,
> and a higher level parser could try to do something with them,
> but it's not clear to me at all that an XML document with
> entity references is correct enough to be parsed while not
> replacing the entity references with whatever XML markup they
> contain. I had originally passed them along with the idea that
> a higher level parser could do something with them, but I
> decided that I couldn't do that if you could do something like
> drop a start tag in there and change the meaning of the stuff
> that needs to be parsed that isn't directly in the entity
> reference.




There's also the issue that entity references open a whole can of 
worms concerning security. It is quite possible to have an
exponentially growing entity replacement that can take down any
parser.



 
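<?xml version="1.0"?>
<!DOCTYPE lolz [
 <!ENTITY lol "lol">
 <!ELEMENT lolz (#PCDATA)>
 <!ENTITY lol1 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
 <!ENTITY lol2 "&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;">
 <!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">
 <!ENTITY lol4 "&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;">
 <!ENTITY lol5 "&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;">
 <!ENTITY lol6 "&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;">
 <!ENTITY lol7 "&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;">
 <!ENTITY lol8 "&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;">
 <!ENTITY lol9 "&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;">
]>
<lolz>&lol9;</lolz>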


Hope you have enough memory (this expands to about 3 000 000 000 characters of
LOL's)
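
The size of that expansion is easy to work out without actually expanding anything. The sketch below assumes the classic shape of this example: nine nested entity levels, each referencing the previous level ten times, with a three-character "lol" at the bottom. The function name expansionCount is made up for the illustration.

```d
import std.stdio : writeln;

/// Number of leaf strings produced when `levels` nested entities each
/// reference the previous level `fanout` times.
ulong expansionCount(uint levels, ulong fanout)
{
    ulong count = 1;
    foreach (_; 0 .. levels)
        count *= fanout;           // every level multiplies the total
    return count;
}

void main()
{
    enum leafLength = 3;           // length of "lol"

    immutable copies = expansionCount(9, 10);
    assert(copies == 1_000_000_000);
    assert(copies * leafLength == 3_000_000_000);

    writeln(copies, " copies of the leaf, ", copies * leafLength,
            " characters after full expansion");
}
```

A document of well under a kilobyte therefore asks a naively expanding parser for roughly 3 GB of text, which is why the usual advice is to refuse or cap entity expansion.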






Re: dxml 0.2.0 released

2018-02-13 Thread Kagamin via Digitalmars-d-announce
On Tuesday, 13 February 2018 at 02:53:21 UTC, Nick Sabalausky (Abscissa) wrote:
> On 02/12/2018 11:15 AM, rikki cattermole wrote:
>> dxml 7.5k LOC
>> std.xml 3k LOC
>>
>> dxml would make the situation a lot worse.
>
> 4.5k LOC == "a lot worse"?
>
> Uuuuhhh...WAT?

And it's like 2k LOC of code and 5.5k LOC of tests and docs.


Re: dxml 0.2.0 released

2018-02-13 Thread Kagamin via Digitalmars-d-announce
On Monday, 12 February 2018 at 16:50:16 UTC, Jonathan M Davis 
wrote:
The core problem is that entity references get replaced with 
more XML that needs to be parsed. So, they can't simply be 
passed on for post-processing. As I understand it, they have to 
be replaced while the parsing is going on. And that means that 
you can't do something like return slices of the original input 
that don't bother with the entity references and then have a 
separate parser take that and process it further to deal with 
the entity references. The first parser has to deal with them, 
and that means not returning slices of the original input 
unless you're dealing purely with strings and are willing to 
allocate new strings in the cases where the data needs to be 
mutated because of an entity reference.


Standard entities like &amp; have the same problem, so the same 
solution should work too.
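
One way to read that suggestion, as a rough sketch: the parser hands out raw slices, and a separate helper decodes the predefined entities on demand, allocating only when something actually has to be replaced. `decodePredefined` below is a hypothetical helper written for this illustration, not a function taken from dxml.

```d
import std.array : appender;
import std.string : indexOf;

/// Decodes the five predefined XML entities (hypothetical helper, not dxml's
/// API). Returns the input slice itself when there is nothing to decode and
/// allocates a new string only when a replacement is actually needed.
string decodePredefined(string text)
{
    static immutable string[] names = ["&amp;", "&lt;", "&gt;", "&quot;", "&apos;"];
    static immutable string[] values = ["&", "<", ">", "\"", "'"];

    if (text.indexOf('&') < 0)
        return text; // untouched slice of the original input

    auto result = appender!string();
    outer: while (text.length)
    {
        foreach (i, name; names)
        {
            if (text.length >= name.length && text[0 .. name.length] == name)
            {
                result.put(values[i]);
                text = text[name.length .. $];
                continue outer;
            }
        }
        result.put(text[0]); // ordinary character, or an unknown reference
        text = text[1 .. $];
    }
    return result.data;
}

unittest
{
    string plain = "no entities here";
    assert(decodePredefined(plain) is plain);          // same slice, no copy
    assert(decodePredefined("a &amp; b") == "a & b");  // new string
    assert(decodePredefined("&lt;x&gt;") == "<x>");
}
```

This works for the predefined entities because their replacement text is a single character and never markup, so nothing needs to be re-parsed afterwards.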


Re: dxml 0.2.0 released

2018-02-13 Thread Russel Winder via Digitalmars-d-announce
On Mon, 2018-02-12 at 14:54 +0000, rikki cattermole via
Digitalmars-d-announce wrote:
> […]
> 
> Personally I find J.M.D. arguments quite reasonable for a third-
> party 
> library, since yes it does cover 90% of the use cases.

The problem is that std.xml needs removing to make it clear there is
no good XML package in Phobos. Then people will go looking in the Dub
repository.

-- 
Russel.
===
Dr Russel Winder      t: +44 20 7585 2200
41 Buckmaster Road    m: +44 7770 465 077
London SW11 1EN, UK   w: www.russel.org.uk




Re: dxml 0.2.0 released

2018-02-13 Thread Chris via Digitalmars-d-announce

On Monday, 12 February 2018 at 21:51:56 UTC, H. S. Teoh wrote:
[...]
We can even design the DTD support wrapper to start with being 
just a thin wrapper around dxml, and lazily switch to full DTD 
mode only if a DTD section is encountered.  Then user code that 
doesn't care to use dxml's raw API won't even need to care 
about the difference.



T


In this vein, if a new version of std.xml didn't offer pure and 
fast parsing like dxml, but included DTD by default, people would 
complain that that was the real deal breaker (too slow, man!). 
Remember `autodecode`? Right.


DTD inclusion should only be available on demand. Imagine you 
want to implement a library project where ebooks (say classics) 
are catalogued and presented in an ebook reader on the web (or in 
an app on your smart phone). It is likely that the whole DTD 
thing would be done at the cataloguing stage, but once 
the books are in the library most users will probably just want 
to go through them page by page or search for quotes etc. - and 
for that you'd need a fast tool like dxml with no overhead.


Re: dxml 0.2.0 released

2018-02-12 Thread Jacob Carlborg via Digitalmars-d-announce

On 2018-02-12 21:19, Chris wrote:

A few lines of code that could be replaced easily once something better 
is available?


Fairly easy because it's so small. I'm actually using the SAX interface 
from std.xml and it quite nicely fits my needs.


--
/Jacob Carlborg


Re: dxml 0.2.0 released

2018-02-12 Thread Nick Sabalausky (Abscissa) via Digitalmars-d-announce

On 02/12/2018 10:49 PM, Jonathan M Davis wrote:


Andrei used to complain periodically about how large std.datetime was,
thinking that it was way too much code, and then someone actually went to
the effort of stripping out all of the comments and unit tests and whatnot
to count the actual lines of code in the implementation, and it was a _way_
smaller number than the lines in the file (IIRC, it might have even been
something like only 10% of the file, if that). That's what happens when you
write documentation and unit tests that are thorough.



Yea, totally. Another example: mysql-native used to be one (!!) source 
file. It was maybe a bit on the large size for a single module, but it 
was still workable. In the last several years, that library has grown 
many times its old size. But now, I'd say that easily the majority of 
lines are either comments or tests. The *actual* implementation and API 
isn't really all that much more LOC than it used to be. The original 
one-module version, by contrast, was less documented and had...I don't 
think it even had a single test (IIRC, the 
now-old-and-probably-bitrotted "app.d" wasn't even there.)


Re: dxml 0.2.0 released

2018-02-12 Thread Jonathan M Davis via Digitalmars-d-announce
On Monday, February 12, 2018 21:53:21 Nick Sabalausky (Abscissa) via
Digitalmars-d-announce wrote:
> On 02/12/2018 11:15 AM, rikki cattermole wrote:
> > dxml 7.5k LOC
> > std.xml 3k LOC
> >
> > dxml would make the situation a lot worse.
>
> 4.5k LOC == "a lot worse"?
>
> Uuuuhhh...WAT?

There is sometimes a tendency for folks to think that something having a lot
of lines of code is bad, and there can be some truth to that. If something
can be done in a simpler way, it tends to be shorter and easier to maintain,
but shorter isn't always better, and simpler isn't always better -
especially if that complexity is needed to get the job done. So, LOC tells
you something, but what it really tells you is up for debate.

And actually, well-written D code is going to have a much higher line count
in general because of stuff like documentation and unit tests being in the
source file. In this case, while std.xml does seem to have a fair bit of
documentation, it has very little in the way of unit tests, whereas dxml has
fairly thorough unit tests - maybe not quite as extreme as std.datetime, but
I do tend to be thorough with unit tests.

Andrei used to complain periodically about how large std.datetime was,
thinking that it was way too much code, and then someone actually went to
the effort of stripping out all of the comments and unit tests and whatnot
to count the actual lines of code in the implementation, and it was a _way_
smaller number than the lines in the file (IIRC, it might have even been
something like only 10% of the file, if that). That's what happens when you
write documentation and unit tests that are thorough.

- Jonathan M Davis



Re: dxml 0.2.0 released

2018-02-12 Thread Nick Sabalausky (Abscissa) via Digitalmars-d-announce

On 02/12/2018 11:15 AM, rikki cattermole wrote:


dxml 7.5k LOC
std.xml 3k LOC

dxml would make the situation a lot worse.


4.5k LOC == "a lot worse"?

Uuuuhhh...WAT?


Re: dxml 0.2.0 released

2018-02-12 Thread Nick Sabalausky (Abscissa) via Digitalmars-d-announce

On 02/12/2018 05:02 PM, H. S. Teoh wrote:

On Mon, Feb 12, 2018 at 02:54:48PM +0000, rikki cattermole via 
Digitalmars-d-announce wrote:
[...]

Everything you have mentioned is not in Phobos. Just because something
is 'good enough' does not make it 'good enough' for Phobos. In the
words of Andrei "Good enough is not good enough", we need to aim
higher to show what we actually can do.


And thus Phobos continues to let the perfect be the enemy of the good,
and 10 years later std.xml will still be around, and we will still be
arguing over how to replace it.


+Several billion.

Like the improved assert messages we could have had many years ago. 
They were implemented, done and ready to go, but were instead thrown 
away because...(and here's the real kicker, considering current D 
climate)...because it was a fully in-library solution instead of a new 
compiler feature. Go figure ::eyeroll::



Seriously, I would have thought something like this would be obvious to
programmers of the calibre found on these forums.  I'm a little
astonished that this would even be such a point of contention in the
first place, since the solution is so simple.


I would've expected so too, if it weren't that one of the top favorite 
activities 'round these parts is nitpicking reasonable ideas to death 
for stupid reasons. And, generally letting the perfect be the enemy of 
the good.


Re: dxml 0.2.0 released

2018-02-12 Thread Jonathan M Davis via Digitalmars-d-announce
On Monday, February 12, 2018 21:26:45 Johannes Loher via Digitalmars-d-
announce wrote:
> On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis
>
> wrote:
> > dxml 0.2.0 has now been released.
> > [...]
>
> Thank you very much for your efforts, I really appreciate it, as
> I have been looking for a decent xml library for quite some time.
>
> Whether or not this is a candidate for inclusion into phobos is
> certainly up for debate, but as you already mentioned several
> times, this thread is hardly the right place for that.
>
> So instead I'd like to emphasize how much I appreciate you
> working on this and I am sure I am not the only one. This absence
> of a usable high quality xml library is/was a big problem for d
> in my opinion and it is great to see that this is finally being
> worked on :)

Thanks. When you do use it, please give feedback - particularly if you find
any problems or pain points. I definitely think that the API is solid
overall, but that doesn't mean that I got it completely right, and even with
all of the tests that I have, I could have missed something and ended up
with a bug in the parser. I'm reasonably confident in the code quality, but
that doesn't mean that I didn't miss anything.

- Jonathan M Davis



Re: dxml 0.2.0 released

2018-02-12 Thread Jonathan M Davis via Digitalmars-d-announce
On Monday, February 12, 2018 13:51:56 H. S. Teoh via Digitalmars-d-announce 
wrote:
> For example, entity
> support pretty much means plain slices are no longer an option, because
> you have to perform substitution of entity definitions, so you'll have
> to either wrap it in some kind of lazy range that chains the entity
> definition to the surrounding text, or you'll have to use strings or
> something else.  Which means you'll need to have memory allocation /
> slower parsing / whatever, but that's the price of DTD support.

Which was my point. The API as-is doesn't work with DTD support for those
very reasons.

> But again, the point is, basic XML parsing (without DTD support) doesn't
> *need* to pay this price. What's currently in dxml doesn't need to
> change. DTD support can be implemented in a submodule / separate module
> that wraps around dxml and builds DTD support on top of it.
>
> Put another way, we can implement DTD support *on top of* dxml this way:
> - Parse the XML using dxml as an initial step (this can be done lazily,
>   or semi-lazily, as needed).
> - As an intermediate step, parse the DTD section, construct whatever
>   internal state is needed to handle DTD rules, a dictionary of entity
>   references, etc..
> - Filter the output of dxml to insert whatever extra behaviour is needed
>   to implement DTD support before handing it to the calling code, e.g.,
>   expand entity references, or implement validation and throw an
>   exception if validation fails, etc..
>
> *We don't need to change dxml's current API at all.*

I don't think that this works, because the entity references insert new XML
and thus affect the parsing. And as such, you can't simply pass through the
entity references to be processed by another parser. They need to be handled
by the core parser, otherwise it's going to give incorrect results, not just
results that need further parsing. I'm sure that dxml's internals could be
refactored so that they could be shared with another parser that did that,
but unless I'm misunderstanding how entity references work, you can't use
what's there now as-is and build another parser on top of it. The entity
reference replacement needs to happen in the core parser.
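
A toy illustration of that concern, under loud assumptions: the expansion below is a plain textual substitution and the "tag count" is just a count of `<` characters, so neither resembles a real parser. It only shows that the markup a parser has to deal with can differ before and after an entity is replaced. The names `expand` and `tagCount` are invented for this sketch.

```d
import std.algorithm.searching : count;
import std.array : replace;

/// Naive textual expansion of a single entity reference (illustration only;
/// a real parser would do this during parsing, with well-formedness checks).
string expand(string xml, string name, string replacement)
{
    return xml.replace("&" ~ name ~ ";", replacement);
}

/// Counts '<' characters as a crude stand-in for "number of tags seen".
size_t tagCount(string xml)
{
    return xml.count('<');
}

unittest
{
    enum doc = "<root>&item;</root>";

    // Before expansion the text contains one start tag and one end tag.
    assert(tagCount(doc) == 2);

    // After expansion the same document contains two more tags.
    auto expanded = expand(doc, "item", "<child>text</child>");
    assert(expanded == "<root><child>text</child></root>");
    assert(tagCount(expanded) == 4);
}
```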

> The DTD wrapper doesn't guarantee (and doesn't need to!) to return
> slices of the input like dxml does. I don't see that as a problem, since
> I can't see how anyone would be able to implement full DTD support with
> only slices, even independently from the way dxml is implemented right
> now.

Yeah, if I were writing a parser that handled the DTD section, I wouldn't
make it deal with slices of the input like dxml does unless I decided to make
it always return string, in which case, you could get slices of the original
input for strings but no other range types - it's either that or using a
lazy range, which would be worse if you passed strings but better for other
range types. And that's the main reason that I gave up on having dxml handle
the DTD section. I consider that approach unacceptable. One of the key goals
for dxml was that it would be providing slices of the input and not lazy
ranges or allocating new strings.

In any case, unless I misunderstand how entity references work, that would
have to be its own parser and not simply a wrapper around dxml because of
how the entity references affect the parsing. If I'm wrong, then great,
someone else can come along later and add some sort of DTD parser on top of
dxml, and if I'm right, well, then anyone who wants to do anything like that
is going to need to write a new parser, but that can then coexist alongside
dxml's parser just fine. Either way, I like dxml's approach and don't want
to compromise what it's doing in an attempt to fully deal with DTDs.

- Jonathan M Davis



Re: dxml 0.2.0 released

2018-02-12 Thread H. S. Teoh via Digitalmars-d-announce
On Mon, Feb 12, 2018 at 02:54:48PM +0000, rikki cattermole via 
Digitalmars-d-announce wrote:
[...]
> Everything you have mentioned is not in Phobos. Just because something
> is 'good enough' does not make it 'good enough' for Phobos. In the
> words of Andrei "Good enough is not good enough", we need to aim
> higher to show what we actually can do.

And thus Phobos continues to let the perfect be the enemy of the good,
and 10 years later std.xml will still be around, and we will still be
arguing over how to replace it.


> Personally I find J.M.D. arguments quite reasonable for a third-party
> library, since yes it does cover 90% of the use cases.

As I have just said in another post, dxml itself does not need to be
changed to implement DTD support.  It's perfectly possible to write a
wrapper on top of it that *does* implement DTD support.  In fact, I dare
say it might be possible to lazily switch from a thin wrapper over dxml
to full DTD mode, so that end users don't even need to care about the
difference if they don't care to.

As far as API is concerned, it could be as simple as something like:

import std.typecons : Flag, Yes;

auto parseXml(Flag!"dtdSupport" dtdSupport = Yes.dtdSupport, R)(R input)
    if (...)
{
    static if (dtdSupport)
        return dtdWrapper(dxmlParse(input));
    else
        return dxmlParse(input);
}

Then just note in the documentation that turning off DTD support would
provide extra features X, Y, and Z (speed, slices, whatever). Then let
the user choose.
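
A self-contained version of this compile-time switch, as a rough sketch only: `PlainEvents`, `DtdEvents` and the stub bodies of `dxmlParse` and `dtdWrapper` are placeholders invented for illustration, the template constraint is left out, and the input is fixed to `string` to keep it short.

```d
import std.typecons : Flag, No, Yes;

// Stand-ins for the two parsers; both types are hypothetical placeholders.
struct PlainEvents { string input; }
struct DtdEvents { PlainEvents inner; }

PlainEvents dxmlParse(string input) { return PlainEvents(input); }
DtdEvents dtdWrapper(PlainEvents events) { return DtdEvents(events); }

/// Selects the parser at compile time; the return type differs per branch.
auto parseXml(Flag!"dtdSupport" dtdSupport = Yes.dtdSupport)(string input)
{
    static if (dtdSupport)
        return dtdWrapper(dxmlParse(input));
    else
        return dxmlParse(input);
}

unittest
{
    // Default: the wrapped (DTD-aware) type comes back.
    static assert(is(typeof(parseXml("<a/>")) == DtdEvents));

    // Opting out yields the plain parser's type with no wrapper in between.
    static assert(is(typeof(parseXml!(No.dtdSupport)("<a/>")) == PlainEvents));
    assert(parseXml!(No.dtdSupport)("<a/>").input == "<a/>");
}
```

Because the choice is made with `static if`, code that opts out pays nothing for the wrapper, but the two modes also return different types, which callers would have to be prepared for.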

Seriously, I would have thought something like this would be obvious to
programmers of the calibre found on these forums.  I'm a little
astonished that this would even be such a point of contention in the
first place, since the solution is so simple.


T

-- 
Many open minds should be closed for repairs. -- K5 user


Re: dxml 0.2.0 released

2018-02-12 Thread H. S. Teoh via Digitalmars-d-announce
On Mon, Feb 12, 2018 at 09:50:16AM -0700, Jonathan M Davis via 
Digitalmars-d-announce wrote:
[...]
> The core problem is that entity references get replaced with more XML
> that needs to be parsed. So, they can't simply be passed on for
> post-processing.  As I understand it, they have to be replaced while
> the parsing is going on.  And that means that you can't do something
> like return slices of the original input that don't bother with the
> entity references and then have a separate parser take that and
> process it further to deal with the entity references. The first
> parser has to deal with them, and that means not returning slices of
> the original input unless you're dealing purely with strings and are
> willing to allocate new strings in the cases where the data needs to
> be mutated because of an entity reference.
[...]

I think you missed my point.

What I'm trying to say is, given the current functionality of dxml, one
*can* build an XML interface that implements DTD support.

Of course, some concessions obviously have to be made, such as needing
to allocate memory (I don't see how else one could keep a dictionary of
DTD rules / entity declarations otherwise, for example), or not being
able to return only slices of the input anymore.  For example, entity
support pretty much means plain slices are no longer an option, because
you have to perform substitution of entity definitions, so you'll have
to either wrap it in some kind of lazy range that chains the entity
definition to the surrounding text, or you'll have to use strings or
something else.  Which means you'll need to have memory allocation /
slower parsing / whatever, but that's the price of DTD support.

But again, the point is, basic XML parsing (without DTD support) doesn't
*need* to pay this price. What's currently in dxml doesn't need to
change. DTD support can be implemented in a submodule / separate module
that wraps around dxml and builds DTD support on top of it.

Put another way, we can implement DTD support *on top of* dxml this way:
- Parse the XML using dxml as an initial step (this can be done lazily,
  or semi-lazily, as needed).
- As an intermediate step, parse the DTD section, construct whatever
  internal state is needed to handle DTD rules, a dictionary of entity
  references, etc..
- Filter the output of dxml to insert whatever extra behaviour is needed
  to implement DTD support before handing it to the calling code, e.g.,
  expand entity references, or implement validation and throw an
  exception if validation fails, etc..
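
The filter step in the list above can be sketched roughly as follows. Everything here is an assumption made for illustration: `Event` and `Kind` are stand-ins rather than dxml's types, the entity table is supplied by hand instead of being parsed from a DTD, and replacement text is treated as plain characters, so an entity whose value contains markup is not handled.

```d
import std.algorithm.iteration : map;
import std.array : appender, array;
import std.exception : enforce;
import std.string : indexOf;

enum Kind { elementStart, text, elementEnd }

/// Stand-in for a parser event; the real dxml types are not used here.
struct Event { Kind kind; string payload; }

/// Replaces every "&name;" in `text` using `entities`; throws on unknown names.
string expandEntities(string text, string[string] entities)
{
    auto result = appender!string();
    while (true)
    {
        immutable amp = text.indexOf('&');
        if (amp < 0)
            break;
        immutable len = text[amp .. $].indexOf(';');
        enforce(len > 0, "unterminated entity reference");
        immutable name = text[amp + 1 .. amp + len];
        auto value = name in entities;
        enforce(value !is null, "undeclared entity: " ~ name);
        result.put(text[0 .. amp]);
        result.put(*value);
        text = text[amp + len + 1 .. $];
    }
    result.put(text);
    return result.data;
}

/// Filter step: forwards events lazily, expanding entities in text events only.
auto withEntities(Event[] events, string[string] entities)
{
    return events.map!(e => e.kind == Kind.text
        ? Event(e.kind, expandEntities(e.payload, entities))
        : e);
}

unittest
{
    auto entities = ["who": "world"];
    auto events = [Event(Kind.elementStart, "p"),
                   Event(Kind.text, "hello &who;"),
                   Event(Kind.elementEnd, "p")];
    auto result = withEntities(events, entities).array;
    assert(result[0] == events[0]);
    assert(result[1].payload == "hello world");
}
```

Note that the expanded payload is a newly allocated string, which matches the concession that such a wrapper cannot promise slices of the input.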

*We don't need to change dxml's current API at all.*

At the most, I anticipate that the only potential change needed is to
expose an interface to parse XML fragments (i.e., not a complete XML
document that contains an outer root tag, but just some PCDATA that may
contain entities or tags) so that the DTD support wrapper can use it to
expand entities and insert any tags that may appear inside the entity
definition.

The DTD wrapper doesn't guarantee (and doesn't need to!) to return
slices of the input like dxml does. I don't see that as a problem, since
I can't see how anyone would be able to implement full DTD support with
only slices, even independently from the way dxml is implemented right
now.

We can even design the DTD support wrapper to start with being just a
thin wrapper around dxml, and lazily switch to full DTD mode only if a
DTD section is encountered.  Then user code that doesn't care to use
dxml's raw API won't even need to care about the difference.


T

-- 
Curiosity kills the cat. Moral: don't be the cat.


Re: dxml 0.2.0 released

2018-02-12 Thread Johannes Loher via Digitalmars-d-announce
On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis 
wrote:

dxml 0.2.0 has now been released.
[...]


Thank you very much for your efforts, I really appreciate it, as 
I have been looking for a decent xml library for quite some time.


Whether or not this is a candidate for inclusion into phobos is 
certainly up for debate, but as you already mentioned several 
times, this thread is hardly the right place for that.


So instead I'd like to emphasize how much I appreciate you 
working on this and I am sure I am not the only one. This absence 
of a usable high quality xml library is/was a big problem for d 
in my opinion and it is great to see that this is finally being 
worked on :)


Re: dxml 0.2.0 released

2018-02-12 Thread Chris via Digitalmars-d-announce

On Monday, 12 February 2018 at 19:47:09 UTC, Jacob Carlborg wrote:

On 2018-02-12 17:49, Chris wrote:

How could it possibly make the situation any worse than it is 
now? Atm,
nobody will ever use std.xml, because it is sub-standard and 
has no future.


I'm using std.xml in a new project right now. It's a really 
small private project that just needs to extract some data from 
an XML document. I started it a couple of days before dxml was 
announced.


A few lines of code that could be replaced easily once something 
better is available? But who will start an important commercial 
project with std.xml when it says in red letters:


"Warning: This module is considered out-dated and not up to 
Phobos' current standards. It will remain until we have a 
suitable replacement, but be aware that it will not remain long 
term."


I for my part wouldn't and I'm glad there's dxml now.




Re: dxml 0.2.0 released

2018-02-12 Thread Jacob Carlborg via Digitalmars-d-announce

On 2018-02-12 17:49, Chris wrote:


How could it possibly make the situation any worse than it is now? Atm,
nobody will ever use std.xml, because it is sub-standard and has no future.


I'm using std.xml in a new project right now. It's a really small 
private project that just needs to extract some data from an XML 
document. I started it a couple of days before dxml was announced.


--
/Jacob Carlborg


Re: dxml 0.2.0 released

2018-02-12 Thread rikki cattermole via Digitalmars-d-announce

On 12/02/2018 3:59 PM, H. S. Teoh wrote:

If std.xml currently does not support DTDs, then I say dxml is
definitely a Phobos candidate.  At the very least, it does not make the
current situation worse.  Rejecting dxml because it doesn't support DTDs
is basically letting the perfect be the enemy of the good, which is
something this community has been plagued with for far too long.  What's
worse: a std.dxml that doesn't support DTDs, or a std.xml with
fundamental problems that continue to plague us for the next decade
while nobody else steps up to implement a suitable replacement?


dxml 7.5k LOC
std.xml 3k LOC

dxml would make the situation a lot worse.


Re: dxml 0.2.0 released

2018-02-12 Thread rikki cattermole via Digitalmars-d-announce

On 12/02/2018 3:50 PM, Jonathan M Davis wrote:

In any case, I'm going to finish implementing dxml without any kind of DTD
support and then see how things go as far as the Phobos review process goes.
If dxml gets rejected, because the majority of folks think that we're better
off with std.xml (or no xml parser at all in Phobos) than one that doesn't
have DTD support, then oh well. That sucks, but anyone who wants dxml can
then use it as a 3rd party library. I think that the D community would be
worse off because of that, but it's not ultimately my decision to make, and
either way, I have the parser that I need.


We are definitely not better off with just std.xml currently.

The problem comes from the word currently. By going into Phobos even if 
experimental, it's going to be around for a while in some form or 
another. So we need to invest a decent amount of time into not creating 
more problems for new users expecting the world and not getting it.


If somebody (say a student?) were to write up a proper API and use dxml 
as a basis for a simpler parser, now that could be a worthwhile project 
and definitely could go into Phobos.


I may even consider doing it at some point in the future.


Re: dxml 0.2.0 released

2018-02-12 Thread H. S. Teoh via Digitalmars-d-announce
On Mon, Feb 12, 2018 at 07:04:38AM -0700, Jonathan M Davis via 
Digitalmars-d-announce wrote:
[...]
> However, if folks as a whole think that Phobos' xml parser needs to
> support the DTD section to be acceptable, then dxml won't replace
> std.xml, because dxml is not going to implement DTD support. DTD
> support fundamentally does not fit in with dxml's design.

Actually, thinking about this, I'm wondering if a combination of
preprocessing and/or postprocessing might make it possible to implement
DTD support without needing to rewrite the guts of dxml. AIUI, dxml does
parse the DTD section correctly, i.e., as an XML directive, but simply
doesn't look into its internal details. So one way to implement DTD
support might be:

- Write an auxiliary parser that's basically a wrapper around dxml,
  forwarding XML events to the caller, except:
- If a DTD event is encountered, eagerly parse it, store DTD
  declarations internally for future reference.
- If there's a DTD that has been seen, perform on-the-fly validation as
  XML events are forwarded.
- In PCDATA sections, if there are entity references to the DTD, expand
  them, possibly inserting more XML events into the stream based on
  what's defined in the DTD. (This may need to reuse some dxml internals
  to parse XML snippets that might be contained in an entity definition,
  for example.)


[...]
> However, std.xml does not support the DTD section, and glancing over
> it, it doesn't look like it even handles skipping the DTD section
> properly (it doesn't handle the fact that '>' can appear within quoted
> sections within the DTD). So, dxml is not worse than std.xml in that
> regard, and we wouldn't lose any functionality by having dxml replace
> std.xml. It just wouldn't necessarily do as much as some folks might
> like.
[...]
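
For the point about '>' inside quoted sections, a simplified sketch of skipping a DOCTYPE declaration might look like this. `doctypeEnd` is a hypothetical helper, not code from std.xml or dxml, and it deliberately ignores comments and processing instructions inside the DTD.

```d
/// Returns the index just past the '>' that closes a "<!DOCTYPE" declaration
/// starting at xml[0], or -1 if it is unterminated. Quoted literals and the
/// bracketed internal subset are skipped so a '>' inside them is ignored.
/// Simplified sketch: comments and processing instructions are not handled.
ptrdiff_t doctypeEnd(string xml)
{
    char quote = 0;       // current quote character, or 0 when outside quotes
    int bracketDepth = 0; // > 0 while inside the [ ... ] internal subset

    foreach (i, char c; xml)
    {
        if (quote)
        {
            if (c == quote)
                quote = 0;
        }
        else if (c == '"' || c == '\'')
            quote = c;
        else if (c == '[')
            ++bracketDepth;
        else if (c == ']')
            --bracketDepth;
        else if (c == '>' && bracketDepth == 0)
            return i + 1;
    }
    return -1;
}

unittest
{
    enum doc = `<!DOCTYPE r [<!ENTITY e "a>b">]><r/>`;
    assert(doc[doctypeEnd(doc) .. $] == "<r/>");
    assert(doctypeEnd(`<!DOCTYPE r "unterminated`) == -1);
}
```

Stopping at the first '>' instead would cut this declaration off inside the quoted `"a>b"` value.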

If std.xml currently does not support DTDs, then I say dxml is
definitely a Phobos candidate.  At the very least, it does not make the
current situation worse.  Rejecting dxml because it doesn't support DTDs
is basically letting the perfect be the enemy of the good, which is
something this community has been plagued with for far too long.  What's
worse: a std.dxml that doesn't support DTDs, or a std.xml with
fundamental problems that continue to plague us for the next decade
while nobody else steps up to implement a suitable replacement?


T

-- 
Ph.D. = Permanent head Damage


Re: dxml 0.2.0 released

2018-02-12 Thread Jonathan M Davis via Digitalmars-d-announce
On Monday, February 12, 2018 15:45:50 bachmeier via Digitalmars-d-announce 
wrote:
> On Monday, 12 February 2018 at 15:43:59 UTC, bachmeier wrote:
> > On Monday, 12 February 2018 at 14:04:38 UTC, Jonathan M Davis
> >
> > wrote:
> >> However, if folks as a whole think that Phobos' xml parser
> >> needs to support the DTD section to be acceptable, then dxml
> >> won't replace std.xml, because dxml is not going to implement
> >> DTD support. DTD support fundamentally does not fit in with
> >> dxml's design.
> >
> > Can't you simply give it a name other than std.xml that
> > indicates it doesn't do everything related to xml? It doesn't
> > make sense to not put it into Phobos because of the name, and
> > that should be an easy problem to solve.
>
> Hit send too fast. std.xml.base would be reasonable.

I have no interest in bikeshedding the name right now or even really arguing
about Phobos inclusion (I've already said more in this thread about that
than I probably should have). That can be left up to the review process,
which already tends to be nasty enough that it wouldn't surprise me at all
if dxml doesn't get accepted. The only reason that I have any plans to try
for Phobos inclusion with dxml is because std.xml needs to be replaced. If
Phobos didn't have an XML parser already, I don't expect that I'd bother,
since I don't think that it's all that important that a standard library
have an XML parser. I just think that it's important that it not have a
bad one. In general, I think that XML is the sort of thing that's perfectly
fine as a 3rd party solution.

- Jonathan M Davis



Re: dxml 0.2.0 released

2018-02-12 Thread Jonathan M Davis via Digitalmars-d-announce
On Monday, February 12, 2018 15:26:24 rikki cattermole via Digitalmars-d-
announce wrote:
> All J.M.D. has to do to change this, is make the API match the spec (as
> close as possible, without writing another parser) and separate out the
> implementation into a different and very clear module (probably a sub
> package) which states clearly that it is a subset with the full grammar
> listed that it supports.

That literally cannot be done. dxml returns slices (or takeExactly's) of the
original input. For it to do otherwise would harm performance and usability,
but in order to implement full DTD support, it's impossible to return slices
of the original input in the general case, because you have to be able to
mutate the data whenever entity references get involved. If the API were
entirely string-based, then whether the implementation returned slices or
newly allocated strings could be an implementation detail, but as soon as
you're dealing with arbitrary ranges of characters, that doesn't work. At
that point, you're forced to either return strings for everything (which
means allocating for any ranges that aren't strings) or to return a lazy
range of characters and thus can't return the original type. And that means
that if you pass it a string, you're stuck with a lazy range out the other
end instead of a string, and to get a string again, you have to allocate,
whereas with what I have now, the parser does almost no allocations, and as
long as the input type supports slicing, you get exactly the same type out
the other end, which is a huge usability improvement IMHO.
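
The type difference described here can be shown with plain Phobos, as a small sketch that does not use dxml at all. `isSliceOf` is a helper invented for the illustration.

```d
import std.algorithm.comparison : equal;
import std.array : array;
import std.conv : to;
import std.range : chain;

/// True if `part` lies inside the memory of `whole`, i.e. it is a slice of
/// it rather than a copy.
bool isSliceOf(string part, string whole)
{
    return part.ptr >= whole.ptr
        && part.ptr + part.length <= whole.ptr + whole.length;
}

unittest
{
    string input = "<tag>text</tag>";

    // Slicing keeps the type and points into the original buffer.
    auto slice = input[5 .. 9];
    static assert(is(typeof(slice) == string));
    assert(slice == "text" && isSliceOf(slice, input));

    // Splicing replacement text in lazily yields a different range type.
    auto spliced = chain(input[0 .. 5], "more ", input[5 .. $]);
    static assert(!is(typeof(spliced) == string));
    assert(spliced.equal("<tag>more text</tag>"));

    // Turning that back into a string means allocating a copy.
    string copy = spliced.array.to!string;
    assert(copy == "<tag>more text</tag>" && !isSliceOf(copy, input));
}
```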

So, you can't have DTD support with the kind of API that dxml has, and
changing the API to something that could work with DTD support would harm
the parser for all of the cases where DTD support is unnecessary.

Even if I were going to implement full DTD support, I would do it with
another parser, not change the parser that dxml already has. And if dxml
ends up in Phobos with the parser that it has, that doesn't prevent another
parser from being added for the DTD case later if someone actually decides
to put in the time and effort to do it. Either way, for any XML document
that doesn't need DTD support, the way that dxml does things is more
efficient and user-friendly than one that had DTD support would be, much as
that obviously doesn't cut it for those documents that do need DTD support.

In any case, I'm going to finish implementing dxml without any kind of DTD
support and then see how things go as far as the Phobos review process goes.
If dxml gets rejected, because the majority of folks think that we're better
off with std.xml (or no xml parser at all in Phobos) than one that doesn't
have DTD support, then oh well. That sucks, but anyone who wants dxml can
then use it as a 3rd party library. I think that the D community would be
worse off because of that, but it's not ultimately my decision to make, and
either way, I have the parser that I need.

- Jonathan M Davis



Re: dxml 0.2.0 released

2018-02-12 Thread bachmeier via Digitalmars-d-announce

On Monday, 12 February 2018 at 15:43:59 UTC, bachmeier wrote:
On Monday, 12 February 2018 at 14:04:38 UTC, Jonathan M Davis 
wrote:


However, if folks as a whole think that Phobos' xml parser 
needs to support the DTD section to be acceptable, then dxml 
won't replace std.xml, because dxml is not going to implement 
DTD support. DTD support fundamentally does not fit in with 
dxml's design.


Can't you simply give it a name other than std.xml that 
indicates it doesn't do everything related to xml? It doesn't 
make sense to not put it into Phobos because of the name, and 
that should be an easy problem to solve.


Hit send too fast. std.xml.base would be reasonable.


Re: dxml 0.2.0 released

2018-02-12 Thread bachmeier via Digitalmars-d-announce
On Monday, 12 February 2018 at 14:04:38 UTC, Jonathan M Davis 
wrote:


However, if folks as a whole think that Phobos' xml parser 
needs to support the DTD section to be acceptable, then dxml 
won't replace std.xml, because dxml is not going to implement 
DTD support. DTD support fundamentally does not fit in with 
dxml's design.


Can't you simply give it a name other than std.xml that indicates 
it doesn't do everything related to xml? It doesn't make sense to 
not put it into Phobos because of the name, and that should be an 
easy problem to solve.


Re: dxml 0.2.0 released

2018-02-12 Thread rikki cattermole via Digitalmars-d-announce

On 12/02/2018 3:08 PM, Adam D. Ruppe wrote:

On Monday, 12 February 2018 at 14:54:48 UTC, rikki cattermole wrote:
Just because something is 'good enough' does not make it 'good enough' 
for Phobos. In the words of Andrei "Good enough is not good enough", 
we need to aim higher to show what we actually can do.


About 5 years ago (I think, I actually have the link on my other 
computer but it is 2,000 miles away right now), Andrei said something 
along the lines of "without the review process, we get junk like std.json".


Ironically, that same review process may be why we still have such 
"junk". (actually personally, I don't hate std.json).


If std.xml is really so bad and has been for so long, surely we ought to 
take an opportunity to change that, even if the change isn't perfect.


It depends.

The implementation does not need to be perfect or full-fledged to go
into experimental.


But if, at the start of the review process, it is already well known that
the public API would require a complete change to accommodate the
intended goal, that is unacceptable.


Take std.experimental.allocators as an example. It currently is going 
through a massive API change, but when it first got PR'd, did we know 
that we should be RC'ing allocators? No of course not, otherwise we'd 
have done it.


At this point in time I cannot in good faith say that dxml serves to
represent the XML specification in full for the D community. This is
unfortunately not about bikeshedding.


It is one thing to bikeshed features, but when the scope does not match
the intended goal, we have got to be careful about what goes into Phobos.


All J.M.D. has to do to change this is make the API match the spec (as
closely as possible, without writing another parser) and separate out the
implementation into a different and very clear module (probably a
subpackage) which states clearly that it is a subset, with the full
grammar that it supports listed.


That way everybody is clear and we can later on get a full 
implementation as part of taking it out of experimental :)


Re: dxml 0.2.0 released

2018-02-12 Thread Adam D. Ruppe via Digitalmars-d-announce
On Monday, 12 February 2018 at 14:54:48 UTC, rikki cattermole 
wrote:
> Just because something is 'good enough' does not make it 'good
> enough' for Phobos. In the words of Andrei "Good enough is not
> good enough", we need to aim higher to show what we actually
> can do.


About 5 years ago (I think, I actually have the link on my other 
computer but it is 2,000 miles away right now), Andrei said 
something along the lines of "without the review process, we get 
junk like std.json".


Ironically, that same review process may be why we still have 
such "junk". (actually personally, I don't hate std.json).


If std.xml is really so bad and has been for so long, surely we 
ought to take an opportunity to change that, even if the change 
isn't perfect.


Re: dxml 0.2.0 released

2018-02-12 Thread Adam D. Ruppe via Digitalmars-d-announce
On Monday, 12 February 2018 at 14:04:38 UTC, Jonathan M Davis 
wrote:
> XML parsers are one of those things that everyone seems to want
> and no one seems to want to work on.


I wrote one 8 years ago... though mine is more focused on HTML 
parsing, and the XML aspect is just a side effect!


Re: dxml 0.2.0 released

2018-02-12 Thread rikki cattermole via Digitalmars-d-announce

On 12/02/2018 2:45 PM, Chris wrote:
> On Monday, 12 February 2018 at 14:04:38 UTC, Jonathan M Davis wrote:
> > On Monday, February 12, 2018 12:38:51 Chris via Digitalmars-d-announce
> > wrote:
> > > On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis
> >
> > However, std.xml does not support the DTD section, and glancing over
> > it, it doesn't look like it even handles skipping the DTD section
> > properly (it doesn't handle the fact that '>' can appear within quoted
> > sections within the DTD). So, dxml is not worse than std.xml in that
> > regard, and we wouldn't lose any functionality by having dxml replace
> > std.xml. It just wouldn't necessarily do as much as some folks might
> > like.
>
> I thought the same when I glanced over std.xml. There's no DTD support
> there either and I don't think it would be a deal breaker for most users.
>
> > My guess is that DTD support won't be a deal breaker given that
> > std.xml doesn't support it, that std.xml has needed to be replaced for
> > years now, and that no one else is working on replacing it, but I
> > don't know. Disagreements over what should be done with std.json's
> > replacement have meant that it has never been replaced even though
> > significant work was done towards replacing it, so unfortunately,
> > there's already precedent for a module not being replaced with
> > something better due to disagreements over what the replacement would
> > ideally be. So, I don't know.
> >
> > - Jonathan M Davis
>
> Wasn't there a replacement module that never got past the initial review
> steps? Some GSoC thing or so. But I wonder if that module would be up to
> the latest D standards.


https://github.com/dlang-community/experimental.xml

Code isn't great, and not complete yet.
Author has just disappeared sadly.

> While one may argue that DTD support is important, I would rather have
> something fast and simple like dxml that covers, say, 90% of the cases
> than nothing. It doesn't make sense to me that we should accept the
> current situation, only because of some bikeshedding that concerns 10%
> of the use cases. After all, it's only a module not a fundamental
> decision that concerns the direction D will take in the future. I think
> stuff like that can seriously turn off potential users. A lot of useful
> things begin with one person deciding to give it a go. vibe.d, dub,
> DScanner and DlangUI, for example. If the creators had started
> bikeshedding before writing the first line of code, there would still be
> a flamewar about the best way to go about it - and nothing would have
> happened.


Everything you have mentioned is not in Phobos. Just because something 
is 'good enough' does not make it 'good enough' for Phobos. In the words 
of Andrei "Good enough is not good enough", we need to aim higher to 
show what we actually can do.


Personally, I find J.M.D.'s arguments quite reasonable for a third-party
library, since yes, it does cover 90% of the use cases.


Re: dxml 0.2.0 released

2018-02-12 Thread Chris via Digitalmars-d-announce
On Monday, 12 February 2018 at 14:04:38 UTC, Jonathan M Davis 
wrote:
> On Monday, February 12, 2018 12:38:51 Chris via
> Digitalmars-d-announce wrote:
> > On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis
>
> However, std.xml does not support the DTD section, and glancing
> over it, it doesn't look like it even handles skipping the DTD
> section properly (it doesn't handle the fact that '>' can
> appear within quoted sections within the DTD). So, dxml is not
> worse than std.xml in that regard, and we wouldn't lose any
> functionality by having dxml replace std.xml. It just wouldn't
> necessarily do as much as some folks might like.

I thought the same when I glanced over std.xml. There's no DTD
support there either and I don't think it would be a deal breaker
for most users.

> My guess is that DTD support won't be a deal breaker given that
> std.xml doesn't support it, that std.xml has needed to be
> replaced for years now, and that no one else is working on
> replacing it, but I don't know. Disagreements over what should
> be done with std.json's replacement have meant that it has never
> been replaced even though significant work was done towards
> replacing it, so unfortunately, there's already precedent for
> a module not being replaced with something better due to
> disagreements over what the replacement would ideally be. So, I
> don't know.
>
> - Jonathan M Davis


Wasn't there a replacement module that never got past the initial 
review steps? Some GSoC thing or so. But I wonder if that module 
would be up to the latest D standards.


While one may argue that DTD support is important, I would rather 
have something fast and simple like dxml that covers, say, 90% of 
the cases than nothing. It doesn't make sense to me that we 
should accept the current situation, only because of some 
bikeshedding that concerns 10% of the use cases. After all, it's 
only a module not a fundamental decision that concerns the 
direction D will take in the future. I think stuff like that can 
seriously turn off potential users. A lot of useful things begin 
with one person deciding to give it a go. vibe.d, dub, DScanner 
and DlangUI, for example. If the creators had started 
bikeshedding before writing the first line of code, there would 
still be a flamewar about the best way to go about it - and 
nothing would have happened.


Re: dxml 0.2.0 released

2018-02-12 Thread Jonathan M Davis via Digitalmars-d-announce
On Monday, February 12, 2018 12:38:51 Chris via Digitalmars-d-announce 
wrote:
> On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis wrote:
> > dxml 0.2.0 has now been released.
> >
> > I really wasn't planning on releasing anything this quickly
> > after announcing dxml, but when I went to start working on DOM
> > support, it turned out to be surprisingly quick and easy to
> > implement. So, dxml now has basic DOM support.
> >
> > [...]
>
> Will this replace `std.xml` one day?

Maybe. That depends on community feedback and ultimately on the Phobos
review process. Assuming that there's support for putting it through the
Phobos review process, then once I feel that it's complete enough and had
enough use to make it clear that I didn't miss something critical, then I'll
submit it for review.

What little feedback there has been thus far has been positive, but it would
be nice to get it battle-tested a bit, and there is still functionality that
I need to add.

Given that std.xml needs to be replaced, I think that it would be good if
dxml were able to do that, but that depends heavily on what others think of
what I've done and what they think Phobos' xml solution should look like.
But the way things are going, if dxml doesn't replace std.xml, I
don't know that anything ever will. XML parsers are one of those things that
everyone seems to want and no one seems to want to work on.

However, if folks as a whole think that Phobos' xml parser needs to support
the DTD section to be acceptable, then dxml won't replace std.xml, because
dxml is not going to implement DTD support. DTD support fundamentally does
not fit in with dxml's design. Someone would basically have to write an
entirely new parser to be able to handle it (some of dxml's internals could
be reused, but they'd also have to be refactored a fair bit, and a ton of
extra stuff would have to be added). Such a parser could theoretically
coexist with dxml's parser, since each would provide its own advantages, but
I have no plans to implement an XML parser to handle the DTD section. It's
simply not worth my time or effort, and this project has already taken way
more time and effort than I anticipated.

However, std.xml does not support the DTD section, and glancing over it, it
doesn't look like it even handles skipping the DTD section properly (it
doesn't handle the fact that '>' can appear within quoted sections within
the DTD). So, dxml is not worse than std.xml in that regard, and we wouldn't
lose any functionality by having dxml replace std.xml. It just wouldn't
necessarily do as much as some folks might like.
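
As an illustration of the quoting issue described above, here is a minimal sketch of skipping a DOCTYPE section without being fooled by a '>' inside a quoted literal. `skipDoctype` is a hypothetical helper written for this example; it is not code from std.xml or dxml, and it deliberately ignores comments and processing instructions inside the DTD, which a real parser would also have to handle.

```d
import std.stdio : writeln;

/// Returns the index just past the '>' that closes the <!DOCTYPE ...>
/// section beginning at xml[start], or -1 if the section never ends.
/// Hypothetical helper for illustration only.
ptrdiff_t skipDoctype(string xml, size_t start)
{
    char quote = 0;   // active quote character, or 0 when outside a literal
    int brackets = 0; // depth of the internal subset: [ ... ]

    foreach (i; start .. xml.length)
    {
        immutable c = xml[i];
        if (quote != 0)
        {
            // Inside a quoted literal, only the matching quote matters.
            if (c == quote)
                quote = 0;
        }
        else if (c == '"' || c == '\'')
            quote = c;
        else if (c == '[')
            ++brackets;
        else if (c == ']')
            --brackets;
        else if (c == '>' && brackets == 0)
            return i + 1; // the real end of the DOCTYPE section
    }
    return -1;
}

void main()
{
    // The first '>' sits inside a quoted entity value and must not end the scan.
    auto xml = `<!DOCTYPE doc [ <!ENTITY e "a > b"> ]><doc/>`;
    auto end = skipDoctype(xml, 0);
    assert(end != -1);
    writeln(xml[end .. $]);
}
```

A naive search for the first '>' would stop inside the entity value here, which is the failure mode described for std.xml.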

My guess is that DTD support won't be a deal breaker given that std.xml
doesn't support it, that std.xml has needed to be replaced for years now,
and that no one else is working on replacing it, but I don't know.
Disagreements over what should be done with std.json's replacement have meant
that it has never been replaced even though significant work was done
towards replacing it, so unfortunately, there's already precedent for a
module not being replaced with something better due to disagreements over
what the replacement would ideally be. So, I don't know.

- Jonathan M Davis



Re: dxml 0.2.0 released

2018-02-12 Thread Chris via Digitalmars-d-announce
On Monday, 12 February 2018 at 12:49:30 UTC, rikki cattermole 
wrote:

> On 12/02/2018 12:38 PM, Chris wrote:
> > On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis
> > wrote:
> > > dxml 0.2.0 has now been released.
> > >
> > > I really wasn't planning on releasing anything this quickly
> > > after announcing dxml, but when I went to start working on
> > > DOM support, it turned out to be surprisingly quick and easy
> > > to implement. So, dxml now has basic DOM support.
> > >
> > > [...]
> >
> > Will this replace `std.xml` one day?
>
> As long as DTD support is essentially non-existent, my vote
> will always be no.


How hard would it be to add DTD support? One could take dxml and 
extend it in order to include it in Phobos. I haven't used 
`std.xml` for years now. It is essentially dead and unusable atm.


Re: dxml 0.2.0 released

2018-02-12 Thread rikki cattermole via Digitalmars-d-announce

On 12/02/2018 12:38 PM, Chris wrote:
> On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis wrote:
> > dxml 0.2.0 has now been released.
> >
> > I really wasn't planning on releasing anything this quickly after
> > announcing dxml, but when I went to start working on DOM support, it
> > turned out to be surprisingly quick and easy to implement. So, dxml
> > now has basic DOM support.
> >
> > [...]
>
> Will this replace `std.xml` one day?


As long as DTD support is essentially non-existent, my vote will always 
be no.


Re: dxml 0.2.0 released

2018-02-12 Thread Chris via Digitalmars-d-announce
On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis 
wrote:

> dxml 0.2.0 has now been released.
>
> I really wasn't planning on releasing anything this quickly
> after announcing dxml, but when I went to start working on DOM
> support, it turned out to be surprisingly quick and easy to
> implement. So, dxml now has basic DOM support.
>
> [...]


Will this replace `std.xml` one day?


Re: dxml 0.2.0 released

2018-02-11 Thread Aravinda VK via Digitalmars-d-announce
On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis 
wrote:

> dxml 0.2.0 has now been released.
>
> I really wasn't planning on releasing anything this quickly
> after announcing dxml, but when I went to start working on DOM
> support, it turned out to be surprisingly quick and easy to
> implement. So, dxml now has basic DOM support.
>
> As part of that, it became clear that dxml.parser.stax should
> be renamed to dxml.parser, since it's really the only parser
> (DOM support involves just providing a way to hold the results
> of the parser, not any actual parsing, and that's clear from
> the API rather than being an implementation detail), and it
> makes for a shorter import path. So, I figured that I should do
> a release sooner rather than later to reduce how many folks the
> rename ends up affecting.
>
> For this release, dxml.parser.stax is now an empty, deprecated,
> module that publicly imports dxml.parser, but it will be
> removed in 0.3.0, whenever that is released. So, the few folks
> who grabbed the initial release won't end up with immediate
> code breakage if they upgrade.
>
> One nice side effect of how I implemented DOM support is that
> it's trivial to get the DOM for a portion of an XML document
> rather than the entire thing, since it will produce a DOMEntity
> from any point in an EntityRange.
>
> Documentation: http://jmdavisprog.com/docs/dxml/0.2.0/
> Github: https://github.com/jmdavis/dxml/tree/v0.2.0
> Dub: http://code.dlang.org/packages/dxml
>
> - Jonathan M Davis


Awesome. I just tried it as shown below, and it works. Thanks for this
library.


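import std.stdio;

import dxml.dom;

struct Record
{
    string name;
    string email;
}

// Turn every child element of the root into a Record.
Record[] parseRecords(string xml)
{
    Record[] records;
    auto d = parseDOM!simpleXML(xml);
    auto root = d.children[0]; // the document's root element

    foreach (record; root.children)
    {
        auto rec = Record();
        foreach (ele; record.children)
        {
            // Each field element is expected to hold a single text child.
            if (ele.name == "name")
                rec.name = ele.children[0].text;
            if (ele.name == "email")
                rec.email = ele.children[0].text;
        }
        records ~= rec;
    }

    return records;
}

void main()
{
    // The list archive stripped the markup from this string literal.
    // The <name> and <email> tags follow from parseRecords above; the
    // <root> and <record> element names are placeholders.
    auto xml = "<root>\n" ~
        "<record>\n" ~
        "<name>N1</name>\n" ~
        "<email>E1</email>\n" ~
        "</record>\n" ~
        "<record>\n" ~
        "<name>N2</name>\n" ~
        "<email>E2</email>\n" ~
        "</record>\n" ~
        "<record>\n" ~
        "<email>E3</email>\n" ~
        "<name>N3</name>\n" ~
        "</record>\n" ~
        "</root>\n" ~
        "";
    auto records = parseRecords(xml);
    writeln(records);
}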