Andrew Lentvorski wrote:
> Christopher Smith wrote:
>> Andrew Lentvorski wrote:
>>> First, parsers are *hard*. Every idiot CS major thinks he can write
>>> a parser for his "little language". They are all wrong.
>>
>> Perhaps, but writing a grammar for your "little language" and then
>> using a parser generator of some kind is not that difficult.
>
> Real-world results are against you here ... ;)
>
> I have yet to see a grammar specification for any interchange format
> I work with other than XML.
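For what it's worth, the whole write-a-grammar-then-a-parser exercise fits on a screen. A minimal sketch in Python for an invented `name = number ;` little language (the grammar, the token patterns, and the `parse` helper are all made up for illustration, not any real interchange format):

```python
import re

# Grammar for a hypothetical little language (invented for illustration):
#
#   file ::= stmt*
#   stmt ::= IDENT "=" NUMBER ";"
#
TOKENS = re.compile(r"[A-Za-z_]\w*|-?\d+(?:\.\d+)?|[=;]|\S")
IDENT = re.compile(r"[A-Za-z_]\w*")
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def parse(text):
    """Recursive-descent parser for the grammar above; loud on errors."""
    toks = TOKENS.findall(text) + ["<eof>"]
    i = 0

    def expect(pred, what):
        nonlocal i
        tok = toks[i]
        if not pred(tok):
            raise SyntaxError(f"expected {what}, got {tok!r} (token #{i})")
        i += 1
        return tok

    result = {}
    while toks[i] != "<eof>":
        name = expect(lambda t: IDENT.fullmatch(t), "identifier")
        expect(lambda t: t == "=", "'='")
        value = expect(lambda t: NUMBER.fullmatch(t), "number")
        expect(lambda t: t == ";", "';'")
        result[name] = float(value)
    return result

print(parse("width = 10; height = 2.5;"))
# -> {'width': 10.0, 'height': 2.5}
```

Malformed input fails loudly with a position attached, which is the whole point of bothering with a grammar in the first place.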
So you don't have any real-world results for writing a grammar....
Seriously, it's easy, and I'm sure you know that.

>> Usually when I write a parser, like with most code, it's a painful
>> iterative process where I write code that doesn't pass tests and
>> then fix it. What you are describing is a defect in a development
>> process that won't be fixed by using XML. The problem will just
>> be moved elsewhere.
>
> No, it doesn't necessarily get fixed. However, it allows those of us
> who *do* care to be able to cope with the idiocy.

I would argue you can cope with the idiocy regardless. However, it is
fair to say that XML does provide some buffers against certain types of
idiocy. That said, there are several other approaches which do a much
better job of buffering idiocy.

>>> XML *forces* these morons to have to interface with a
>>> structured, debugged parser.
>>
>> Ah, I see you've not yet encountered Perl hackers who use regexps
>> to extract data from XML and HTML documents. Trust me, there
>> is no "forcing" going on.
>
> Oh, yes, I have. They start with "just a quick hack"--and it works,
> mostly. It just needs a few hand tweaks. And then that script gets
> bigger and needs a couple more tweaks. Lather, rinse, repeat--until you
> have a totally unintelligible mess.

Yup. It's a joy to behold.... not.

> Actually, this is one of my interview questions for VLSI CAD tool
> administrators. I give them a structured file (slightly modified Spice
> simulator input deck) and ask them to write code to cope with it. Those
> who use regexes fail--they invariably have silent failure modes (*very*
> bad when your script may be a check which has to prevent a $1 million
> mistake).

That's odd; every regexp library I've seen has a "match" operation that
can and does fail when it doesn't get a match. That said, trying to use
a regexp to parse a file format is an incredible pain to get right.
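The "silent failure mode" distinction is worth making concrete: a filtering regex quietly drops lines it doesn't understand, while checking each match explicitly cannot fail silently. A sketch in Python (the line format and sample data are invented, loosely in the spirit of a Spice deck):

```python
import re

# Hypothetical check (format invented): every line should look like
#   R<name> <node1> <node2> <value>
lines = ["R1 n1 n2 100", "R2 n2 n3 1k0", "R3 n3 0 220"]  # "1k0" is bad

pattern = re.compile(r"R\w+ \w+ \w+ \d+")

# Silent mode: filtering on matches just drops the malformed line.
matched = [l for l in lines if pattern.fullmatch(l)]
print(len(matched))  # -> 2: the bad line vanished without a peep

# Loud mode: a failed match is an error, with a location attached.
for lineno, line in enumerate(lines, 1):
    if pattern.fullmatch(line) is None:
        print(f"line {lineno} is malformed: {line!r}")
```

Both versions use the same regex; the difference is entirely in whether the programmer treats "no match" as "skip" or as "stop".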
> Those who use a parsing library and build a tree or those who
> transform it to XML and use DOM/SAX pass.

Ugh... if they transform it into XML I'd be tempted to fail them.

>> First, not all XML parsers throw exceptions (indeed it's hard
>> to find a C parser that does ;-). Secondly, unless you are talking
>> about a failure during validation, SAX and DOM tend to fail for
>> the same reasons that read() and write() fail. If you are talking
>> about validation, you have a point, but unfortunately most folks
>> using SAX and DOM don't use a DTD or Schema, and therefore get
>> no validation, and so SAX and DOM primarily serve as glorified
>> lexers, rather than parsers. In many ways, programmers are
>> actually *less* likely to be forced to define a grammar than if
>> they were living in an XML-less world.
>
> Well, since my sample size for those who defined a grammar is *zero*, I
> am in the same boat either way.
>
> The difference is that with XML, *I* can create the DTD or Schema to do
> validation even if the original author doesn't. This allows me to trap
> badly formed incoming data *before* it has a chance to enter my system.

Regardless of whether the format is XML, you can write a grammar for
what you perceive to be this undocumented format and use it to validate
data. Unfortunately, much as with a DTD or Schema created in a black-box
scenario, you might end up with some spurious rejections.

>> It's beautiful on one hand and painful on the other. Certainly
>> this has a negative impact on parsing performance, which has
>> caused me no end of trouble--ironically to the point where I've
>> been forced to write my own parsers.
>
> Wow. So why was your parser so much faster and why couldn't the normal
> system do that?

Tons of reasons, probably the most obvious being that I could make
assumptions about the encoding of the data.
Other nice things in my parser included being able to have a DFA for all
the data structures, rather than one for just matching XML tagging
structures and then having to parse the individual elements of text
after SAX or DOM had already been through it.

>> Furthermore, Unicode is its own messy kettle of fish.
>
> Sure. Internationalization is *hard*. Unicode needed some very smart
> people working for quite a lot of years to produce the standard that we
> have.

...and like all committee standards it tends to be bloated and leave
everyone a little short-changed. There are lots of issues with Unicode
that exist only in pathological circumstances. When those circumstances
apply to your situation you are grateful for it, but at other times it
can be a pain. In general, I'm not suggesting Unicode is a bad thing,
but I think mandating Unicode perhaps exposes XML to several more bits
of complexity that could otherwise be avoided. I say this speaking as
someone who's had XML parsers die simply because of a meaningless
Unicode endian flag at the beginning of a document.

>> So you end up spending a lot of time
>> converting back and forth between your app's "native" string
>> library and that of your parser.
>
> Really? Ouch. I don't tend to hit that problem. However, I tend to
> only use languages that have native Unicode String types.

What languages are those? Certainly not C or C++ (C++'s std::string can
be templated so as to be able to hold Unicode values, but it really
can't be called a Unicode String type). JDK 1.5 finally has something
that resembles the Unicode standard, but all the JDK 1.4 and prior
compatible code (read: almost all the Java code out there) is still
laced with assumptions about characters being representable as 16-bit
values. Often folks still use ICU with Java to avoid various issues with
Java's implementation.
Perl 5.8 finally managed to get something that is workable (although it
still doesn't implement the full Unicode standard), but it tends to be
enough of a hack that you can find articles detailing the things that
can go wrong with Unicode and Perl 5.8. Python didn't get Unicode
support at all until 1.6, and as far as I know it still has issues with
it. I've had to move Unicode data back and forth between Oracle
databases, message-queuing software, and tools written in different
languages, and in my experience each transition from one tool to another
involved some lovely compatibility issues, often even when just staying
in the old character-set world would have made it pretty easy.

> If an XML parser made me convert away from that, I just wouldn't use it.

Hehe, welcome to my hell. ;-)

>> It's great and all to think about Unicode when you are writing
>> code, but it's a pain when you have to parse an 800GB file that's
>> all ASCII.
>
> Heh. For me, Unicode isn't the problem. Most of my data is numeric.
> Writing out 10^8 polygon coordinates in UTF-8 is where all the bloat is.
> I'll still take XML, though, so that the human readable *attributes* on
> those polygons are internationalized.

If the encoding is UTF-16, that doubles the bloat of your 10^8 polygon
coordinates above and beyond the already substantial pain of decimal
conversion (and I hope you have some means of specifying how all the
binary floating-point to decimal floating-point rounding errors are
handled, because XML sure isn't going to help you). If you use UTF-8,
then that potentially adds about 50% to the bloat of your attributes,
depending on where they are from. UTF-32 manages to make all of it suck
pretty badly. Then there's the joy of having to base64-encode any binary
data (great, now we have a non-human-readable human-readable format ;-),
unless you want to use some of the binary XML standards, which tend not
to work so well between different tools.
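The size arithmetic is easy to verify: ASCII digits cost one byte each in UTF-8 but two in UTF-16 and four in UTF-32, and base64 emits four output bytes for every three bytes of input. A quick check in Python (the sample values are invented):

```python
import base64

coord = "-12345.678901"  # a sample ASCII coordinate string
assert len(coord.encode("utf-8")) == len(coord)          # 1 byte/char
assert len(coord.encode("utf-16-le")) == 2 * len(coord)  # doubled
assert len(coord.encode("utf-32-le")) == 4 * len(coord)  # quadrupled

blob = bytes(range(240))          # stand-in for a binary payload
encoded = base64.b64encode(blob)
print(len(encoded) / len(blob))   # -> 1.3333... (4 bytes out per 3 in)
```

Scale the first line to 10^8 coordinates and the UTF-16 "attribute internationalization" costs you on the order of a gigabyte of pure encoding overhead.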
If done properly, it's possible to have a sufficiently large data set
that can't be rendered into a DOM on a 32-bit machine, but can be
rendered in a parse tree for a much more well-defined format.

>>> And herein lies the source of the XML verbosity that everybody
>>> complains about--balanced close tags.
>>
>> You will note that JSON also uses balanced tags. I guess the
>> closing tags are named, which is what you are getting at.
>
> Yes, sorry, that wasn't explicit. Named closing tags seem to be *the*
> feature that made XML take off. Even with smart editors, for some
> reason people never took to s-expressions but did take to XML. The Lisp
> folks whine about this continuously.

Actually, this aspect of XML is inherited from SGML. SGML failed to take
off because it wasn't simple enough, and really required a DTD to be
defined for your document before a tool could effectively use it.
Ironic, eh?

> I really think this is the key. The named closing tags guarantee a
> certain amount of error localization.

You obviously haven't worked with sufficiently idiotic people. ;-)

>> I'm sorry, but the only reason I can imagine for verbose closing
>> tags to help catch a syntax error would be if a human was
>> generating them, and even then primarily it'd be because they
>> made a typo when typing in the name of the closing tag,
>> which means you're getting a lot of otherwise unnecessary
>> syntax errors. Sure, making people type more means you catch
>> more syntactic errors, but that's not an improvement if those
>> errors are *caused* by them having to type more.
>
> Or, a buggy interchange generator. This is the bigger problem. Quite
> often one of the lesser-used code paths in the generator acquires a bug.
> Without named closing tags, this bug can slide past a lot of parsers as
> it occurs infrequently. Even then, it can be really hard to track down
> since it can occur a long way from where it was introduced.
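The error-localization claim can be demonstrated directly: a generator bug that emits a mismatched closing tag is rejected at the offending position, with a line and column attached, rather than silently reshaping the tree. A sketch with Python's standard-library parser (the document is invented):

```python
import xml.etree.ElementTree as ET

# A "buggy generator" hits a rare code path and emits a wrong close tag.
bad = ("<polygons>\n"
       "  <point>1,2</point>\n"
       "  <point>3,4</pont>\n"   # the generator bug
       "</polygons>\n")

try:
    ET.fromstring(bad)
except ET.ParseError as e:
    print(e)  # expat names the mismatched tag with a line and column
```

With anonymous closers (parentheses, braces), the same bug would often still balance, and the damage would only surface far downstream.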
The same can and does occur with XML, particularly in the absence of a
DTD or Schema.

>> I'd argue that the problems one encounters with a parser are
>> pretty much entirely in the "nasty" corners you are talking
>> about. Once you throw away all that stuff, what you are left
>> with is little more than a lexer with some notion of hierarchical
>> structures. That is of questionable benefit given the price
>> you pay for using XML.
>
> Here we disagree. The fact that I *can* activate the nasty stuff later
> on is more than sufficient for me to incur the overhead.

There's no reason those kinds of options are thrown away if you don't
use XML, either.

>> The most hilarious part to me about XML is the "extensible"
>> part of it. If I had a dime for every program that I've seen that
>> works with XML but starts spewing errors as soon as you
>> "extend" a document with some new tags, I'd be rich.
>
> Oh, yeah. Been there--seen that.
>
> However, XML tends to be more extensible than *anything* else. It
> requires quite a bit of work to create an older, readable XML document
> that cannot be read by newer parsers. Backward compatibility requires a
> bit of thought but normally not *too* much.
>
> Unfortunately, even that small amount of brainpower seems to exceed what
> most programmers are capable of.

Again, I think you give XML too much credit. If you want to design a
format that is extensible, it's not hard to do. If you fail to consider
that option, you end up with what most XML protocols end up being.

>> More importantly though, it's worth pointing out that this
>> XML crud is increasingly being used for stuff that is only
>> read and written by machines, partly due to its spiraling
>> complexity, which makes use by humans too painful. I have
>> to wonder when someone is going to ask if perhaps it makes
>> sense to pass around numbers between computers in a format
>> they understand, instead of an error-prone format that
>> requires so much effort to parse.
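The "bit of thought" that extensibility requires usually amounts to reading the elements you know and ignoring the rest, so old readers survive extended documents. A sketch in Python (the config shape and tag names are invented):

```python
import xml.etree.ElementTree as ET

# An "extended" document containing a tag the old reader never heard of.
doc = ET.fromstring(
    "<config>"
    "<host>db1.example.com</host>"
    "<port>5432</port>"
    "<shiny-new-option>on</shiny-new-option>"  # added in a later version
    "</config>"
)

KNOWN = {"host", "port"}
settings = {el.tag: el.text for el in doc if el.tag in KNOWN}
print(settings)  # -> {'host': 'db1.example.com', 'port': '5432'}
```

The programs that "spew errors" on new tags are the ones that iterate the children and treat anything unrecognized as fatal; nothing about XML itself forces either choice, which is the point above.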
>
> Sure, that's what ASN.1 is for.

You know, one might think that, but strangely people keep on using XML.
I think a lot of the time it's so they can say they have "Web Services".

> I, however, haven't seen the spiraling XML complexity. I use XML for
> what it should be used as--an interchange format.

XML should be used as an interoperable way of representing structured
documents, much as its predecessor SGML was meant to be used.
Unfortunately, it's being used a lot as an interchange format, resulting
in significantly increasing complexity. Things like namespaces and XML
Schema are necessary to properly use it as an interchange format.

> I don't try to make it into a database; I don't try to give it
> semantic meaning; I don't try to index it using completely unrelated
> tools.

It's worth noting that all of those things are being done with XML, and
the W3C and others have created standards for most of them. Nonetheless,
my problems with XML stem from it being used for data interchange.

> Absolutely. XML is not magic. In fact, all of the problems it "solves"
> have been solved before. The difference is that XML allows people to
> apply pressure to the idiots in a simple way--"Does it talk XML? No?
> Come back when it does."

Yes, the key advantage of XML is the idiot buy-in factor. If you did
s/XML/ASN.1/, they'd say you were being ridiculous. Unfortunately,
whenever you try to make something idiot-proof, they just build a better
idiot, which is exactly what happens when you pressure someone to
provide an XML interface to their tool. :-(

--Chris

--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-lpsg
