Andrew Lentvorski wrote:

>First, parsers are *hard*. Every idiot CS major thinks he can write
>a parser for his "little language". They are all wrong.
Perhaps, but writing a grammar for your "little language" and then using a parser generator of some kind is not that difficult.

>Every parser they create is a broken piece of sh*t. It doesn't get
>debugged; it doesn't get tested; it gets thrown out in 12 months
>for another one.

Usually when I write a parser, like with most code, it's a painful iterative process where I write code that doesn't pass tests and then fix it. What you are describing is a defect in a development process that won't be fixed by using XML. The problem will just be moved elsewhere.

>XML *forces* these morons to have to interface with a
>structured, debugged parser.

Ah, I see you've not yet encountered Perl hackers who use regexps to extract data from XML and HTML documents. Trust me, there is no "forcing" going on.

>SAX and DOM have their faults, but at least they get debugged.

I could swear most regexp libraries and parser generators get debugged too. Strangely, this doesn't prevent bugs from cropping up in the code of people who use them.

>Watching programmers writhe in agony because the XML
>parser threw an exception on a boundary case that their
>puny little minds are too narrow to anticipate is a most
>rewarding experience.

First, not all XML parsers throw exceptions (indeed, it's hard to find a C parser that does ;-). Secondly, unless you are talking about a failure during validation, SAX and DOM tend to fail for the same reasons that read() and write() fail. If you are talking about validation, you have a point, but unfortunately most folks using SAX and DOM don't use a DTD or Schema, and therefore get no validation, so SAX and DOM primarily serve as glorified lexers rather than parsers. In many ways, programmers are actually *less* likely to be forced to define a grammar than if they were living in an XML-less world.

>Second, internationalization is hard.

I missed something. What has i18n got to do with XML or JSON?

>How many ways are there to spell Tchaikovsky?
I believe there is only one Cyrillic spelling. There are several different ways to transliterate it into other alphabets.

>The same morons from above get *forced* into dealing
>with this kind of crud with XML when they bump into
>another program which refuses to accept that Author,
>Composer, etc is a unique key.

Not really. Once you have Unicode, it's actually much easier to make such things a unique key. The pain point here normally comes from a failure to recognize that translations between alphabets are not one-to-one mappings. If you either make the alphabet part of the key or use Unicode, then you are good.

>And the whole fact that XML *specifies* Unicode is
>beautiful--no more slacking off and only accepting ASCII
>or, worse, only accepting letters and digits.

It's beautiful on one hand and painful on the other. Certainly this has a negative impact on parsing performance, which has caused me no end of trouble--ironically, to the point where I've been forced to write my own parsers. Furthermore, Unicode is its own messy kettle of fish. Most programming languages have a native string library that isn't 100% compatible with Unicode. Those that are compatible tend to be so because the original string library has been hacked up to handle the changes in the Unicode standard. The end result is that most XML parsers tend to use their own string library and often enforce a particular encoding for parsed strings (and the joy of Unicode is that any given encoding is bound to be really inefficient for someone ;-). So you end up spending a lot of time converting back and forth between your app's "native" string library and that of your parser. It's great and all to think about Unicode when you are writing code, but it's a pain when you have to parse an 800GB file that's all ASCII.

>Third, XML parsers *complain* when you feed them garbage.

You must be using some strange regexp libraries and parser generators.
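To be fair, the XML parsers themselves do complain loudly about garbage; whether that's news depends on what tools you've been using. A quick sketch with Python's xml.etree.ElementTree (my choice of library here, not anything Andrew mentioned) shows a well-formedness error being rejected outright:

```python
import xml.etree.ElementTree as ET

good = "<book><title>Swan Lake</title></book>"
bad = "<book><title>Swan Lake</book></title>"  # close tags out of order

# Well-formed input parses into a tree.
tree = ET.fromstring(good)
print(tree.find("title").text)

# Malformed input is refused; the parser raises instead of guessing.
try:
    ET.fromstring(bad)
except ET.ParseError as err:
    print("rejected:", err)
```

Any decent parser generator gives you the same behavior for your own grammar, of course; that was rather my point.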
;-)

>And herein lies the source of the XML verbosity that everybody
>complains about--balanced close tags.

You will note that JSON also uses balanced tags. I guess the closing tags are named, which is what you are getting at. I would agree that it can be helpful when dealing with human-generated XML docs, but for the most part people find it a pain to generate XML, so it is generated by machines, which tend to balance things out automatically.

>Syntax errors almost always *immediately* cause parsing
>errors because they tend to bump into unbalanced tags; no
>silent degradation here--I approve.

I'm sorry, but the only reason I can imagine for verbose closing tags helping to catch a syntax error would be if a human were generating them, and even then it'd primarily be because they made a typo when typing in the name of the closing tag, which means you're getting a lot of otherwise unnecessary syntax errors. Sure, making people type more means you catch more syntactic errors, but that's not an improvement if those errors are *caused* by them having to type more. In reality, most syntax errors are more complex than unbalanced tags, and really tend only to be caught by things like a grammar or an XML Schema (a DTD will get you halfway there). Often that's not even enough, and the problem can only be identified by semantic analysis. As a consequence, even when using XML, a programmer needs to spend about the same amount of time programming defensively to weed out syntactic errors.

>...can't deal with the fact that almost nothing in real life is
>a useful unique key...

Okay, first there seems to be an odd assumption here that you actually need a unique key. Often you don't. When you do, it is entirely possible to have unique keys based on "real life". The trick is making sure you define the system in such a way that the uniqueness rules make sense. Saying names are unique might seem foolish, unless your database is made up of the trade names of active members of SAG.
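To make the earlier point about Unicode keys concrete: the usual gotcha is that the "same" name can arrive in composed or decomposed form, so a naive byte comparison treats them as different keys. A small sketch in Python (the name and the make_key helper are mine, purely for illustration) showing how normalization fixes that:

```python
import unicodedata

composed = "Dvo\u0159\u00e1k"      # "Dvořák" with precomposed ř and á
decomposed = "Dvor\u030ca\u0301k"  # same name using combining marks

# Codepoint-for-codepoint the two strings differ...
assert composed != decomposed

def make_key(name: str) -> str:
    # ...but normalizing (NFC here) maps equivalent spellings to one key.
    return unicodedata.normalize("NFC", name)

assert make_key(composed) == make_key(decomposed)
```

Define your keys over normalized strings (and record the alphabet if you accept multiple transliterations) and the uniqueness rules hold up fine.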
Either way, XML really doesn't have anything to say about unique keys.

>I can avoid most of the gnarly, nasty corners of XML
>(namespaces and schemas/DTD's) while still retaining most
>of the advantages all while knowing that the gnarly, nasty
>stuff is available if I really need it.

I'd argue that the problems one encounters with a parser are pretty much entirely in the "nasty" corners you are talking about. Once you throw away all that stuff, what you are left with is little more than a lexer with some notion of hierarchical structure. That is of questionable benefit given the price you pay for using XML.

The most hilarious part to me about XML is the "extensible" part of it. If I had a dime for every program I've seen that works with XML but starts spewing errors as soon as you "extend" a document with some new tags, I'd be rich.

More importantly though, it's worth pointing out that this XML crud is increasingly being used for stuff that is only read and written by machines, partly due to its spiraling complexity, which makes use by humans too painful. I have to wonder when someone is going to ask if perhaps it makes sense to pass numbers between computers in a format they understand, instead of an error-prone format that requires so much effort to parse. Absent that, I keep my heart warmed by watching XML junkies who have never written a grammar in their lives spout off about how it's the holy grail that solves problems that were solved decades ago. ;-)

--Chris

--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-lpsg
