Just for reference, my background is in the area of VLSI design, where I have been dealing with proprietary, buggy, piece-of-crap storage and interchange formats for *decades*. Every vendor outputs a different *broken* format; often it is the same as their internal format. Yes, they often introduce bugs into their own data. Each one has a different set of characters that they accept (tool A only takes uppercase ASCII; tool B gives case meaning; tool C can accept punctuation but only allows lowercase).

XML and Unicode are useful for the same reasons that I find open source useful--I can fix the problem if I absolutely need to.

Christopher Smith wrote:
Andrew Lentvorski wrote:
 >First, parsers are *hard*.  Every idiot CS major thinks he can write
 >a parser for his "little language".  They are all wrong.

Perhaps, but writing a grammar for your "little language" and then
using a parser generator of some kind is not that difficult.

Real-world results are against you here ... ;)

I have yet to see a grammar specification for any interchange format I work with other than XML.

Usually when I write a parser, like with most code, it's a painful
iterative process where I write code that doesn't pass tests and
then fix it. What you are describing is a defect in a development
process that won't be fixed by using XML. The problem will just
be moved elsewhere.

No, it doesn't necessarily get fixed. However, it allows those of us who *do* care to be able to cope with the idiocy.

 >XML *forces* these morons to have to interface with a
 >structured, debugged parser.

Ah, I see you've not yet encountered Perl hackers who use regexps
to extract data from XML and HTML documents. Trust me, there
is no "forcing" going on.

Oh, yes, I have. They start with "just a quick hack"--and it works, mostly. It just needs a few hand tweaks. And then that script gets bigger and needs a couple more tweaks. Lather, rinse, repeat--until you have a totally unintelligible mess.

Actually, this is one of my interview questions for VLSI CAD tool administrators. I give them a structured file (slightly modified Spice simulator input deck) and ask them to write code to cope with it. Those who use regexes fail--they invariably have silent failure modes (*very* bad when your script may be a check which has to prevent a $1 million mistake). Those who use a parsing library and build a tree or those who transform it to XML and use DOM/SAX pass.
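To make the failure mode concrete, here's a toy version of that interview problem in Python (the deck and field layout are made up for illustration; real Spice cards are richer). The regex quietly drops a resistor whose value sits on a `+` continuation line; the parser folds continuations first and fails loudly on a short card:

```python
import re

# Toy Spice-like deck (hypothetical example). The second resistor's
# value is on a '+' continuation line -- legal Spice, easy to miss.
deck = """\
R1 n1 n2 1k
R2 n2 n3
+ 2k
"""

# Regex approach: yields only what happens to match, silently.
pattern = re.compile(r"^R(\w+)\s+(\w+)\s+(\w+)\s+(\S+)$", re.M)
regex_hits = pattern.findall(deck)          # R2 is dropped without a peep

# Parser approach: fold continuation lines first, then insist every
# resistor card has all four fields, raising loudly otherwise.
def parse(deck):
    cards, out = [], []
    for line in deck.splitlines():
        if line.startswith("+"):
            cards[-1] += " " + line[1:].strip()   # continuation line
        elif line.strip():
            cards.append(line.strip())
    for card in cards:
        fields = card.split()
        if fields[0].startswith("R") and len(fields) != 4:
            raise ValueError("malformed resistor card: " + card)
        out.append(fields)
    return out

print(len(regex_hits))   # 1 -- the regex lost a component silently
print(len(parse(deck)))  # 2 -- continuation handled, or a loud error
```

That silent drop is exactly the kind of thing that turns into a $1 million mistake.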

First, not all XML parsers throw exceptions (indeed it's hard
to find a C parser that does ;-). Secondly, unless you are talking
about a failure during validation, SAX and DOM tend to fail for
the same reasons that read() and write() fail. If you are talking
about validation, you have a point, but unfortunately most folks
using SAX and DOM don't use a DTD or Schema, and therefore
no validation, and so SAX and DOM primarily serve as glorified
lexers, rather than parsers. In many ways, programmers are
actually *less* likely to be forced to define a grammar than if
they were living in an XML-less world.

Well, since my sample size for those who defined a grammar is *zero*, I am in the same boat either way.

The difference is that with XML, *I* can create the DTD or Schema to do validation even if the original author doesn't. This allows me to trap badly formed incoming data *before* it has a chance to enter my system.
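As a minimal sketch of trapping bad data at the boundary (Python stdlib; note the stdlib parser only enforces well-formedness, so the hand-rolled structural check below stands in for the DTD/Schema validation you'd get from a real validator such as lxml's XMLSchema):

```python
import xml.etree.ElementTree as ET

def load_cell(text):
    root = ET.fromstring(text)            # raises ParseError if malformed
    if root.tag != "cell" or "name" not in root.attrib:
        raise ValueError("not a <cell> with a name attribute")
    return root

good = load_cell('<cell name="inv1"/>')   # passes both checks

try:
    load_cell('<cell name="inv2">')       # generator bug: unclosed tag
except ET.ParseError:
    print("trapped before it entered the system")
```

The point is that the rejection happens at the door, not three tools downstream.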

It's beautiful on one hand and painful on the other. Certainly
this has a negative impact on parsing performance, which has
caused me no end of trouble--ironically to the point where I've
been forced to write my own parsers.

Wow. So why was your parser so much faster and why couldn't the normal system do that?

Furthermore, Unicode is its own messy kettle of fish.

Sure. Internationalization is *hard*. Unicode needed some very smart people working for quite a lot of years to produce the standard that we have.

So you end up spending a lot of time
converting back and forth between your app's "native" string
library and that of your parser.

Really? Ouch. I don't tend to hit that problem. However, I tend to only use languages that have native Unicode string types.

If an XML parser made me convert away from that, I just wouldn't use it.

It's great and all to think about Unicode when you are writing
code, but it's a pain when you have to parse an 800GB file that's
all ASCII.

Heh. For me, Unicode isn't the problem. Most of my data is numeric. Writing out 10^8 polygon coordinates in UTF-8 is where all the bloat is. I'll still take XML, though, so that the human readable *attributes* on those polygons are internationalized.
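A rough illustration of where that bloat lives (made-up numbers, scaled down from 10^8 so it runs in a blink): the same coordinates as decimal text inside XML versus packed 32-bit integers.

```python
import struct
import xml.etree.ElementTree as ET

# 1000 toy polygon vertices.
coords = [(x, x * 2 + 1) for x in range(1000)]

# As XML: one <pt> element per vertex, coordinates as decimal text.
root = ET.Element("polygon")
for x, y in coords:
    ET.SubElement(root, "pt", x=str(x), y=str(y))
xml_bytes = ET.tostring(root)

# As binary: two little-endian 32-bit ints per vertex.
packed = b"".join(struct.pack("<ii", x, y) for x, y in coords)

print(len(xml_bytes), len(packed))  # the XML is a few times larger
```

Multiply that ratio by 10^8 vertices and you see why the geometry, not the attribute text, is where the pain is.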

 >And herein lies the source of the XML verbosity that everybody
 >complains about--balanced close tags.

You will note that JSON also uses balanced tags. I guess the
closing tags are named, which is what you are getting at.

Yes, sorry, that wasn't explicit. Named closing tags seem to be *the* feature that made XML take off. Even with smart editors, for some reason people never took to s-expressions but did take to XML. The Lisp folks whine about this continuously.

I really think this is the key. The named closing tags guarantee a certain amount of error localization.

I'm sorry, but the only reason I can imagine for verbose closing
tags to help catch a syntax error would be if a human was
generating them, and even then primarily it'd be because they
made a typo when typing in the name of the closing tag,
which means you're getting a lot of otherwise unnecessary
syntax errors. Sure, making people type more means you catch
more syntactic errors, but that's not an improvement if those
errors are *caused* by them having to type more.

Or, a buggy interchange generator. This is the bigger problem. Quite often one of the lesser used code paths in the generator acquires a bug. Without named closing tags, this bug can slide past a lot of parsers as it occurs infrequently. Even then, it can be really hard to track down, since the failure often shows up a long way from where the bug actually lives.
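A minimal sketch of that localization, using Python's stdlib parser on a made-up document where the generator emitted the close tags in the wrong order:

```python
import xml.etree.ElementTree as ET

# A generator bug that crosses two close tags. Because the close tags
# are *named*, the parser stops at the exact point where the names
# stop matching, instead of hundreds of elements later.
broken = "<netlist>\n  <cell>\n    <pin/>\n  </netlist>\n</cell>\n"

try:
    ET.fromstring(broken)
except ET.ParseError as e:
    print(e.position)  # (line, column), pointing at the mismatched tag
```

With anonymous closers (parens, braces), the same bug just shifts the nesting and the error surfaces wherever the counts finally fail to balance.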

I'd argue that the problems one encounters with a parser are
pretty much entirely in the "nasty" corners you are talking
about. Once you throw away all that stuff what you are left
with is little more than a lexer with some notion of hierarchical
structures. That is of questionable benefit given the price
you pay for using XML.

Here we disagree. The fact that I *can* activate the nasty stuff later on is more than sufficient for me to incur the overhead.

The most hilarious part to me about XML is the "extensible"
part of it. If I had a dime for every program that I've seen that
works with XML but starts spewing errors as soon as you
"extend" a document with some new tags, I'd be rich.

Oh, yeah.  Been there--seen that.

However, XML tends to be more extensible than *anything* else. It requires quite a bit of work to create an older, readable XML document that cannot be read by newer parsers. Backward compatibility requires a bit of thought but normally not *too* much.
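The small amount of thought I mean is basically this discipline (Python sketch, made-up tags): walk the tags you know and skip the rest, so documents survive extension in either direction.

```python
import xml.etree.ElementTree as ET

KNOWN = {"width", "height"}   # the tags this version understands

def read_cell(text):
    props = {}
    for child in ET.fromstring(text):
        if child.tag in KNOWN:
            props[child.tag] = child.text
        # unknown tags (added by a newer writer) are simply ignored
    return props

old = "<cell><width>4</width><height>2</height></cell>"
new = "<cell><width>4</width><height>2</height><shield>1</shield></cell>"

print(read_cell(old) == read_cell(new))  # True -- extension is harmless
```

The programs that "start spewing errors" on new tags are the ones that hard-fail on anything outside their known set.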

Unfortunately, even that small amount of brainpower seems to exceed what most programmers are capable of.

More importantly though, it's worth pointing out that this
XML crud is increasingly being used for stuff that is only
read and written by machines, partly due to its spiraling
complexity which makes use by humans too painful. I have
to wonder when someone is going to ask if perhaps it makes
sense to pass around numbers between computers in a format
they understand, instead of an error prone format that
requires so much effort to parse.

Sure, that's what ASN.1 is for. Really. If you have never used ASN.1, you should go look up the standard. It is especially useful for numeric data. I tend to use ASN.1 when I need to move numeric data.
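To give a feel for what I mean about numeric data, here is a hand-rolled sketch of how ASN.1's DER packs an INTEGER: a 0x02 tag, a length, then the minimal big-endian two's-complement content bytes. This is illustration only (positive values, short-form lengths); real code would use a proper ASN.1 library.

```python
def der_integer(n):
    """DER-encode a non-negative integer: tag 0x02, length, content."""
    if n < 0:
        raise NotImplementedError("positive values only in this sketch")
    body = n.to_bytes(max(1, (n.bit_length() + 7) // 8), "big")
    if body[0] & 0x80:          # keep the sign bit clear for positives
        body = b"\x00" + body
    return bytes([0x02, len(body)]) + body

print(der_integer(5).hex())        # 020105 -- three bytes total
print(der_integer(100000).hex())   # 02030186a0 -- five bytes total
```

Compare five bytes for 100000 against the XML equivalent, angle brackets and all.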

I, however, haven't seen the spiraling XML complexity. I use XML for what it should be used as--an interchange format. I don't try to make it into a database; I don't try to give it semantic meaning; I don't try to index it using completely unrelated tools.

Absent that, I keep my heart warmed by watching XML
junkies who have never written a grammar in their lives
spout off about how it's the holy grail that solves problems
that were solved decades ago. ;-)

Absolutely. XML is not magic. In fact, all of the problems it "solves" have been solved before. The difference is that XML allows people to apply pressure to the idiots in a simple way--"Does it talk XML? No? Come back when it does."

-a

--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-lpsg
