Andrew Lentvorski wrote:
>First, parsers are *hard*.  Every idiot CS major thinks he can write
>a parser for his "little language".  They are all wrong.

Perhaps, but writing a grammar for your "little language" and then
using a parser generator of some kind is not that difficult.
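To put a number on "not that difficult": even without a
generator, a hand-rolled recursive-descent parser for a toy
expression grammar fits in a few dozen lines. A minimal
Python sketch (the grammar and all the names are mine,
purely for illustration):

```python
import re

# Toy grammar:  expr   := term (('+'|'-') term)*
#               term   := factor (('*'|'/') factor)*
#               factor := NUMBER | '(' expr ')'
TOKEN = re.compile(r"\s*(?:(\d+)|(.))")

def tokenize(src):
    # Yield ("NUM", value) or ("OP", char) pairs, skipping whitespace.
    for num, op in TOKEN.findall(src):
        yield ("NUM", int(num)) if num else ("OP", op)

class Parser:
    def __init__(self, src):
        self.toks = list(tokenize(src))
        self.pos = 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else (None, None)

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        val = self.term()
        while self.peek() in (("OP", "+"), ("OP", "-")):
            _, op = self.next()
            val = val + self.term() if op == "+" else val - self.term()
        return val

    def term(self):
        val = self.factor()
        while self.peek() in (("OP", "*"), ("OP", "/")):
            _, op = self.next()
            val = val * self.factor() if op == "*" else val / self.factor()
        return val

    def factor(self):
        kind, v = self.next()
        if kind == "NUM":
            return v
        if (kind, v) == ("OP", "("):
            val = self.expr()
            assert self.next() == ("OP", ")"), "unbalanced parenthesis"
            return val
        raise SyntaxError("unexpected token: %r" % (v,))

print(Parser("2 * (3 + 4)").expr())  # -> 14
```

Swap the semantic actions for tree-building and you have the
front end of a real "little language" interpreter.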

>Every parser they create is a broken piece of sh*t.  It doesn't get
>debugged; it doesn't get tested; it gets thrown out in 12 months
>for another one.

Usually when I write a parser, like with most code, it's a painful
iterative process where I write code that doesn't pass tests and
then fix it. What you are describing is a defect in a development
process that won't be fixed by using XML. The problem will just
be moved elsewhere.

>XML *forces* these morons to have to interface with a
>structured, debugged parser.

Ah, I see you've not yet encountered Perl hackers who use regexps
to extract data from XML and HTML documents. Trust me, there
is no "forcing" going on.
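For the skeptical, here's a minimal illustration (in Python
rather than Perl, with made-up sample data) of how quickly
the regexp approach falls over where a real parser doesn't:

```python
import re
import xml.etree.ElementTree as ET

doc = '<book><title lang="en">Parsing &amp; Pain</title></book>'

# The regexp "works"... until an attribute, entity, or CDATA
# section shows up. Here the lang attribute already breaks it.
match = re.search(r"<title>(.*?)</title>", doc)
print(match)  # -> None

# A real parser handles attributes and entity references for free.
title = ET.fromstring(doc).find("title")
print(title.text)  # -> Parsing & Pain
```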

>SAX and DOM have their faults, but at least they get debugged.

I could swear most regexp libraries and parser generators get
debugged too. Strangely this doesn't prevent bugs from cropping
up in the code of people who use them.

>Watching programmers writhe in agony because the XML
>parser threw an exception on a boundary case that their
>puny little minds are too narrow to anticipate is a most
>rewarding experience.

First, not all XML parsers throw exceptions (indeed, it's
hard to find a C parser that does ;-). Secondly, unless you
are talking about a failure during validation, SAX and DOM
tend to fail for the same reasons that read() and write()
fail. If you are talking about validation, you have a point,
but unfortunately most folks using SAX and DOM don't use a
DTD or Schema, and therefore get no validation, so SAX and
DOM primarily serve as glorified lexers rather than parsers.
In many ways, programmers are actually *less* likely to be
forced to define a grammar than if they were living in an
XML-less world.
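A quick sketch of the "glorified lexer" point, using
Python's non-validating ElementTree and an invented
document: the parser happily accepts well-formed nonsense,
so the grammar checks land right back in application code:

```python
import xml.etree.ElementTree as ET

# Well-formed, so a non-validating parser is perfectly happy --
# even though <quantity> holding "banana" would fail any sane schema.
order = ET.fromstring(
    "<order><quantity>banana</quantity><sku/><sku/><sku/></order>")
print(order.find("quantity").text)   # -> banana

# Without a DTD or Schema, enforcing the grammar is back in your hands:
def check_quantity(elem):
    qty = elem.find("quantity").text
    if not qty.isdigit():
        raise ValueError("quantity must be an integer: %r" % qty)
    return int(qty)

try:
    check_quantity(order)
except ValueError as err:
    print(err)   # the "validation" you ended up writing yourself
```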

>Second, internationalization is hard.

I missed something. What has i18n got to do with XML or JSON?

>How many ways are there to spell Tchaikovsky?

I believe there is only one Cyrillic spelling. There are
several different ways to transliterate it into other
alphabets.

>The same morons from above get *forced* into dealing
>with this kind of crud with XML when they bump into
>another program which refuses to accept that Author,
>Composer, etc is a unique key.

Not really. Once you have Unicode it's actually much easier
to make such things a unique key. The pain point here
normally comes from failing to recognize that
transliterations between alphabets are not one-to-one
mappings. If you either make the alphabet part of the key,
or use Unicode, then you are good.
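One wrinkle worth flagging here (a Python sketch with an
example name I picked myself): even within Unicode, "the
same" string can be spelled as different code-point
sequences, so keys generally need normalizing before
comparison:

```python
import unicodedata

# "Dvořák" with precomposed characters (NFC form)...
nfc = "Dvo\u0159\u00e1k"
# ...and the same name with combining accents (NFD form):
# r + combining caron, a + combining acute.
nfd = unicodedata.normalize("NFD", nfc)

print(nfc == nfd)          # -> False: different code points, same text
print(len(nfc), len(nfd))  # -> 6 8

# Normalize before using strings as keys, or identical names won't match.
print(unicodedata.normalize("NFC", nfd) == nfc)  # -> True
```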

>And the whole fact that XML *specifies* Unicode is
>beautiful--no more slacking off and only accepting ASCII
>or, worse, only accepting letters and digits.

It's beautiful on one hand and painful on the other.
Certainly this has a negative impact on parsing performance,
which has caused me no end of trouble -- ironically, to the
point where I've been forced to write my own parsers.

Furthermore, Unicode is its own messy kettle of fish. Most
programming languages have a native string library that
isn't 100% compatible with Unicode. Those that are
compatible tend to be that way because the original string
library has been hacked up to handle changes in the Unicode
standard. The end result is that most XML parsers tend to
use their own string library and often enforce a particular
encoding for parsed strings (and the joy of Unicode is that
any given encoding is bound to be really inefficient for
someone ;-). So you end up spending a lot of time converting
back and forth between your app's "native" string library
and that of your parser.
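A rough sketch of that conversion tax in Python (the two
encodings here are illustrative; which ones you actually hit
depends on your parser and platform):

```python
# Round-tripping between a parser's preferred encoding and the
# application's native strings is pure overhead when the data is ASCII.
payload = "value," * 1000            # 6000 ASCII characters

as_utf8 = payload.encode("utf-8")    # 1 byte per ASCII character
as_utf16 = payload.encode("utf-16")  # 2 bytes per character, plus a BOM

print(len(as_utf8))    # -> 6000
print(len(as_utf16))   # -> 12002

# The data survives, but every conversion walks the entire buffer:
assert as_utf16.decode("utf-16") == payload
```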

It's great and all to think about Unicode when you are writing
code, but it's a pain when you have to parse an 800GB file that's
all ASCII.

>Third, XML parsers *complain* when you feed them garbage.

You must be using some strange regexp libraries and parser
generators. ;-)

>And herein lies the source of the XML verbosity that everybody
>complains about--balanced close tags.

You will note that JSON also uses balanced delimiters. I
guess what you are getting at is that XML's closing tags are
named. I would agree that can be helpful when dealing with
human-generated XML docs, but for the most part people find
it a pain to write XML by hand, so it is generated by
machines, which tend to balance things out automatically.
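For what it's worth, both formats balance their delimiters
and both complain loudly about truncated input; the
difference is mostly that XML's closes are named and
therefore cost extra bytes. A small Python illustration with
a record I made up:

```python
import json
import xml.etree.ElementTree as ET

record = '{"composer": "Tchaikovsky", "opus": 71}'
doc = "<record><composer>Tchaikovsky</composer><opus>71</opus></record>"

print(len(record), len(doc))  # the named close tags are the difference

# Chop the last character off each and both parsers refuse the garbage.
errors = []
for parse, text in ((json.loads, record[:-1]), (ET.fromstring, doc[:-1])):
    try:
        parse(text)
        errors.append(None)
    except Exception as err:
        errors.append(type(err).__name__)
print(errors)  # -> ['JSONDecodeError', 'ParseError']
```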

>Syntax errors almost always *immediately* cause parsing
>errors because they tend to bump into unbalanced tags; no
>silent degradation here--I approve.

I'm sorry, but the only reason I can imagine for verbose
closing tags helping to catch a syntax error would be if a
human were generating them, and even then primarily because
they made a typo when typing the name of the closing tag --
which means you're getting a lot of otherwise unnecessary
syntax errors. Sure, making people type more means you catch
more syntactic errors, but that's not an improvement if
those errors are *caused* by them having to type more.

In reality, most syntax errors are more complex than
unbalanced tags, and tend only to be caught by things like a
grammar or an XML Schema (a DTD will get you halfway there).
Often even that's not enough, and the problem can only be
identified by semantic analysis. As a consequence, even when
using XML, a programmer needs to spend about the same amount
of time programming defensively to weed out syntactic
errors.

>...can't deal with the fact that almost nothing in real life is
>a useful unique key...

Okay, first there seems to be an odd assumption here that you
actually need a unique key. Often you don't. When you do, it
is entirely possible to have unique keys based on "real life".
The trick is making sure you define the system in such a way
that the uniqueness rules make sense. Saying names are
unique might seem foolish, unless your database is made up
of the trade names of active members of SAG.

Either way, XML really doesn't have anything to say about
unique keys.

>I can avoid most of the gnarly, nasty corners of XML
>(namespaces and schemas/DTD's) while still retaining most
>of the advantages all while knowing that the gnarly, nasty
>stuff is available if I really need it.

I'd argue that the problems one encounters with a parser are
pretty much entirely in the "nasty" corners you are talking
about. Once you throw away all that stuff, what you are left
with is little more than a lexer with some notion of
hierarchical structure. That is of questionable benefit
given the price you pay for using XML.

The most hilarious part to me about XML is the "extensible"
part of it. If I had a dime for every program that I've seen that
works with XML but starts spewing errors as soon as you
"extend" a document with some new tags, I'd be rich.

More importantly though, it's worth pointing out that this
XML crud is increasingly being used for stuff that is only
read and written by machines, partly due to its spiraling
complexity which makes use by humans too painful. I have
to wonder when someone is going to ask if perhaps it makes
sense to pass around numbers between computers in a format
they understand, instead of an error-prone format that
requires so much effort to parse.
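For the sake of argument, a Python sketch of what "a format
they understand" might look like (the layout here is one I
invented; any fixed binary layout makes the point):

```python
import struct

values = [3.14159, 2.71828, 1.41421]

# A fixed binary layout: little-endian, three IEEE-754 doubles.
packed = struct.pack("<3d", *values)
print(len(packed))   # -> 24 bytes, readable with no lexing at all

# The XML-ish equivalent must be formatted, escaped, lexed,
# and converted back to floats on the other end.
as_xml = "".join("<n>%r</n>" % v for v in values)
print(len(as_xml))   # -> 42 bytes of text to parse

# The binary round trip is exact, too -- no repr/strtod wobble.
assert list(struct.unpack("<3d", packed)) == values
```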

Absent that, I keep my heart warmed by watching XML
junkies who have never written a grammar in their lives
spout off about how it's the holy grail that solves problems
that were solved decades ago. ;-)

--Chris

--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-lpsg
