Christopher Smith wrote:
I would argue you can cope with the idiocy regardless. However, it is
fair to say that XML does provide some buffers against certain types of
idiocy. That said, there are several other approaches which do a much
better job of buffering idiocy.
Maybe. But writing a program to eat "HSPICE Data Output File" format is
a lot easier when it is in undocumented XML than when it is in
undocumented binary. The simple addition of knowing where the
delimiters are helps tremendously.
Actually, this is one of my interview questions for VLSI CAD tool
administrators. I give them a structured file (slightly modified Spice
simulator input deck) and ask them to write code to cope with it. Those
who use regexes fail--they invariably have silent failure modes (*very*
bad when your script may be a check which has to prevent a $1 million
mistake).
It's weird, any regexp library I've seen has a "match" operation that
can and does fail when it doesn't get a match. That said, trying to use
a regexp to parse a file format is an incredible pain to get right.
The problem is rarely failing to get a match. It's getting a
false match.
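To illustrate the false-match failure mode, here is a toy sketch (the deck lines, the pattern, and the card format are all invented for the example, not taken from any real SPICE dialect). The pattern "works" on a real resistor card, but it also silently matches the same text inside a comment line, so the script reports bogus data instead of failing:

```python
import re

# Hypothetical deck lines: a resistor card, and a comment that merely
# mentions a resistor. A naive pattern matches both.
lines = [
    "R1 node1 node2 1k",               # real resistor card
    "* swap R1 node1 node2 1k later",  # comment -- should be ignored
]

pattern = re.compile(r"R\d+\s+(\w+)\s+(\w+)\s+(\S+)")

for line in lines:
    m = pattern.search(line)
    if m:
        # Both lines land here -- the comment produces a false match,
        # and nothing anywhere signals that something went wrong.
        print("matched:", m.groups())
```

Note that the match operation itself never "failed" -- it succeeded twice, which is exactly the silent failure mode: the error surfaces only later, in the downstream result.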
Regardless of whether it's written in XML, you can write a grammar for
what you perceive is this undocumented format and use it to validate
data. Unfortunately, much as with a DTD or Schema that's created in a
black box scenario, you might end up with some false negatives.
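A minimal sketch of that validate-against-a-grammar idea (the "productions" here are a made-up subset of a deck format, chosen only to show the shape of the approach): every line must match exactly one known production, and anything unrecognized is rejected loudly rather than guessed at. The false-negative risk is visible too -- any legitimate card type you didn't anticipate gets flagged as an error.

```python
import re

# A tiny, hypothetical "grammar" for a deck subset. These productions
# are illustrative, not a real SPICE grammar.
PRODUCTIONS = {
    "comment":  re.compile(r"\*.*"),
    "resistor": re.compile(r"R\w+\s+\w+\s+\w+\s+\S+"),
    "end":      re.compile(r"\.end", re.IGNORECASE),
}

def validate(deck_lines):
    """Return a list of (line_number, text) for every unrecognized line."""
    errors = []
    for n, line in enumerate(deck_lines, 1):
        line = line.strip()
        if not line:
            continue
        if not any(p.fullmatch(line) for p in PRODUCTIONS.values()):
            errors.append((n, line))  # reject loudly; never guess
    return errors

print(validate(["R1 a b 1k", "Rload out 0 50", "garbage here", ".END"]))
```

Run against the sample above, only line 3 is flagged; a correct-but-unanticipated card type would be flagged the same way, which is the black-box false negative the paragraph warns about.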
For binary formats, that's not really true. Writing a validator for
binary formats means bugs in the "schema" as well as the "validator".
I've had to try to get Unicode data moving back and forth between Oracle
databases, message queuing software, and tools written in different
languages, and in my experience each transition from one tool to
another involved some lovely compatibility issues, often even when just
using the old character-set world would have made it pretty easy.
Oh, yeah. Fortunately, most things are now speaking UTF-8.
I am surprised that there isn't a Boost substitute for C++ String that
is fully Unicode compliant. C++ STL String has lots of idiocies.
<googles>
Yecch. It looks like C++ hasn't made any progress.
If done properly, it's possible to have a data set large enough
that it can't be rendered into a DOM on a 32-bit machine, but can still be
rendered into a parse tree for a much better-defined format.
Polygon information normally can't be rendered into a DOM format anyway.
Accessing polygons hierarchically rather than spatially makes very
little sense.
Again, I think you give XML too much credit. If you want to design a
format that is extensible, it's not hard to do it.
Actually, it is. Your parser has to parse generally. Most people who
design a format invariably create a parser with specific assumptions
because "it's easier". Later, they can't change that because "we have
all this existing data".
Using XML forces the use of general parsing early on. Especially since
small parsing jobs tend to use DOM to start since it can normally be
rendered directly into an in-memory, tree data structure.
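A minimal sketch of that DOM-style pattern, using Python's standard-library ElementTree (the element names and values here are invented, not from any actual tool's output): the whole document is parsed generally into an in-memory tree first, and only then does format-specific code walk it.

```python
import xml.etree.ElementTree as ET

# Hypothetical simulator output. The parser below knows nothing about
# this format; it just builds the tree, and the format-specific logic
# is a query over that tree.
doc = """<results>
  <sweep name="vdd">
    <point v="1.0" i="0.012"/>
    <point v="1.1" i="0.014"/>
  </sweep>
</results>"""

root = ET.fromstring(doc)
points = [(p.get("v"), p.get("i")) for p in root.iter("point")]
print(points)  # [('1.0', '0.012'), ('1.1', '0.014')]
```

Because the general parse happens first, adding a new element later (extensibility) doesn't break old readers -- they simply don't query for it.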
Yes, the key advantage of XML is the idiot buy-in factor. If you did
s/XML/ASN.1/, they'd say you were being ridiculous. Unfortunately,
whenever you try to make something idiot proof, they just build a better
idiot, which is exactly what happens when you pressure someone to
provide an XML interface to their tool. :-(
The main problem there is that the vendors actively
sabotage interchange, because being able to dump your data allows you to change
vendors. In the VLSI design industry, the vendors sabotaged EDIF
generators just like they now sabotage their XML generators.
-a
--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-lpsg