Andrew Lentvorski wrote:
Christopher Smith wrote:
I would argue you can cope with the idiocy regardless. However, it is
fair to say that XML does provide some buffers against certain types of
idiocy. That said, there are several other approaches which do a much
better job of buffering idiocy.
Maybe. But writing a program to eat "HSPICE Data Output File" format
is a lot easier when it is in undocumented XML than when it is in
undocumented binary. The simple addition of knowing where the
delimiters are helps tremendously.
...and at the same time entirely new problems are generated that tend to
be easier to deal with in the binary world. For starters, it's bloody
hard to get an idea of limits on the size of your numbers. If I had a
dime for every idiot who's used SAX or DOM to pull out a value for a
32-bit unsigned int only to find out that the value can be a 64-bit
signed int....
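A minimal Python sketch of that trap (the element name "count" is invented for illustration): nothing in the XML text itself declares the field's width, so a consumer that assumes 32 bits either wraps silently or blows up, depending on how it narrows the value.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical fragment; the tag names are made up for this sketch.
doc = ET.fromstring("<result><count>4294967296</count></result>")
value = int(doc.find("count").text)  # 2**32: one past the 32-bit unsigned range

# Masking to 32 bits wraps with no error at all:
wrapped = value & 0xFFFFFFFF
assert value == 2**32
assert wrapped == 0

# A strict 32-bit pack at least raises instead of wrapping:
try:
    struct.pack("<I", value)
    overflowed = False
except struct.error:
    overflowed = True
assert overflowed
```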
Interestingly, doing a bit of differential analysis on binary formats
tends to yield far more information than from human readable formats. In
some ways, working the way humans works, actually makes the job harder.
Actually, this is one of my interview questions for VLSI CAD tool
administrators. I give them a structured file (slightly modified Spice
simulator input deck) and ask them to write code to cope with it.
Those who use regexes fail--they invariably have silent failure modes
(*very* bad when your script may be a check which has to prevent a
$1 million mistake).
It's weird; every regexp library I've seen has a "match" operation that
can and does fail when it doesn't get a match. That said, trying to use
a regexp to parse a file format is an incredible pain to get right.
The problem is rarely failing to get a match. It will be getting a
false match.
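The false-match failure mode is easy to reproduce. A Python sketch, with made-up deck lines and a deliberately naive pattern: the regex reports success on a commented-out card and returns the wrong data with no error at all.

```python
import re

# A naive pattern for a SPICE-style ".model" card; the deck lines are invented.
pattern = re.compile(r"\.model\s+(\w+)")

good = ".model nch nmos level=49"
trap = "* commented out: .model old_nch nmos level=49"

# The intended match works...
assert pattern.search(good).group(1) == "nch"
# ...but the commented-out line also "matches", silently:
assert pattern.search(trap).group(1) == "old_nch"
```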
The issue is the same with false positives. This is a problem with a
poorly specified regexp. Sadly you can have the same effect with some
fairly well specified SAX or DOM-based parsing operations, or for that
matter with a poorly defined grammar. Idiocy tends to trump one's best
efforts in this area.
Regardless of whether it's written in XML, you can write a grammar for
what you perceive is this undocumented format and use it to validate
data. Unfortunately, much as with a DTD or Schema that's created in a
black box scenario, you might end up with some false negatives.
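A tiny sketch of that black-box false-negative problem in Python (the card syntax and pattern are invented): a "grammar" inferred only from samples you happened to observe rejects legal input the first time a variant shows up.

```python
import re

# A "schema" inferred from observed samples of an undocumented format:
# every resistor card seen so far looked like "R<n> <node> <node> <value>".
node_line = re.compile(r"^R\d+ \d+ \d+ [0-9.]+$")

assert node_line.match("R1 1 2 4.7") is not None
# A legal-but-unseen variant (engineering suffix on the value) fails
# validation -- a false negative from the black-box grammar:
assert node_line.match("R2 2 0 4.7k") is None
```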
For binary formats, that's not really true. Writing a validator for
binary formats means bugs in the "schema" as well as the "validator".
Huh? You lost me here.
I've had to try to get Unicode data moving back and forth between Oracle
databases, message queuing software, and tools written in different
languages, and it's been my experience that each transition from one tool to
another involved some lovely compatibility issues, often even when just
using the old character-set world would have made it pretty easy.
Oh, yeah. Fortunately, most things are now speaking UTF-8.
All the Unicode in the above example was encoded in UTF-8.
Unfortunately, UTF-8 "support" tends to be a fairly loosely defined
term. ;-)
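Two byte sequences show what "loosely defined" means in practice: overlong encodings (Java's modified UTF-8 writes U+0000 as two bytes) and CESU-8 surrogate pairs. Strict decoders reject both; plenty of tools that claim UTF-8 support let them through. A Python sketch:

```python
# Two byte sequences that real-world "UTF-8" decoders disagree on:
overlong_nul = b"\xc0\x80"  # Java's modified UTF-8 writes U+0000 this way
cesu8_pair = b"\xed\xa0\x81\xed\xb0\x80"  # CESU-8 surrogate pair for U+10400

rejected = []
for blob in (overlong_nul, cesu8_pair):
    try:
        blob.decode("utf-8")
        rejected.append(False)
    except UnicodeDecodeError:
        rejected.append(True)

# CPython's strict decoder refuses both; many other decoders do not,
# and interop breaks wherever two tools disagree.
assert rejected == [True, True]
```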
I am surprised that there isn't a Boost substitute for C++ String that
is fully Unicode compliant. C++ STL String has lots of idiocies.
Actually std::basic_string&lt;uint32_t&gt; works about as fine as anything for
UTF-32 Unicode strings. When C++ was first getting standardized, everyone
thought std::basic_string&lt;wchar_t&gt; would provide perfect Unicode support (and it
did, for revisions of Unicode back then). Converting to and from the
native character set is left to locales, and so it becomes a platform
specific thing. Ironically in many ways C++ tends to be the best
language to work with Unicode stuff, because its lack of a universal
string library tends to mean components are written to make it easy to
move between whatever_string_component_is_using and whatever Unicode
solution you've decided on.
<googles>
Yecch. It looks like C++ hasn't made any progress.
The ICU4C library is actually one of the most complete implementations
of Unicode support that I've seen anywhere. I'll ask again: what
programming language are you using which has got Unicode down so well?
I'd like to use it.
Again, I think you give XML too much credit. If you want to design a
format that is extensible, it's not hard to do it.
Actually, it is. Your parser has to parse generally. Most people who
design a format invariably create a parser with specific assumptions
because "it's easier". Later, they can't change that because "we have
all this existing data".
That is a problem with idiots, but the rest of the population learns
from these experiences and builds grammars with support for extensions
at a later date. There are all kinds of examples of this in just the SMTP
protocol that's making this discussion possible.
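The SMTP case is a good one: EHLO replies were deliberately specified so a client can carry unknown extension keywords instead of choking on them. A rough Python sketch of that extension-point idea (the reply lines and the made-up keyword are invented):

```python
# An EHLO-style multiline reply; the server lines are invented for this sketch.
reply = [
    "250-mail.example.com",
    "250-SIZE 14680064",
    "250-STARTTLS",
    "250-X-FUTURE-EXTENSION foo bar",  # unknown today, legal tomorrow
    "250 HELP",
]

offered = {}
for line in reply[1:]:
    keyword, _, params = line[4:].partition(" ")
    offered[keyword] = params  # unknown keywords are kept, not rejected

assert offered["SIZE"] == "14680064"
assert "X-FUTURE-EXTENSION" in offered  # the grammar absorbed the extension
```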
Using XML forces the use of general parsing early on.
I've seen no evidence of that. If anything, the evidence has been to the
contrary.
Especially since small parsing jobs tend to use DOM to start, since it
can normally be rendered directly into an in-memory tree data
structure.
As you yourself acknowledged, small parsing jobs tend to start off using
regexps, and from there they fork to SAX or DOM depending on the needs
of the app. Having an in-memory tree data structure really doesn't help
much with making your format extensible, though...
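For reference, the in-memory tree in question is just what a DOM parser hands back; a minimal Python sketch with an invented two-element document:

```python
from xml.dom.minidom import parseString

# DOM renders the whole document into an in-memory tree up front;
# the document and tag names here are invented for illustration.
dom = parseString("<deck><card name='a'/><card name='b'/></deck>")
names = [c.getAttribute("name") for c in dom.getElementsByTagName("card")]
assert names == ["a", "b"]
```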
--Chris
--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-lpsg