Andrew Lentvorski wrote:
Christopher Smith wrote:
I would argue you can cope with the idiocy regardless. However, it is
fair to say that XML does provide some buffers against certain types of
idiocy. That said, there are several other approaches which do a much
better job of buffering idiocy.
Maybe. But writing a program to eat "HSPICE Data Output File" format
is a lot easier when it is in undocumented XML than when it is in
undocumented binary. The simple addition of knowing where the
delimiters are helps tremendously.
...and at the same time entirely new problems are generated that tend to
be easier to deal with in the binary world. For starters, it's bloody
hard to get an idea of limits on the size of your numbers. If I had a
dime for every idiot who's used SAX or DOM to pull out a value for a
32-bit unsigned int only to find out that the value can be a 64-bit
signed int....
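A minimal Python sketch of that trap (the element name "count" is invented for illustration): nothing in the XML text itself declares the field's width, so a consumer that assumes 32 bits either wraps silently or blows up, depending on how it narrows the value.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical fragment; the tag names are made up for this sketch.
doc = ET.fromstring("<result><count>4294967296</count></result>")
value = int(doc.find("count").text)  # 2**32: one past the 32-bit unsigned range

# Masking to 32 bits wraps with no error at all:
wrapped = value & 0xFFFFFFFF
assert value == 2**32
assert wrapped == 0

# A strict 32-bit pack at least raises instead of wrapping:
try:
    struct.pack("<I", value)
    overflowed = False
except struct.error:
    overflowed = True
assert overflowed
```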
Interestingly, doing a bit of differential analysis on binary formats
tends to yield far more information than from human readable formats. In
some ways, working the way humans works, actually makes the job harder.
Actually, this is one of my interview questions for VLSI CAD tool
administrators. I give them a structured file (slightly modified Spice
simulator input deck) and ask them to write code to cope with it.
Those who use regexes fail--they invariably have silent failure modes
(*very* bad when your script may be a check which has to prevent a
$1 million mistake).
It's weird; every regexp library I've seen has a "match" operation that
can and does fail when it doesn't get a match. That said, trying to use
a regexp to parse a file format is an incredible pain to get right.
The problem is rarely failing to get a match. It will be getting a
false match.
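The false-match failure mode is easy to reproduce. A Python sketch, with made-up deck lines and a deliberately naive pattern: the regex reports success on a commented-out card and returns the wrong data with no error at all.

```python
import re

# A naive pattern for a SPICE-style ".model" card; the deck lines are invented.
pattern = re.compile(r"\.model\s+(\w+)")

good = ".model nch nmos level=49"
trap = "* commented out: .model old_nch nmos level=49"

# The intended match works...
assert pattern.search(good).group(1) == "nch"
# ...but the commented-out line also "matches", silently:
assert pattern.search(trap).group(1) == "old_nch"
```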
The issue is the same with false positives. This is a problem with a
poorly specified regexp. Sadly you can have the same effect with some
fairly well specified SAX or DOM-based parsing operations, or for that
matter with a poorly defined grammar. Idiocy tends to trump one's best
efforts in this area.
Regardless of whether it's written in XML, you can write a grammar for
what you perceive is this undocumented format and use it to validate
data. Unfortunately, much as with a DTD or Schema that's created in a
black box scenario, you might end up with some false negatives.
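A tiny sketch of that black-box false-negative problem in Python (the card syntax and pattern are invented): a "grammar" inferred only from samples you happened to observe rejects legal input the first time a variant shows up.

```python
import re

# A "schema" inferred from observed samples of an undocumented format:
# every resistor card seen so far looked like "R<n> <node> <node> <value>".
node_line = re.compile(r"^R\d+ \d+ \d+ [0-9.]+$")

assert node_line.match("R1 1 2 4.7") is not None
# A legal-but-unseen variant (engineering suffix on the value) fails
# validation -- a false negative from the black-box grammar:
assert node_line.match("R2 2 0 4.7k") is None
```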
For binary formats, that's not really true. Writing a validator for
binary formats means bugs in the "schema" as well as the "validator".
Huh? You lost me here.
I've had to try to get Unicode data moving back and forth between Oracle
databases, message queuing software, and tools written in different
languages, and it's been my experience that each transition from one tool to
another involved some lovely compatibility issues, often even when just
using the old character-set world would have made it pretty easy.
Oh, yeah. Fortunately, most things are now speaking UTF-8.
All the Unicode in the above example was encoded in UTF-8.
Unfortunately, UTF-8 "support" tends to be a fairly loosely defined
term. ;-)
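Two byte sequences show what "loosely defined" means in practice: overlong encodings (Java's modified UTF-8 writes U+0000 as two bytes) and CESU-8 surrogate pairs. Strict decoders reject both; plenty of tools that claim UTF-8 support let them through. A Python sketch:

```python
# Two byte sequences that real-world "UTF-8" decoders disagree on:
overlong_nul = b"\xc0\x80"  # Java's modified UTF-8 writes U+0000 this way
cesu8_pair = b"\xed\xa0\x81\xed\xb0\x80"  # CESU-8 surrogate pair for U+10400

rejected = []
for blob in (overlong_nul, cesu8_pair):
    try:
        blob.decode("utf-8")
        rejected.append(False)
    except UnicodeDecodeError:
        rejected.append(True)

# CPython's strict decoder refuses both; many other decoders do not,
# and interop breaks wherever two tools disagree.
assert rejected == [True, True]
```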
I am surprised that there isn't a Boost substitute for C++ String that
is fully Unicode compliant. C++ STL String has lots of idiocies.
Actually std::basic_string&lt;uint32_t&gt; works about as fine as anything for
UTF-32 Unicode strings. When C++ was first getting standardized, everyone
thought std::basic_string&lt;wchar_t&gt; would provide perfect Unicode support (and it
did, for revisions of Unicode back then). Converting to and from the
native character set is left to locales, and so it becomes a platform
specific thing. Ironically in many ways C++ tends to be the best
language to work with Unicode stuff, because its lack of a universal
string library tends to mean components are written to make it easy to
move between whatever_string_component_is_using and whatever Unicode
solution you've decided on.
<googles>
Yecch. It looks like C++ hasn't made any progress.
The ICU4C library is actually one of the most complete implementations
of Unicode support that I've seen anywhere. I'll ask again: what
programming language are you using which has got Unicode down so well?
I'd like to use it.
Again, I think you give XML too much credit. If you want to design a
format that is extensible, it's not hard to do it.
Actually, it is. Your parser has to parse generally. Most people who
design a format invariably create a parser with specific assumptions
because "it's easier". Later, they can't change that because "we have
all this existing data".
That is a problem with idiots, but the rest of the population learns
from these experiences and builds grammars with support for extensions
at a later date. There are all kinds of examples of this in just the SMTP
protocol that's making this discussion possible.
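The SMTP case is a good one: EHLO replies were deliberately specified so a client can carry unknown extension keywords instead of choking on them. A rough Python sketch of that extension-point idea (the reply lines and the made-up keyword are invented):

```python
# An EHLO-style multiline reply; the server lines are invented for this sketch.
reply = [
    "250-mail.example.com",
    "250-SIZE 14680064",
    "250-STARTTLS",
    "250-X-FUTURE-EXTENSION foo bar",  # unknown today, legal tomorrow
    "250 HELP",
]

offered = {}
for line in reply[1:]:
    keyword, _, params = line[4:].partition(" ")
    offered[keyword] = params  # unknown keywords are kept, not rejected

assert offered["SIZE"] == "14680064"
assert "X-FUTURE-EXTENSION" in offered  # the grammar absorbed the extension
```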
Using XML forces the use of general parsing early on.
I've seen no evidence of that. If anything, the evidence has been to the
contrary.
Especially since small parsing jobs tend to use DOM to start, since it
can normally be rendered directly into an in-memory tree data
structure.
As you yourself acknowledged, small parsing jobs tend to start off using
regexps, and from there they fork to SAX or DOM depending on the needs
of the app. Having an in-memory tree data structure really doesn't help
much with making your format extensible, though...
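For reference, the in-memory tree in question is just what a DOM parser hands back; a minimal Python sketch with an invented two-element document:

```python
from xml.dom.minidom import parseString

# DOM renders the whole document into an in-memory tree up front;
# the document and tag names here are invented for illustration.
dom = parseString("<deck><card name='a'/><card name='b'/></deck>")
names = [c.getAttribute("name") for c in dom.getElementsByTagName("card")]
assert names == ["a", "b"]
```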
--Chris
--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-lpsg