On Mon, Feb 28, 2011 at 02:49:06PM -0300, Gustavo Sverzut Barbieri wrote:
> How many HTML pages do you see declaring new entities?  Of course,
> parsing HTML with it is better done using the SAX interface so you can
> handle close-tags automatically, as most people don't close things like
> <br> or <img>.

Actually, <br> and <img> are not closed in HTML. HTML follows the SGML
semantics for tags without content, which is different from the XML-based dialects.

> >> going with JSON tends to be a much simpler option...
> 
> No, if you have a choice, go with EET; it's much simpler and more efficient.

For things you have control over and don't plan to exchange with anyone
else: sure.

> >> That doesn't work either. XML can't be parsed encoding neutral. Consider
> >> documents in shift_jis for this. If you implement a fallback path to
> >> handle all well formed XML documents using a full blown parser, you
> >> haven't saved anything in terms of code complexity and the request for a
> >> benchmark made in this thread is completely valid to justify the
> >> *additional* complexity.
> 
> Check out: /usr/share/hal/fdi/*/*.fdi  and tell me what difference it
> would make.

How is that relevant? This attitude is exactly the source of most
interoperability issues. "My files don't use this feature." The next
person is going to use this XML parser because it is fast (hopefully)
and simple, but for a different set of files. Oops.

> That's my problem with XML people, they can't tell the difference
> between theory and reality. In theory you can build all kinds of
> corner cases to prove me wrong, but reality shows that we can do just
> fine for what we need.

Congratulations. You have just demonstrated the attitude everyone
condemned during the browser wars: why do a proper implementation if you
can get away with the 80% that makes your own test set work? If you want
to use it only for FDO files, get them to restrict the specification to a
well-defined subset of XML. You don't even have the excuse that there is
no ready-made XML parser: if you use fontconfig, you already have one
pulled in. I am not an XML advocate, but I do care about people taking
stupid shortcuts. History consistently tells us that such assumptions
are almost always broken at some point.

> Reality is that you just need to find < and >, with the exception of
> <![CDATA[ ... ]]>. Most people don't even use this cdata case. Most
> files, although declared as UTF-8 are actually ASCII, with non-ASCII
> converted to entities/escaped. If you can find out some case that
> providing real UTF-8 strings would break it, then I'll care to fix it.

Sorry, but I do have real-world XML files that use UTF-8, non-ASCII
encodings, and entities (both for plain characters and for more than that).

> Again, any real use case? As for entities, checking for them is more
> harm than good:
>    - you waste time looking for them;
>    - you need to allocate memory to write the resulting bytes;
>    - you now have a new problem: which encoding should I write to? If
> the document is in encoding ISO-8859-1, you'd need to convert it to
> UTF-8 before doing entities? But what if user wants to keep in
> ISO-8859-1? Do you convert back? What to do with unsupported chars in
> this set?
>    - how about if your presentation handles entities for you? Like
> Evas/Edje/Elementary? You did all of the above for what good?

Variant 1: input is ISO-8859-1, program wants ISO-8859-1: no recoding
needed. Entities not representable in ISO-8859-1 are invalid and should
be flagged as errors.

Variant 2: input is ISO-8859-1, program wants UTF-8: recoding of the
input on the fly. Entities not representable as a valid Unicode code
point are invalid and should be flagged as errors.

Variant 3: input is UTF-8, program wants UTF-8: no recoding, just
consistency checks. Entities not representable as a valid Unicode code
point are invalid and should be flagged as errors.

Variant 4: input is UTF-8, program wants ISO-8859-1: recoding of the
input on the fly. Entities or input characters not representable in
ISO-8859-1 should be flagged as errors.
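
For variants 2 and 4, the recoding step is little more than pushing the
input through iconv(3) and treating conversion failures as parse errors.
A minimal sketch, assuming a standalone helper (the name and signature
are made up for illustration, not taken from any existing parser):

/* Recode a chunk of input into the encoding the application asked for,
 * and treat anything not representable there as a hard error instead of
 * silently dropping or mangling it.  Illustrative only. */
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Returns 0 on success, -1 if the input contains an invalid sequence or
 * something the target encoding cannot represent. */
static int
recode_chunk(const char *from_enc, const char *to_enc,
             const char *in, size_t in_len,
             char *out, size_t out_size, size_t *out_len)
{
   iconv_t cd = iconv_open(to_enc, from_enc);
   if (cd == (iconv_t)-1) return -1;

   char *inp = (char *)in;          /* iconv's prototype is not const-clean */
   char *outp = out;
   size_t inleft = in_len, outleft = out_size;

   /* iconv() fails with EILSEQ on an invalid input sequence (and, at
    * least with glibc, on characters the target encoding cannot
    * represent) -- exactly the "flag as error" case above. */
   if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
     {
        iconv_close(cd);
        return -1;                  /* EILSEQ, EINVAL or E2BIG */
     }

   *out_len = out_size - outleft;
   iconv_close(cd);
   return 0;
}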

Not doing the validation in the XML parser typically just results in
continuing to process arbitrary garbage. An example of why this is bad is
the NUL handling in X.509 certificates, to pick something that has
actually been used to commit fraud in the real world. If you are lucky,
you run into an interface that actually validates and throws an error,
but by then you have lost all the position information needed for a
proper error message. Or you have to duplicate the input validation logic
in the application. Great simplification again.
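
To make the position point concrete, here is a hypothetical sketch of the
bookkeeping a validating parser can do almost for free while scanning the
input; nothing here comes from a real parser API:

#include <stdio.h>

struct pos { unsigned line, col; };   /* start at line 1, col 0 */

/* Returns 0 if the byte is acceptable, -1 on an embedded NUL
 * (the same trick used against X.509 common-name checks). */
static int
check_byte(unsigned char c, struct pos *p)
{
   if (c == '\n') { p->line++; p->col = 0; return 0; }
   p->col++;
   if (c == 0x00)
     {
        fprintf(stderr, "parse error: NUL byte at line %u, column %u\n",
                p->line, p->col);
        return -1;
     }
   return 0;
}

An application-side check that only sees the extracted string cannot
report anything better than "somewhere in this document".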

> Most of the times we'll be reading configuration files with it. Or
> results of RPC-XML calls. Usually you'll know for sure fields you
> could have them and what to replace. Example: if you're reading
> something that you'll turn into URL, then just for that field you can
> convert to %AB convention instead of converting to UTF-8 and then %AB
> format.

You know, you just named the one big reason why this is a bad idea: you
think it is fine to interface with arbitrary XML implementations using
something that doesn't understand XML. How do you convert something to
URL syntax without knowing the source encoding? Do you just assume that
the remote location uses the same one?
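
To make that concrete: percent-encoding is defined over bytes, so the
same character yields different URL forms depending on the encoding you
start from; ISO-8859-1 "é" (0xE9) becomes "%E9" while its UTF-8 form
becomes "%C3%A9". A rough sketch, assuming the field has already been
normalized to UTF-8 (the function name is hypothetical):

#include <string.h>

/* Percent-encode a UTF-8 string per RFC 3986; 'out' must hold at least
 * 3 * strlen(utf8) + 1 bytes.  Illustrative only. */
static void
percent_encode_utf8(const unsigned char *utf8, char *out)
{
   static const char hex[] = "0123456789ABCDEF";
   for (; *utf8; utf8++)
     {
        unsigned char c = *utf8;
        /* RFC 3986 unreserved characters pass through untouched. */
        if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
            (c >= '0' && c <= '9') || strchr("-._~", c))
          *out++ = c;
        else
          {
             *out++ = '%';
             *out++ = hex[c >> 4];
             *out++ = hex[c & 0x0F];
          }
     }
   *out = '\0';
}

Without knowing the source encoding, the "normalize to UTF-8" step is
guesswork, and so is the resulting URL.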

Joerg
