On Mon, Feb 28, 2011 at 3:43 PM, Joerg Sonnenberger
<jo...@britannica.bec.de> wrote:
> On Mon, Feb 28, 2011 at 02:49:06PM -0300, Gustavo Sverzut Barbieri wrote:
>> How many HTML pages do you see declaring new entities?  Of course,
>> when parsing HTML with it, it's better to use the SAX interface so
>> you can handle close tags automatically, since most people don't
>> close things like <br> or <img>.
>
> Actually, <br> and <img> are not closed in HTML. HTML uses SGML
> semantics for tags without content. This is different from the
> XML-based dialects.
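(For illustration of the void-element point: a SAX-style handler can
special-case HTML's tags-without-content with a small table. A minimal
sketch in C; the names here are hypothetical, not EFL or Efreet API:

#include <strings.h>

/* HTML void elements: they never take content and never get a </...>
 * close tag, so a SAX-style open-tag handler can emit the matching
 * close event immediately. */
static const char *const html_void_elements[] = {
   "area", "base", "br", "col", "hr", "img", "input",
   "link", "meta", "param"
};

static int
html_is_void_element(const char *name)
{
   size_t i;
   for (i = 0; i < sizeof(html_void_elements) / sizeof(html_void_elements[0]); i++)
     if (strcasecmp(name, html_void_elements[i]) == 0)
       return 1;
   return 0;
}

In the open-tag callback, a handler would check this table and close
the element right away instead of waiting for a close tag that will
never come.)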
>
>> >> going with JSON tends to be a much simpler option...
>>
>> No, if you have a choice, go with EET; it's much simpler and more efficient.
>
> For things you have control over and don't plan to exchange with anyone
> else: sure.
>
>> >> That doesn't work either. XML can't be parsed encoding-neutral. Consider
>> >> documents in shift_jis, for instance. If you implement a fallback path to
>> >> handle all well-formed XML documents using a full-blown parser, you
>> >> haven't saved anything in terms of code complexity, and the request for a
>> >> benchmark made in this thread is completely valid to justify the
>> >> *additional* complexity.
>>
>> Check out: /usr/share/hal/fdi/*/*.fdi  and tell me what difference it
>> would make.
>
> How is that relevant? This attitude is exactly the source of the
> majority of all interoperability issues. "My files don't use this
> feature." The next one is going to use this XML parser because it is
> fast (hopefully) and simple for a different set of files. Oops.

I don't get the "oops".  That is its whole purpose: you can try it,
and if you don't like it, or it doesn't fulfill your needs, feel free
to try another.

Right now we're FORCED to use bloatware, as we have no choice, or to
do as in Efreet and write our own parsers. Efreet's parser is good
enough; nobody has ever complained that it failed to understand the
contents. I'm just proposing a lighter version of it (it does zero
allocations) that can be used outside of Efreet, such as in Edje
TEXTBLOCK.
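To make the "zero allocations" point concrete, something like this
(a sketch with hypothetical names, not the actual Efreet code): each
call hands back a pointer and length into the caller's buffer, so
nothing is copied or malloc()ed.

#include <stddef.h>
#include <string.h>

/* A token is either a tag ("<...>") or the text between tags; data
 * points into the input buffer, never into parser-owned memory. */
struct xml_token {
   const char *data;
   size_t      len;
   int         is_tag;
};

static int
xml_next_token(const char *buf, size_t buflen, size_t *off,
               struct xml_token *tok)
{
   const char *p = buf + *off;
   size_t left = buflen - *off;
   const char *end;

   if (left == 0) return 0;
   if (*p == '<')
     {
        end = memchr(p, '>', left);
        if (!end) return 0; /* truncated tag */
        tok->data = p;
        tok->len = (size_t)(end - p) + 1;
        tok->is_tag = 1;
     }
   else
     {
        end = memchr(p, '<', left);
        if (!end) end = buf + buflen;
        tok->data = p;
        tok->len = (size_t)(end - p);
        tok->is_tag = 0;
     }
   *off += tok->len;
   return 1;
}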

Given ideas from #edevelop, I'm considering a key-value parser as
well. It could take a spec and parse RFC822-like messages, as well as
INI files compatible with FDO and Windows. (The spec would cover the
key-value and list delimiters, the comment separator, whether
multiline values are accepted, and so on; see the sketch below.)
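Roughly, the spec could look like this (a sketch, hypothetical names;
one parser, different specs per dialect):

/* Describes a key-value dialect; the same parser could handle RFC822
 * headers, FDO .desktop files, Windows INI, etc. */
struct keyval_spec {
   char kv_delim;        /* '=' for INI, ':' for RFC822 */
   char list_delim;      /* ';' or ',' for list values, 0 if none */
   char comment_char;    /* '#' or ';', 0 if no comments */
   int  allow_sections;  /* "[section]" headers (INI style) */
   int  allow_multiline; /* continuation lines (RFC822 folding) */
};

static const struct keyval_spec fdo_desktop_spec = {
   .kv_delim = '=', .list_delim = ';', .comment_char = '#',
   .allow_sections = 1, .allow_multiline = 0
};

static const struct keyval_spec rfc822_spec = {
   .kv_delim = ':', .list_delim = ',', .comment_char = 0,
   .allow_sections = 0, .allow_multiline = 1
};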


>> That's my problem with XML people: they can't tell the difference
>> between theory and reality. In theory you can build all kinds of
>> corner cases to prove me wrong, but reality shows that we can do just
>> fine for what we need.
>
> Congratulations. You have just shown the attitude everyone condemned
> during the browser wars: why do a proper implementation if you can get
> away with the 80% needed to make your own test set work? If you want
> to use it only for FDO files, get them to restrict the specification
> to a well-defined subset of XML. You don't even have the excuse that
> there is no ready-made XML parser: if you use fontconfig, you already
> have one pulled in.

And it is a bugger. Fontconfig falls into the category that David
Seikel was mentioning. Man, it is slow, and it is bogus. It adds to
our load time for no good reason. It is a core component of every
Linux desktop, yet people ship such crap.

It is widely known, and demonstrated by Rasterman, that the
FreeDesktop.org guys are not concerned with performance issues. I have
yet to see a reasonable specification from them... all they do is
write some crap, stick it into GNOME, and bingo, it's a spec. Forcing
PNG for thumbnails, XML for fontconfig, and INI for desktop/service
files is nonsense. (Yes, you can save a cache of the compiled XML and
INI files, and we do have to do that...)


> I am not an XML advocate. But I do care about people taking stupid
> shortcuts. History consistently tells us that such assumptions are
> almost always broken at some point.

:-) That's impossible to fix; it's human nature. If you must blame
someone, blame the people who stupidly added those 20% of useless bits
to the spec.

And really, these "oh cool" but useless features are what makes our
web damn slow. I work on WebKit (the EFL port) and you can't imagine
how much work it is to figure things out, from XML (which may be
broken) to insane CSS rules... people who like CSS have never tried to
understand the work browsers must do to compute the resulting style!


>> The reality is that you just need to find < and >, with the exception
>> of <![CDATA[ ... ]]>. Most people don't even use the CDATA case. Most
>> files, although declared as UTF-8, are actually ASCII, with non-ASCII
>> characters converted to entities/escaped. If you can find a case where
>> providing real UTF-8 strings would break it, then I'll care enough to
>> fix it.
>
> Sorry, but I do have real-world XML files using UTF-8, non-ASCII
> encodings, and entities (both for plain characters and more).

Care to give their paths on your system? I want to check them on mine.
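(For reference, the CDATA exception I mentioned above amounts to one
special case when scanning for the end of a tag. A sketch, not the
proposed API:

#include <string.h>

/* When a tag starts with "<![CDATA[", the scanner must look for the
 * "]]>" terminator instead of the first '>', since CDATA content may
 * legally contain '>' characters. Returns one past the closing
 * delimiter, or NULL if the input is truncated. */
static const char *
xml_tag_end(const char *p, const char *buf_end)
{
   const char *gt;

   if ((size_t)(buf_end - p) >= 9 && memcmp(p, "<![CDATA[", 9) == 0)
     {
        const char *q;
        for (q = p + 9; q + 3 <= buf_end; q++)
          if (memcmp(q, "]]>", 3) == 0)
            return q + 3;
        return NULL; /* truncated CDATA section */
     }
   gt = memchr(p, '>', (size_t)(buf_end - p));
   return gt ? gt + 1 : NULL;
}
)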


>> Again, any real use case? As for entities, checking for them does more
>> harm than good:
>>    - you waste time looking for them;
>>    - you need to allocate memory to write the resulting bytes;
>>    - you now have a new problem: which encoding should you write to? If
>> the document is in ISO-8859-1, do you need to convert it to UTF-8
>> before handling entities? But what if the user wants to keep
>> ISO-8859-1? Do you convert back? What do you do with characters that
>> set cannot represent?
>>    - what if your presentation layer handles entities for you, like
>> Evas/Edje/Elementary? Then you did all of the above for what?
>
> Variant 1: input is ISO-8859-1, program wants ISO-8859-1: no recoding
> needed. Entities not representable in ISO-8859-1 are invalid and should
> be flagged as errors.
>
> Variant 2: input is ISO-8859-1, program wants UTF-8: recoding of the
> input on the fly. Entities not representable as a valid Unicode code
> point are invalid and should be flagged as errors.
>
> Variant 3: input is UTF-8, program wants UTF-8: no recoding, just
> consistency checks. Entities not representable as a valid Unicode code
> point are invalid and should be flagged as errors.
>
> Variant 4: input is UTF-8, program wants ISO-8859-1: recoding of the
> input on the fly. Entities or input characters not representable in
> ISO-8859-1 should be flagged as errors.
>
> Not doing the validation in the XML parser typically just results in
> continuing to process arbitrary garbage. An example of why this is bad
> is the NUL validation issue in X.509 certificates, to pick something
> that has been used to commit fraud in the real world. If you are lucky,
> you run against an interface that actually validates and throws an
> error, and you have just lost all the position information for a proper
> error message. Or you have to duplicate the input validation logic in
> the application. Great simplification again.
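(For what it's worth, variant 2 above is only a few lines with POSIX
iconv(3). A sketch, with error handling trimmed:

#include <iconv.h>
#include <stdlib.h>

/* Recode an ISO-8859-1 buffer to UTF-8. The worst case doubles the
 * size: every byte >= 0x80 becomes a two-byte UTF-8 sequence. Returns
 * a malloc()ed NUL-terminated string, or NULL on failure. */
static char *
latin1_to_utf8(const char *in, size_t inlen)
{
   iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
   size_t outlen = inlen * 2 + 1;
   char *out, *outp;
   char *inp = (char *)in;

   if (cd == (iconv_t)-1) return NULL;
   out = outp = malloc(outlen);
   if (out && iconv(cd, &inp, &inlen, &outp, &outlen) == (size_t)-1)
     {
        free(out);
        out = NULL;
     }
   else if (out)
     *outp = '\0';
   iconv_close(cd);
   return out;
}
)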
>
>> Most of the time we'll be reading configuration files with it, or the
>> results of XML-RPC calls. Usually you'll know for sure which fields
>> could contain them and what to replace. Example: if you're reading
>> something that you'll turn into a URL, then just for that field you
>> can convert straight to the %AB convention, instead of converting to
>> UTF-8 first and then to %AB format.
>
> You know, you just mentioned the one big reason why it is a bad idea.
> You think it is a good idea to interface with arbitrary XML
> implementations using something that doesn't understand XML. How do you
> convert something to URL syntax without knowing the source encoding? Do
> you assume that the remote location uses the same one?

Yes you can. If you're working against a server, you can assume these
things, because servers focus on reality and not theory. You don't see
Twitter changing charsets randomly just because the "<?xml" declaration
could tell you which one is in use. They stick with one, and keep that
one for a looooong time.
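(And if someone did want to honor the declaration, pulling the
encoding name out of the "<?xml" prologue is trivial for the
ASCII-compatible cases. A sketch, hypothetical names; it ignores
UTF-16/UTF-32, which need BOM or byte-pattern sniffing first:

#include <string.h>

/* Extract the encoding name from a declaration such as
 * <?xml version="1.0" encoding="shift_jis"?>. The buffer must be
 * NUL-terminated. Returns 1 on success, 0 otherwise. */
static int
xml_sniff_encoding(const char *buf, char *enc, size_t enclen)
{
   const char *p, *q;

   if (strncmp(buf, "<?xml", 5) != 0) return 0;
   p = strstr(buf, "encoding=");
   if (!p) return 0;
   p += 9;
   if (*p != '"' && *p != '\'') return 0;
   q = strchr(p + 1, *p); /* matching quote */
   if (!q || (size_t)(q - p - 1) >= enclen) return 0;
   memcpy(enc, p + 1, (size_t)(q - p - 1));
   enc[q - p - 1] = '\0';
   return 1;
}
)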


-- 
Gustavo Sverzut Barbieri
http://profusion.mobi embedded systems
--------------------------------------
MSN: barbi...@gmail.com
Skype: gsbarbieri
Mobile: +55 (19) 9225-2202
