On Mon, Feb 28, 2011 at 02:49:06PM -0300, Gustavo Sverzut Barbieri wrote:
> How many HTML pages do you see declaring new entities? Of course
> parsing HTML with it it's better to do using the SAX so you can handle
> close-tags automatically as most people don't close things like <br>
> or <img>.

Actually, <br> and <img> are not closed in HTML. It has SGML semantics
for tags without content. This is different from the XML-based dialects.

> >> going with JSON tends to be a much simpler option...
>
> No, if you have a choice, go with EET it's much simpler and efficient.

For things you have control over and don't plan to exchange with anyone
else: sure.

> >> That doesn't work either. XML can't be parsed encoding neutral. Consider
> >> documents in shift_jis for this. If you implement a fallback path to
> >> handle all well formed XML documents using a full blown parser, you
> >> haven't saved anything in terms of code complexity and the request for a
> >> benchmark made in this thread is completely valid to justify the
> >> *additional* complexity.
>
> Check out: /usr/share/hal/fdi/*/*.fdi and tell me what difference it
> would make.

How is that relevant? This attitude is exactly the source of the
majority of interoperability issues. "My files don't use this feature."
The next one is going to use this XML parser because it is fast
(hopefully) and simple for a different set of files. Oops.

> That's my problem with XML people, they can't tell the difference
> between theory and reality. In theory you can build all kinds of
> corner cases to prove me wrong, but reality shows that we can do just
> fine for what we need.

Congratulations. You have just shown the attitude everyone condemned
during the browser wars. Why do a proper implementation if you can get
away with 80% that makes your own test set work? If you want to use it
only for FDO files, get them to restrict the specification to a well
defined subset of XML. You don't even have the excuse that there is no
ready-made XML parser: if you use fontconfig, you already have one
pulled in.
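For scale, this is roughly what leaning on the parser that is already
there looks like with expat (the parser fontconfig links against). It
is only an illustrative sketch, not code from this thread; the document
and the element/attribute names are invented, loosely modeled on the
.fdi files mentioned above:

/* Minimal expat sketch; build with: cc example.c -lexpat */
#include <stdio.h>
#include <string.h>
#include <expat.h>

static void
start_element(void *data, const XML_Char *name, const XML_Char **attrs)
{
    (void)data;
    /* By the time this runs, expat has already dealt with the document
     * encoding and with entity references; names, attributes and
     * character data all arrive as UTF-8. */
    printf("element: %s\n", name);
    for (; attrs && attrs[0]; attrs += 2)
        printf("  %s = %s\n", attrs[0], attrs[1]);
}

static void
char_data(void *data, const XML_Char *s, int len)
{
    (void)data;
    printf("text: %.*s\n", len, s);
}

int
main(void)
{
    /* Made-up input: declared as ISO-8859-1 and using entities on
     * purpose, so the "real ASCII anyway" shortcut does not apply. */
    const char *doc =
        "<?xml version='1.0' encoding='ISO-8859-1'?>"
        "<match key=\"info.product\">caf\xe9 &amp; &#x263A;</match>";
    XML_Parser p = XML_ParserCreate(NULL);

    XML_SetStartElementHandler(p, start_element);
    XML_SetCharacterDataHandler(p, char_data);

    if (XML_Parse(p, doc, (int)strlen(doc), 1) == XML_STATUS_ERROR)
        fprintf(stderr, "parse error at line %lu: %s\n",
                (unsigned long)XML_GetCurrentLineNumber(p),
                XML_ErrorString(XML_GetErrorCode(p)));

    XML_ParserFree(p);
    return 0;
}

Character references, the predefined entities, CDATA sections and the
encodings expat knows natively (UTF-8, UTF-16, ISO-8859-1, US-ASCII)
are all resolved before the callbacks run, and you get position
information for free when something is malformed.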
I am not an XML advocate. But I do care about people taking stupid
shortcuts. History consistently tells us that such assumptions are
almost always broken at some point.

> Reality is that you just need to find < and >, with the exception of
> <![CDATA[ ... ]]>. Most people don't even use this cdata case. Most
> files, although declared as UTF-8 are actually ASCII, with non-ASCII
> converted to entities/escaped. If you can find out some case that
> providing real UTF-8 strings would break it, then I'll care to fix it.

Sorry, but I do have real world XML files using UTF-8, non-ASCII
encodings, and entities (both for plain characters and more).

> Again, any real use case? As for entities, checking for them is more
> harm than good:
> - you waste time looking for them;
> - you need to allocate memory to write the resulting bytes;
> - you now have a new problem: which encoding should I write to? If
> the document is in encoding ISO-8859-1, you'd need to convert it to
> UTF-8 before doing entities? But what if user wants to keep in
> ISO-8859-1? Do you convert back? What to do with unsupported chars in
> this set?
> - how about if your presentation handles entities for you? Like
> Evas/Edje/Elementary? You did all of the above for what good?

Variant 1: input is ISO-8859-1, program wants ISO-8859-1: no recoding
needed. Entities not representable in ISO-8859-1 are invalid and should
be flagged as an error.

Variant 2: input is ISO-8859-1, program wants UTF-8: recoding of the
input on the fly. Entities not representable as a valid Unicode code
point are invalid and should be flagged as an error.

Variant 3: input is UTF-8, program wants UTF-8: no recoding, just
consistency checks. Entities not representable as a valid Unicode code
point are invalid and should be flagged as an error.

Variant 4: input is UTF-8, program wants ISO-8859-1: recoding of the
input on the fly. Entities or input characters not representable in
ISO-8859-1 should be flagged as an error.

Not doing the validation in the XML parser typically just results in
continuing to process arbitrary garbage. An example of why this is bad
is the handling of NUL bytes in X.509 certificate names, just to pick
something that was used to commit fraud in the real world. If you are
lucky, you run against an interface that actually validates and throws
an error, but then you have lost all the position information for a
proper error message. Or you have to duplicate the input validation
logic in the application. Great simplification again.
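Since the cost of checking entities keeps coming up: for numeric
character references the whole validation is a handful of comparisons.
A rough sketch for illustration only (decode_charref is a made-up
helper; named entities and the actual recoding are left out):

#include <stdio.h>
#include <stdlib.h>

/* "ref" is the text between '&' and ';', e.g. "#65" or "#x263A".
 * Returns the code point, or -1 if the reference is not a valid
 * Unicode scalar value. */
static long
decode_charref(const char *ref)
{
    char *end;
    long cp;

    if (ref[0] != '#')
        return -1;                 /* named entities not handled here */
    if (ref[1] == 'x' || ref[1] == 'X')
        cp = strtol(ref + 2, &end, 16);
    else
        cp = strtol(ref + 1, &end, 10);
    if (*end != '\0' || end == ref + 1)
        return -1;                 /* no digits or trailing junk */

    /* Reject NUL, the surrogate range and anything beyond U+10FFFF.
     * (Full XML 1.0 validity also excludes most C0 controls; omitted
     * to keep the sketch short.) */
    if (cp <= 0 || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return -1;
    return cp;
}

int
main(void)
{
    const char *tests[] = { "#65", "#x263A", "#xD800", "#x110000", "lt" };
    for (size_t i = 0; i < sizeof(tests) / sizeof(tests[0]); i++)
        printf("%-10s -> %ld\n", tests[i], decode_charref(tests[i]));
    return 0;
}

The ISO-8859-1 variants above add exactly one more comparison on top of
this (anything above 0xFF is not representable). None of it requires
guessing what the application wants; it only requires not passing
garbage through.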
> Most of the times we'll be reading configuration files with it. Or
> results of RPC-XML calls. Usually you'll know for sure fields you
> could have them and what to replace. Example: if you're reading
> something that you'll turn into URL, then just for that field you can
> convert to %AB convention instead of converting to UTF-8 and then %AB
> format.

You know, you just mentioned the one big reason why it is a bad idea.
You think it is a good idea to interface with arbitrary XML
implementations using something that doesn't understand XML. How do you
convert something to URL syntax without knowing the source encoding? Do
you assume that the remote location uses the same one?

Joerg