---------- Forwarded message ----------
From: Brian Matherly <pez4br...@yahoo.com>
Date: Sun, Jan 22, 2012 at 9:04 PM
Subject: Re: [Mlt-devel] Xml output is currenty broken
To: Dan Dennedy <d...@dennedy.org>

Dan,

>> OK, av_dict (and its predecessor av_metadata from avformat.h) has
>> neither rules nor an API for character encoding. Demuxers quite often
>> simply pass up whatever appears in the file. And the offending
>> character in this example is a 0x0b vertical tab. So, I think we need
>> some string filter function. I am looking around for a good practice
>> regarding string filtering. Then, the next question is whether to
>> filter the output of av_dict or filter the input to xmlNewTextChild()
>> and xmlNewProp().
>>
>
> We may need to add a UTF-8 filter in some places, and we can use iconv
> for that. However, XML has a more restricted set of characters:
>
> Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
>          | [#x10000-#x10FFFF] /* any Unicode character, excluding the
>          surrogate blocks, FFFE, and FFFF. */
>
> So, we need something specific to XML here instead of a UTF-8 filter
> on the av_dict output. I put in a quick fix. I will come back to the
> wchar solution soon.

I think a function to perform a quick and dirty sanitization for all av_dict output would be appropriate. It is pretty important that MLT is able to import its own files. So I would suggest: when in doubt, pass the string through the sanitizer function before serializing it. It might be considerate to print a message to stderr when a character needs to be replaced. Or, I've seen implementations that replace invalid characters with "?" to make it obvious that a character was replaced.

> I thought of a new policy to add to docs/policies.txt and somewhere in
> doxygen comments:
>
> The standard for strings in MLT is UTF-8. Applications must provide
> valid UTF-8. That means, melt would be responsible for converting from
> environment locale's encoding to UTF-8.

And melt would be responsible for converting from UTF-8 to the environment locale's encoding for output.

> For dependent libraries, if
> their API or documentation discloses the character encoding we need to
> convert it to UTF-8 (and filtered by iconv along the way), and if it
> is unknown (e.g. av_dict), then we should assume UTF-8 and filter it.
> Comments accepted.

Makes perfect sense to me. Should we also require a particular encoding (UTF-8) for XML files? Or is it OK to accept any XML file encoding?

I thought that I could add some sanity checking to producer_xml.c and maybe even fix broken input XML encoding. But libxml doesn't make that simple. I did come up with some additional error reporting so that if the XML fails to load, at least it doesn't fail silently:
https://github.com/pez4brian/mlt/commit/5fc4d19e81e658e1236da5b82457cfa8b428a705
Feel free to pull it if you like.

For JB's example file, it prints an error like this:

XML parse error: PCDATA invalid Char value 11 row: 21 col: 20
XML parse error: PCDATA invalid Char value 11 row: 30 col: 20

~BM

P.S. You only responded to me. I'm not sure if you meant to reply to the list.

> --
> +-DRD-+
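
To make the sanitizer idea concrete, here is a rough sketch of what such a function might look like. It is not the quick fix that went into MLT, and the names xml_char_ok(), utf8_decode() and xml_sanitize() are made up for illustration. It assumes the input is a NUL-terminated string that is supposed to be UTF-8, and it replaces any malformed byte or any character outside the XML Char production with "?", working in place since the result can never be longer than the input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Nonzero if cp is allowed by the XML 1.0 Char production quoted above. */
static int xml_char_ok(unsigned long cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

/* Decode one UTF-8 sequence at s. Returns its length (1-4) and stores the
 * code point in *cp, or returns 0 if the bytes are malformed or overlong. */
static size_t utf8_decode(const unsigned char *s, unsigned long *cp)
{
    size_t len, i;
    unsigned long min;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; *cp = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; *cp = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; *cp = s[0] & 0x07; min = 0x10000; }
    else return 0;

    for (i = 1; i < len; i++) {
        /* A terminating NUL is not a continuation byte, so we stop here
         * before reading past the end of the string. */
        if ((s[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return *cp >= min ? len : 0;
}

/* Hypothetical sanitizer: rewrites str in place and returns the number of
 * replacements made, so the caller can decide whether to warn on stderr. */
static int xml_sanitize(char *str)
{
    unsigned char *in, *out;
    int replaced = 0;

    assert(str != NULL);
    in = out = (unsigned char *) str;
    while (*in) {
        unsigned long cp = 0;
        size_t len = utf8_decode(in, &cp);
        if (len > 0 && xml_char_ok(cp)) {
            memmove(out, in, len);   /* out never runs ahead of in */
            out += len;
            in += len;
        } else {
            *out++ = '?';            /* one '?' per bad character or byte */
            in += len > 0 ? len : 1;
            replaced++;
        }
    }
    *out = '\0';
    return replaced;
}
```

With this sketch, a title containing the 0x0b vertical tab from JB's file would come out with a "?" in its place, and the nonzero return value gives the caller a hook for the stderr message. Whether it gets called on the av_dict output or just before xmlNewTextChild() and xmlNewProp() is still the open question from above.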