On Sun, Jan 22, 2012 at 9:56 PM, Dan Dennedy <d...@dennedy.org> wrote:
> ---------- Forwarded message ----------
> From: Brian Matherly <pez4br...@yahoo.com>
> Date: Sun, Jan 22, 2012 at 9:04 PM
> Subject: Re: [Mlt-devel] Xml output is currenty broken
> To: Dan Dennedy <d...@dennedy.org>
>
>
> Dan,
>
>
>>> OK, av_dict (and its predecessor av_metadata from avformat.h) has
>>> neither rules nor an API for character encoding. Demuxers quite often
>>> simply pass up whatever appears in the file. And the offending
>>> character in this example is a 0x0b vertical tab. So, I think we need
>>> some string filter function. I am looking around for a good practice
>>> regarding string filtering. Then, the next question is whether to
>>> filter the output of av_dict or filter the input to xmlNewTextChild()
>>> and xmlNewProp().
>>>
>>
>> We may need to add a UTF-8 filter in some places, and we can use iconv
>> for that. However, XML has a more restricted set of characters:
>>
>> Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
>>          | [#x10000-#x10FFFF]
>>          /* any Unicode character, excluding the
>>             surrogate blocks, FFFE, and FFFF. */
>>
>> So, we need something specific to XML here instead of a UTF-8 filter
>> on the av_dict output. I put in a quick fix. I will come back to the
>> wchar solution soon.
>
> I think a function to perform a quick and dirty sanitization for all
> av_dict output would be appropriate. It is pretty important that MLT
> is able to import its own files. So I would suggest: when in doubt,
> pass the string through the sanitizer function before serializing it.

It does now.

> It might be considerate to print a message to stderr when a character
> needs to be replaced. Or, I've seen implementations that replace
> invalid characters with "?" to make it obvious that a character was
> replaced.

I do not like '?' for three reasons: it also occurs elsewhere (unlike
the boxy [?] I think some Microsoft products display), the invalid
characters are often not printable (vertical tab), and for filtering
UTF-8 (as opposed to UTF-8 for XML) I plan to use iconv_open() with
its "//IGNORE" option to strip them.

>> I thought of a new policy to add to docs/policies.txt and somewhere in
>> doxygen comments:
>>
>> The standard for strings in MLT is UTF-8. Applications must provide
>> valid UTF-8. That means melt would be responsible for converting from
>> the environment locale's encoding to UTF-8.
>
> And melt would be responsible for converting from UTF-8 to the
> environment locale's encoding for output.

Makes sense.

>> For dependent libraries, if their API or documentation discloses the
>> character encoding, we need to convert it to UTF-8 (and filter it
>> with iconv along the way), and if it is unknown (e.g. av_dict), then
>> we should assume UTF-8 and filter it.
>> Comments accepted.
>
> Makes perfect sense to me. Should we also require a particular
> encoding (UTF-8) for XML files? Or is it OK to accept any XML file
> encoding?

No, because XML files declare their encoding, have auto-detection [1],
and fall back to UTF-8. Then, libxml converts everything to UTF-8 [2].

[1] http://www.opentag.com/xfaq_enc.htm#enc_default
[2] http://xmlsoft.org/encoding.html#internal

> I thought that I could add some sanity checking to producer_xml.c and
> maybe even fix broken input XML encoding. But libxml doesn't make that
> simple.
>
> I did come up with some additional error reporting so that if the XML
> fails to load, at least it doesn't fail silently:
> https://github.com/pez4brian/mlt/commit/5fc4d19e81e658e1236da5b82457cfa8b428a705
> Feel free to pull it if you like.
>
> For JB's example file, it prints an error like this:
> XML parse error: PCDATA invalid Char value 11
> row: 21 col: 20
> XML parse error: PCDATA invalid Char value 11
> row: 30 col: 20

I definitely like the contribution, but it should use mlt_log_warning()
instead of fprintf(stderr). The other fprintf(stderr) calls you see
there are simply because that file was not yet updated to use mlt_log.
My policy is that any new code, or code changed around a legacy
approach, should adopt the new approach instead of deferring to
consistency with the older code in a file. (Same goes for other things
like no longer comparing pointers with NULL.) However, a single commit
should not include both a logic change and a comprehensive update to
the new approach.

--
+-DRD-+
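
For reference, the XML-specific filter discussed in this thread (dropping
anything outside the XML 1.0 Char production, such as the 0x0b vertical
tab) could be sketched roughly as follows. This is only an illustration
and not the fix that went into MLT: xml_sanitize(), is_xml_char() and
decode_utf8() are hypothetical names, the input is assumed to be a
NUL-terminated string that is supposed to be UTF-8, and offending
characters are dropped silently rather than replaced or logged.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: true if code point c matches the XML 1.0 Char production. */
static int is_xml_char(uint32_t c)
{
    return c == 0x9 || c == 0xA || c == 0xD ||
           (c >= 0x20 && c <= 0xD7FF) ||
           (c >= 0xE000 && c <= 0xFFFD) ||
           (c >= 0x10000 && c <= 0x10FFFF);
}

/* Hypothetical helper: decode one UTF-8 sequence starting at s.
 * Returns its length in bytes (1..4) and stores the code point in *out,
 * or returns 0 if the sequence is malformed, truncated or overlong. */
static size_t decode_utf8(const unsigned char *s, uint32_t *out)
{
    uint32_t c;
    size_t n, i;

    if (s[0] < 0x80) { *out = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { c = s[0] & 0x1F; n = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { c = s[0] & 0x0F; n = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { c = s[0] & 0x07; n = 4; }
    else return 0; /* stray continuation byte or invalid lead byte */

    for (i = 1; i < n; i++) {
        /* A NUL terminator fails this check, so we never read past the end. */
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong encodings. */
    if ((n == 2 && c < 0x80) || (n == 3 && c < 0x800) || (n == 4 && c < 0x10000))
        return 0;

    *out = c;
    return n;
}

/* Returns a newly allocated copy of in with malformed UTF-8 bytes and
 * characters not allowed in XML 1.0 removed. Caller must free() it.
 * Returns NULL if allocation fails. */
char *xml_sanitize(const char *in)
{
    const unsigned char *s = (const unsigned char *) in;
    char *out = malloc(strlen(in) + 1); /* output is never longer than input */
    char *p = out;

    if (!out) return NULL;
    while (*s) {
        uint32_t c;
        size_t n = decode_utf8(s, &c);
        if (n == 0) { s++; continue; }        /* skip one malformed byte */
        if (is_xml_char(c)) { memcpy(p, s, n); p += n; }
        s += n;                               /* valid or not, move past it */
    }
    *p = '\0';
    return out;
}
```

A caller would pass each av_dict value through xml_sanitize() before
handing it to xmlNewTextChild() or xmlNewProp(), and free() the result
afterwards. Dropping the character instead of substituting '?' follows
the preference stated above.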