---------- Forwarded message ----------
From: Brian Matherly <pez4br...@yahoo.com>
Date: Sun, Jan 22, 2012 at 9:04 PM
Subject: Re: [Mlt-devel] Xml output is currenty broken
To: Dan Dennedy <d...@dennedy.org>


Dan,


>>  OK, av_dict (and its predecessor av_metadata from avformat.h) has
>>  neither rules nor an API for character encoding. Demuxers quite often
>>  simply pass up whatever appears in the file. And the offending
>>  character in this example is a 0x0b vertical tab. So, I think we need
>>  some string filter function. I am looking around for a good practice
>>  regarding string filtering. Then, the next question is whether to
>>  filter the output of av_dict or filter the input to xmlNewTextChild()
>>  and xmlNewProp().
>>
>
> We may need to add a UTF-8 filter in some places, and we can use iconv
> for that. However, XML has a more restricted set of characters:
>
> Char       ::=      #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
> | [#x10000-#x10FFFF]  /* any Unicode character, excluding the
> surrogate blocks, FFFE, and FFFF. */
>
> So, we need something specific to XML here instead of a UTF-8 filter
> on the av_dict output. I put in a quick fix. I will come back to the
> wchar solution soon.
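
For reference, the Char production quoted above can be written as a small predicate over Unicode code points. This is only a sketch: the name is_xml_char is made up for illustration and is not an existing MLT or libxml function, and it assumes the caller has already decoded the string into code points.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not part of MLT or libxml): returns true when the
 * code point cp is allowed by the XML 1.0 Char production quoted above. */
static bool is_xml_char(uint32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)      /* excludes the surrogate block */
        || (cp >= 0xE000 && cp <= 0xFFFD)    /* excludes FFFE and FFFF */
        || (cp >= 0x10000 && cp <= 0x10FFFF);
}
```

With this, the vertical tab from the example (0x0b) is rejected, while tab, line feed, and carriage return pass.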

I think a function to perform a quick and dirty sanitization for all
av_dict output would be appropriate. It is pretty important that MLT
is able to import its own files. So I would suggest: when in doubt,
pass the string through the sanitizer function before serializing it.
It might be considerate to print a message to stderr when a character
needs to be replaced. Or, I've seen implementations that replace
invalid characters with "?" to make it obvious that a character was
replaced.
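
To make the suggestion concrete, here is a rough sketch of what such a quick and dirty sanitizer might look like. The function name sanitize_for_xml is hypothetical and this is not the fix that was committed. It assumes the input is already valid UTF-8 and only handles the ASCII control characters that XML 1.0 forbids (such as the 0x0b in this case); it does not catch surrogates, FFFE, or FFFF, which would need real UTF-8 decoding.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not an MLT API): replaces, in place, every ASCII
 * control byte that XML 1.0 forbids with '?' and reports each replacement
 * on stderr. Returns the number of bytes replaced.
 * Assumption: s is valid UTF-8, so bytes below 0x20 only ever appear as
 * standalone characters and multibyte sequences can be left untouched. */
static size_t sanitize_for_xml(char *s)
{
    size_t replaced = 0;
    for (unsigned char *p = (unsigned char *) s; *p; p++) {
        if (*p < 0x20 && *p != 0x09 && *p != 0x0A && *p != 0x0D) {
            fprintf(stderr, "sanitize_for_xml: replaced invalid char 0x%02x\n",
                    (unsigned) *p);
            *p = '?';
            replaced++;
        }
    }
    return replaced;
}
```

The idea would be to call this on a private copy of each av_dict value before handing it to xmlNewTextChild() or xmlNewProp(), so the original metadata is not modified.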

> I thought of a new policy to add to docs/policies.txt and somewhere in
> doxygen comments:
>
> The standard for strings in MLT is UTF-8. Applications must provide
> valid UTF-8. That means, melt would be responsible for converting from
> environment locale's encoding to UTF-8.

And melt would be responsible for converting from UTF-8 to environment
locale's encoding for output.

> For dependent libraries, if
> their API or documentation discloses the character encoding, we need to
> convert it to UTF-8 (filtering it through iconv along the way), and if it
> is unknown (e.g. av_dict), then we should assume UTF-8 and filter it.
> Comments accepted.

Makes perfect sense to me. Should we also require a particular
encoding (UTF-8) for XML files? Or is it OK to accept any XML file
encoding?

I thought that I could add some sanity checking to producer_xml.c and
maybe even fix broken input XML encoding. But libxml doesn't make that
simple.

I did come up with some additional error reporting so that if the xml
fails to load, at least it doesn't fail silently:
https://github.com/pez4brian/mlt/commit/5fc4d19e81e658e1236da5b82457cfa8b428a705
Feel free to pull it if you like.

For JB's example file, it prints an error like this:
XML parse error: PCDATA invalid Char value 11
    row: 21    col: 20
XML parse error: PCDATA invalid Char value 11
    row: 30    col: 20

~BM

P.S. You only responded to me. I'm not sure if you meant to reply to the list.



-- 
+-DRD-+
