[flexcoders] E4X normalize() + CDATA = invalid XML, data loss

flexcoders . list Tue, 11 Nov 2008 05:03:00 -0800

I'm getting some very strange results from E4X and normalize() when
working with CDATA text nodes, especially when the text itself contains
the end-of-CDATA marker "]]>".


Consider the following code:

------------------------------------------------------------
var x:XML = <test />;
// Split the literal "]]>" across two CDATA sections so that neither
// section contains the end-of-CDATA marker.
x.appendChild('<![CDATA[ test1 ]]]]>');
x.appendChild('<![CDATA[> test2 ]]>');

trace ('--- before normalize (string value) ---');
trace (x.toString());
trace ('--- before normalize (full xml)     ---');
trace (x.toXMLString());
trace ("\n");

// Merge the adjacent text nodes.
x.normalize();

trace ('--- after normalize (string value) ---');
trace (x.toString());
trace ('--- after normalize (full xml)     ---');
trace (x.toXMLString());
trace ("\n");

// Round-trip: serialize to a string, then parse it again.
var xAsString:String = x.toXMLString();
x = XML(xAsString);

trace ('--- after reparse (string value) ---');
trace (x.toString());
trace ('--- after reparse (full xml)     ---');
trace (x.toXMLString());
trace ("\n");
------------------------------------------------------------

Here's the output when using Flash Player 10.0.12.36 Debug on Linux:

  --- before normalize (string value) ---
   test1 ]]> test2 
  --- before normalize (full xml)     ---
  <test>
    <![CDATA[ test1 ]]]]>
    <![CDATA[> test2 ]]>
  </test>


  --- after normalize (string value) ---
   test1 ]]> test2 
  --- after normalize (full xml)     ---
  <test><![CDATA[ test1 ]]> test2 ]]></test>


  --- after reparse (string value) ---
   test1 test2 ]]>
  --- after reparse (full xml)     ---
  <test>
    <![CDATA[ test1 ]]>
    test2 ]]&gt;
  </test>


Note how the call to .normalize() merges the text of <test> into a
single malformed CDATA node containing an unescaped "]]>" end-of-CDATA
marker. The resulting XML is not well-formed and is rejected by other
XML parsers, such as libxml2's xmllint:

  badxml.xml:1: parser error : Sequence ']]>' not allowed in content
    <test><![CDATA[ test1 ]]> test2 ]]></test>
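
The same check can be reproduced outside Flash. The following is a minimal
sketch, assuming Python's standard xml.etree.ElementTree (expat-based) as a
stand-in for "other XML parsers"; it is not the Flash parser, and the helper
name text_or_none is made up for this illustration. It feeds the parser the
normalized output from the trace above, and, for comparison, the same two
CDATA sections written back to back with nothing between them.

```python
import xml.etree.ElementTree as ET

# Output of toXMLString() after normalize(), copied from the trace above.
NORMALIZED = "<test><![CDATA[ test1 ]]> test2 ]]></test>"

# The same two CDATA sections written back to back, with no whitespace
# between them.
SPLIT = "<test><![CDATA[ test1 ]]]]><![CDATA[> test2 ]]></test>"


def text_or_none(xml_source):
    """Return the root element's text, or None if the input does not parse."""
    try:
        return ET.fromstring(xml_source).text
    except ET.ParseError:
        return None


normalized_text = text_or_none(NORMALIZED)
split_text = text_or_none(SPLIT)

print(repr(normalized_text))  # None: rejected as not well-formed
print(repr(split_text))       # ' test1 ]]> test2 '
```

Under these assumptions, only the back-to-back form round-trips the original
string value; the normalized form is rejected outright.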

Re-parsing this XML with Flash's E4X does not throw an error, but the
result no longer represents the original data. It appears that the
parser leaves "CDATA mode" at the first end-of-CDATA marker (between
'test1' and 'test2') and then enters some sort of "lenient parser mode"
in which it "helpfully" converts the bare '>' after test2 into &gt;.
As a result, the string value of <test>'s text node differs from its
original contents (compare 'after normalize (string value)' to 'after
reparse (string value)').

On the other hand, not calling .normalize() leaves a newline "\n"
character (and possibly indentation) between the two original CDATA
nodes in the serialized XML. Other XML readers then usually return
"]]\n>" or, worse, "]]\n          >" as the text.
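
To illustrate, here is a minimal sketch, again assuming Python's standard
xml.etree.ElementTree as a stand-in for another XML reader. The input is the
"before normalize (full xml)" trace from above, with the two-space
indentation shown there.

```python
import xml.etree.ElementTree as ET

# toXMLString() output before normalize(), copied from the trace above:
# each CDATA section sits on its own indented line.
PRETTY_PRINTED = (
    "<test>\n"
    "  <![CDATA[ test1 ]]]]>\n"
    "  <![CDATA[> test2 ]]>\n"
    "</test>"
)

pretty_text = ET.fromstring(PRETTY_PRINTED).text

# The newline and indentation between the two sections become part of the
# element's text, so the original "]]>" sequence is split apart.
print(repr(pretty_text))     # '\n   test1 ]]\n  > test2 \n'
print("]]>" in pretty_text)  # False
```

The input is well-formed, so the parser accepts it, but the whitespace that
the serializer added between the sections ends up inside the data.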



Does anyone have experience with how to properly embed strings
containing "xml-ish" content with E4X?
