Re: [Generateds-users] improper CDATA handling.

Dave Kuhlman Mon, 09 Feb 2015 10:58:22 -0800

On Mon, Feb 09, 2015 at 03:37:48PM +0000, George David wrote:
> Hi Dave,
> 
> I was hoping to reply almost immediately to the email I sent, but it was
> blocked awaiting approval. Ah well...
> 
> After I sent the email I tracked down the problem of encoding the actual
> CDATA tags to the quote_xml function. I added the following regex which
> allows for single and multiline text:
> 
> regEx = re.compile(r"<!\[CDATA\[.*\]\]>", re.DOTALL)
> match = regEx.match(s1)
> if match:
>     # it's wrapped in CDATA tags, no need to encode it
>     return s1
> 
> I'll submit that later on today unless you have any objections to it.
> 
> I've done a lot of research regarding CDATA and even started a
> conversation on the lxml mailing list, which can be seen here:
> 
> https://mailman-mail5.webfaction.com/pipermail/lxml/2015-February/007409.html
> 
> At any rate, one interesting tidbit of information was the reply to my query
> about how to determine if a node had CDATA tags:
> 
> I wouldn't bother. CDATA[] is more of a convenience work-around when you
> are manually editing XML. In generated XML, it's not very useful.
> 
> 
> I have looked around but I haven't come up with anything definitive on
> this. I did notice that if I didn't use any CDATA tags, the javascript code
> arrived just fine on the other end once it was decoded. I found this
> FAQ that was originally started by w3 according to their home page. It says
> that you should "almost never" use CDATA.
> 
> http://xml.silmaril.ie/cdata.html
> 
> What are your thoughts on this? I'm leaning towards not worrying about the
> CDATA tags that lxml stripped out since in practice they really don't seem
> to matter.


George,

OK, this time I really will remember to attach those two files: (1)
the patch file that adds the regular expressions for CDATA tags and
(2) the bit of demo code for testing the effect of using
``strip_cdata=False`` with an lxml parser.

Thanks for all the helpful thoughts on this.

And, I'll take a look at your regular expression to see how it
differs from mine, and whether there are cases one or the other
might not handle.  So, thanks for that.
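
As a rough sketch of the kind of case worth checking (not the code from either patch): the quoted pattern is applied with ``match()``, which anchors only at the start of the string, so a value that begins with a CDATA section and then continues with ordinary text would also be returned unescaped. The function names and sample strings below are hypothetical, and ``fullmatch()`` assumes Python 3.4 or later.

```python
import re

# Pattern from the quoted message: value wrapped in a CDATA section.
cdata_wrapped = re.compile(r"<!\[CDATA\[.*\]\]>", re.DOTALL)


def is_wrapped_prefix(s):
    # match() anchors only at the start; trailing text is not checked.
    return cdata_wrapped.match(s) is not None


def is_wrapped_fully(s):
    # Hypothetical stricter check: fullmatch() also anchors at the end.
    return cdata_wrapped.fullmatch(s) is not None


samples = [
    "<![CDATA[if (x < 1) {}]]>",    # fully wrapped
    "<![CDATA[a < b]]> and c < d",  # CDATA followed by plain text
    "note: <![CDATA[a < b]]>",      # plain text before CDATA
]

for s in samples:
    print(is_wrapped_prefix(s), is_wrapped_fully(s), repr(s))
```

Even the stricter check treats a string that starts and ends with CDATA markers as a single section, because ``.*`` is greedy, so text between two separate sections would still go unescaped.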

I also lean towards stripping out CDATA tags when they do occur and
escaping characters ("&", "<", and ">") that occur inside them.

One way of thinking about this is to ask what you want the generated
API to give you?  For example, when you have a child element, as in
the sample schema you provided, defined as:

    <xs:element name="script"  type="xs:string"/>

In the enclosing class, generateDS.py generates:

    def get_script(self): return self.script

    def set_script(self, script): self.script = script


When you execute:

    print my_object.get_script()

Do you expect or want to see any CDATA tags?  I suspect most users
would not.  If the CDATA tags could appear, users would have the burden
of stripping them out in many situations.  The fact that lxml strips
them out when you use:

    script.text    # where script is an lxml Element

suggests that the lxml development team thought so, too.
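
A minimal illustration of the same idea, assuming the standard library's ``xml.etree.ElementTree`` as a stand-in for lxml (the element names and sample content are made up for the example): the parser hands back the character data without the CDATA markers, and the result is the same text you get from the entity-escaped form.

```python
import xml.etree.ElementTree as ET

# The same content written two ways: inside a CDATA section, and with
# the special characters escaped as entities.
doc_cdata = (
    '<cdata><script><![CDATA[if (x < 1 && y > 2) {}]]></script></cdata>'
)
doc_escaped = (
    '<cdata><script>if (x &lt; 1 &amp;&amp; y &gt; 2) {}</script></cdata>'
)

text_from_cdata = ET.fromstring(doc_cdata).find('script').text
text_from_escaped = ET.fromstring(doc_escaped).find('script').text

# Neither value contains "<![CDATA[" or "]]>"; both hold the raw script.
print(text_from_cdata)
print(text_from_cdata == text_from_escaped)
```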

Another thought -- As seems to be suggested by the FAQ you linked to
(thanks for that), CDATA is a convenience to enable you to avoid
having to escape "<" and "&" characters, for example in scripts and
code as in your example.  But, the generated code contains a
function (quote_xml()) to help you with that.  Perhaps we just need
to fix that function so that it does the right thing when there are
CDATA tags, which in more general cases might not surround the
entire content.
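
One possible shape for such a fix, offered only as a sketch and not as the generated ``quote_xml()``: split the string on complete CDATA sections, pass those through untouched, and escape everything else. The names ``quote_xml_sketch`` and ``escape_text`` are hypothetical, and the sketch assumes the input is already a string.

```python
import re

# Matches one complete CDATA section, non-greedily, across lines.
# The capturing group makes re.split() keep the sections in its result.
CDATA_SECTION = re.compile(r'(<!\[CDATA\[.*?\]\]>)', re.DOTALL)


def escape_text(s):
    # "&" must be replaced first so the other entities are not re-escaped.
    return s.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')


def quote_xml_sketch(in_str):
    parts = CDATA_SECTION.split(in_str)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            # Odd indexes hold the captured CDATA sections: keep verbatim.
            out.append(part)
        else:
            # Even indexes hold the text between sections: escape it.
            out.append(escape_text(part))
    return ''.join(out)


print(quote_xml_sketch('a < b <![CDATA[x < 1 && y]]> c > d'))
```

Because the split is per section, this handles CDATA that covers only part of the content, as well as several sections in one value.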

As I said, I'll look at your change and will re-consider mine, and
then will ask you to take a look at it before making it (semi-)
permanent.

Dave

> 
> As far as the multiple namespaces are concerned, I think it was a
> combination of adding the xml catalog support and the --one-file-per-xsd
> support, though I guess the xml catalog is not strictly necessary since the
> schema locations may all be defined in the import/include statements. I
> only tested it using the xml catalog since that's what we use.
> 
> Honestly I don't recall completely, but I believe that prior to my commits
> there was some support already in generateDS for multiple namespaces, but I
> found places where it wasn't correct. In the project I work on, we have
> over 400 unique xsd files each with their own namespace. Almost all of them
> import another XSD. In one case I see that we have a single XSD importing 9
> other ones. Most of our XSDs are handled properly by generateDS when using
> the --one-file-per-xsd option but we do encounter some problems like the
> one reported here:
> 
> http://sourceforge.net/p/generateds/mailman/message/33012215/
> 
> We probably have about 10 or so problems with the generated python code and
> I believe this is the only one that is related to namespaces.
> 
> On Sun Feb 08 2015 at 8:38:32 PM Dave Kuhlman <[email protected]>
> wrote:
> 
> > On Fri, Feb 06, 2015 at 01:05:42AM +0000, George David wrote:
> > > Hi Dave,
> > >
> > > I created a xsd that has an element called script. The intent is to allow
> > > users to send us javascript that is encoded with CDATA tags.
> > >
> > > In the attached files you can see that I set the script variable as
> > follows:
> > > cdataObj = Cdata()
> > >
> > > script='''<![CDATA[
> > >     var x, text;
> > >
> > >     // Get the value of the input field with id="numb"
> > >     x = document.getElementById("numb  one").value;
> > >
> > >     // If x is Not a Number or less than one or greater than 10
> > >     if (isNaN(x) || x < 1 || x > 10) {
> > >         text = "Input not valid";
> > >     } else {
> > >         text = "Input OK";
> > >     }
> > >     document.getElementById("demo").innerHTML = text;
> > > ]]>'''
> > > cdataObj.set_script(script)
> > >
> > > I exported it:
> > >
> > > cdataObj.export(sys.stdout, 0, name_='cdata')
> > >
> > > And got the following:
> > >
> > > <cdata:cdata xmlns:cdata="urn:cdata">
> > >     <cdata:script>&lt;![CDATA[
> > >     var x, text;
> > >
> > >     // Get the value of the input field with id="numb"
> > >     x = document.getElementById("numb  one").value;
> > >
> > >     // If x is Not a Number or less than one or greater than 10
> > >     if (isNaN(x) || x &lt; 1 || x &gt; 10) {
> > >         text = "Input not valid";
> > >     } else {
> > >         text = "Input OK";
> > >     }
> > >     document.getElementById("demo").innerHTML = text;
> > > ]]&gt;</cdata:script>
> > > </cdata:cdata>
> > >
> > > Note that the CDATA wrappers have been encoded <![CDATA[ has been changed
> > > to &lt;![CDATA[ and ]]> has been changed to ]]&gt;
> >
> > George,
> >
> > Good to hear from you again.
> >
> > One solution to the above is to use a more intelligent replacement.
> > The attached patch uses the re module and two regular expressions to
> > replace (escape) "<" and ">" without replacing "<![CDATA[" and "]]>".
> >
> > >
> > > Also notice that the < and > signs in the java script have also been
> > > encoded. I believe there should be code to check for the CDATA tags and
> > not
> > > xml encode it if they exist. I'll try to track this down in the code but
> > I
> > > wanted to make sure this wasn't done on purpose.
> > >
> > > There is another problem with CDATA. If I create an xml string with CDATA
> > > and parse it like this:
> >
> > Re: the missing CDATA wrappers:
> >
> > The problem is that when the generated code uses lxml to parse an
> > XML instance doc, lxml strips away the "<![CDATA[" and "]]>".  I
> > don't believe that we can even tell that they were there in the
> > first place.  The attached script (cdata_demo.py) attempts to
> > demonstrate this.
> >
> > So, after that XML instance doc has been parsed, there is no way to
> > tell that the CDATA tags were there in the first place.
> >
> > Wait ... I did one more Web search ...
> >
> > It's even the case that lxml has a special provision for this issue.
> > I found this: http://lxml.de/api.html#cdata
> >
> > (It's incredible what kind of hidden information you can find with a
> > Web search engine.  You should try one sometime.  But, seriously, ...)
> >
> > However, when you use ``element.text`` to capture the text data, the
> > CDATA tags are still missing, even though when you use
> > ``etree.tostring(some_element)`` they are there.
> >
> > I haven't figured out how to deal with this, yet.  I'll think a bit
> > more on it.
> >
> > If you can think of a work-around for this, please let me know.
> >
> > On an unrelated subject -- generateDS.py does not handle multiple
> > namespaces in the same XML schema, in particular when ``<xs:import
> > ...>`` is used.  I've had several reports about this.  If I recall
> > correctly, you contributed the code that implements
> > --one-file-per-xsd.  I'm wondering if that might be helpful in some
> > of these situations.  If you have any comments or suggestions about
> > this, I'd be interested in hearing them.
> >
> > And, have you had any experience with lxml.objectify?
> > (http://lxml.de/objectify.html)  I'm wondering whether it might
> > solve some of these problems (in particular the namespaces and CDATA
> > issues) better than generateDS.py does.  Maybe we can learn
> > something from it.
> >
> > More later.
> >
> > Dave
> >
> > >
> > > xml='''
> > > <cdata:cdata xmlns:cdata="urn:cdata">
> > >     <cdata:script><![CDATA[
> > >     var x, text;
> > >
> > >     // Get the value of the input field with id="numb"
> > >     x = document.getElementById("numb  one").value;
> > >
> > >     // If x is Not a Number or less than one or greater than 10
> > >     if (isNaN(x) || x &lt; 1 || x &gt; 10) {
> > >         text = "Input not valid";
> > >     } else {
> > >         text = "Input OK";
> > >     }
> > >     document.getElementById("demo").innerHTML = text;
> > > ]]></cdata:script>
> > > </cdata:cdata>
> > > '''
> > > cdata.parseString(xml)
> > >
> > > It incorrectly strips out the CDATA tags:
> > >
> > > parseString spits out xml with the CDATA tags removed:
> > >
> > > <?xml version="1.0" ?>
> > > <cdata:cdata xmlns:cdata="urn:cdata">
> > >     <cdata:script>
> > >     var x, text;
> > >
> > >     // Get the value of the input field with id="numb"
> > >     x = document.getElementById("numb  one").value;
> > >
> > >     // If x is Not a Number or less than one or greater than 10
> > >     if (isNaN(x) || x &amp;lt; 1 || x &amp;gt; 10) {
> > >         text = "Input not valid";
> > >     } else {
> > >         text = "Input OK";
> > >     }
> > >     document.getElementById("demo").innerHTML = text;
> > > </cdata:script>
> > > </cdata:cdata>
> > >
> > >
> > > And on printing the script specifically, I also don't have CDATA tags
> > > anymore and the < and > are xml encoded.
> > >
> > > print cdataObj.get_script()
> > >
> > >     var x, text;
> > >
> > >     // Get the value of the input field with id="numb"
> > >     x = document.getElementById("numb  one").value;
> > >
> > >     // If x is Not a Number or less than one or greater than 10
> > >     if (isNaN(x) || x &lt; 1 || x &gt; 10) {
> > >         text = "Input not valid";
> > >     } else {
> > >         text = "Input OK";
> > >     }
> > >     document.getElementById("demo").innerHTML = text;
> > >
> > >
> > > I'll see if I can track this down also. If you could give me a hint of
> > > where to look that would be helpful.
> > >
> > > Thanks,
> > > George
> >
> >
> > --
> >
> > Dave Kuhlman
> > http://www.davekuhlman.org
> >

-- 

Dave Kuhlman
http://www.davekuhlman.org

from lxml import etree


xml = '''
<cdata:cdata xmlns:cdata="urn:cdata">
    <cdata:script><![CDATA[
    var x, text;

    // Get the value of the input field with id="numb"
    x = document.getElementById("numb  one").value;

    // If x is Not a Number or less than one or greater than 10
    if (isNaN(x) || x &lt; 1 || x &gt; 10) {
        text = "Input not valid";
    } else {
        text = "Input OK";
    }
    document.getElementById("demo").innerHTML = text;
]]></cdata:script>
</cdata:cdata>
'''


def test():
    parser = etree.XMLParser(strip_cdata=False)
    element1 = etree.XML(xml, parser)
    element2 = element1[0]
    print xml
    print '=' * 60
    print 'Using etree.tostring(element):'
    print '------------------------------'
    print etree.tostring(element1)
    print '=' * 60
    print 'Using element.text:'
    print '-------------------'
    print element2.text


if __name__ == '__main__':
    test()

diff -r 9d56e3e892f0 generateDS.py
--- a/generateDS.py     Thu Jan 29 11:07:15 2015 -0800
+++ b/generateDS.py     Fri Feb 06 16:14:25 2015 -0800
@@ -4993,6 +4993,8 @@
 Tag_pattern_ = re_.compile(r'({.*})?(.*)')
 String_cleanup_pat_ = re_.compile(r"[\\n\\r\\s]+")
 Namespace_extract_pat_ = re_.compile(r'{(.*)}(.*)')
+Lessthan_escape_pat_ = re_.compile(r'<[^!]')
+Greaterthan_escape_pat_ = re_.compile(r'[^\]][^\]]>')
 
 #
 # Support/utility functions.
@@ -5011,8 +5013,8 @@
     s1 = (isinstance(inStr, basestring) and inStr or
           '%%s' %% inStr)
     s1 = s1.replace('&', '&amp;')
-    s1 = s1.replace('<', '&lt;')
-    s1 = s1.replace('>', '&gt;')
+    s1, count = Lessthan_escape_pat_.subn('&lt;', s1)
+    s1, count = Greaterthan_escape_pat_.subn('&gt;', s1)
     return s1
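
One thing to check in patterns of the form used in the patch above: ``<[^!]`` and ``[^\]][^\]]>`` match the neighbouring characters as well as the "<" or ">", so ``subn()`` replaces those neighbours too. The sketch below compares them with zero-width lookaround variants. The lookaround patterns and the function names are hypothetical, not part of the patch, and like the patch they still escape "<" and ">" inside the CDATA section.

```python
import re

# Patterns as in the patch: each also matches neighbouring characters.
lt_consuming = re.compile(r'<[^!]')
gt_consuming = re.compile(r'[^\]][^\]]>')

# Hypothetical zero-width variants: only the "<" or ">" itself is matched.
lt_lookahead = re.compile(r'<(?!!\[CDATA\[)')
gt_lookbehind = re.compile(r'(?<!\]\])>')


def escape_consuming(s):
    s = lt_consuming.sub('&lt;', s)
    return gt_consuming.sub('&gt;', s)


def escape_lookaround(s):
    s = lt_lookahead.sub('&lt;', s)
    return gt_lookbehind.sub('&gt;', s)


sample = '<![CDATA[x < 1 || x > 10]]>'
print(escape_consuming(sample))
print(escape_lookaround(sample))
```

With this sample, the consuming patterns drop the space after "<" and the "x " before ">", while the lookaround variants leave the surrounding text intact. Both keep the CDATA markers.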

_______________________________________________
generateds-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/generateds-users
