Re: [Generateds-users] improper CDATA handling.

George David Mon, 09 Feb 2015 07:38:39 -0800

Hi Dave,

I was hoping to reply almost immediately to the email I sent, but it was
blocked awaiting approval. Ah well...


After I sent the email I tracked down the problem of encoding the actual
CDATA tags to the quote_xml function. I added the following regex which
allows for single and multiline text:

regEx = re.compile(r"<!\[CDATA\[.*\]\]>", re.DOTALL)
match = regEx.match(s1)
if match:
    # it's wrapped in CDATA tags, no need to encode it
    return s1

I'll submit that later on today unless you have any objections to it.
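
For illustration only, here is a minimal sketch of a variant that also copes
with strings where a CDATA section is only part of the text. The names
quote_xml_keep_cdata and escape_text are made up for this sketch; this is not
the actual generateDS code, and it assumes CDATA sections are never nested:

```python
import re

# Matches one complete CDATA section, including multiline content.
CDATA_RE = re.compile(r"<!\[CDATA\[.*?\]\]>", re.DOTALL)


def escape_text(text):
    """Escape the three characters that are special in XML text."""
    return (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;"))


def quote_xml_keep_cdata(s1):
    """Escape s1 for XML output, copying CDATA sections through verbatim."""
    parts = []
    pos = 0
    for match in CDATA_RE.finditer(s1):
        # Escape the plain text that precedes this CDATA section.
        parts.append(escape_text(s1[pos:match.start()]))
        # Copy the CDATA section itself unchanged.
        parts.append(match.group(0))
        pos = match.end()
    # Escape whatever follows the last CDATA section.
    parts.append(escape_text(s1[pos:]))
    return "".join(parts)
```

Text outside any CDATA section is still escaped, so a stray "<" next to a
CDATA block would not slip through unencoded.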

I've done a lot of research regarding CDATA tags and even started a
conversation on the lxml mailing list, which can be seen here:

https://mailman-mail5.webfaction.com/pipermail/lxml/2015-February/007409.html

At any rate, one interesting tidbit of information was the reply to my query
about how to determine whether a node had CDATA tags:

I wouldn't bother. CDATA[] is more of a convenience work-around when you
are manually editing XML. In generated XML, it's not very useful.


I have looked around but I haven't come up with anything definitive on
this. I did notice that if I didn't use any CDATA tags, the JavaScript code
arrived just fine on the other end once it was decoded. I found this FAQ,
which according to its home page was originally started by the W3C. It says
that you should "almost never" use CDATA.

http://xml.silmaril.ie/cdata.html

What are your thoughts on this? I'm leaning towards not worrying about the
CDATA tags that lxml stripped out since in practice they really don't seem
to matter.
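
As a small sanity check of that claim, the sketch below parses the same script
text once wrapped in CDATA and once entity-escaped. It uses the standard
library's xml.etree.ElementTree rather than lxml or generateDS (my
substitution, just to keep it self-contained), and the sample script is made
up:

```python
import xml.etree.ElementTree as ET

# The same script text, once wrapped in CDATA and once entity-escaped.
with_cdata = "<script><![CDATA[if (x < 1 || x > 10) { ok(); }]]></script>"
escaped = "<script>if (x &lt; 1 || x &gt; 10) { ok(); }</script>"

text_from_cdata = ET.fromstring(with_cdata).text
text_from_escaped = ET.fromstring(escaped).text

# Both forms decode to the identical character data.
print(text_from_cdata == text_from_escaped)
print(text_from_cdata)
```

To the receiving parser the two documents carry the same text, so the CDATA
markers only matter if the consumer inspects the raw serialized form.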

As far as the multiple namespaces are concerned, I think it was a
combination of adding the xml catalog support and the --one-file-per-xsd
support, though I guess the xml catalog is not strictly necessary since the
schema locations may all be defined in the import/include statements. I
only tested it using the xml catalog since that's what we use.

Honestly I don't recall completely, but I believe that prior to my commits
there was some support already in generateDS for multiple namespaces, but I
found places where it wasn't correct. In the project I work on, we have
over 400 unique xsd files each with their own namespace. Almost all of them
import another XSD. In one case I see that we have a single XSD importing 9
other ones. Most of our XSDs are handled properly by generateDS when using
the --one-file-per-xsd option but we do encounter some problems like the
one reported here:

http://sourceforge.net/p/generateds/mailman/message/33012215/

We probably have about 10 or so problems with the generated python code and
I believe this is the only one that is related to namespaces.

On Sun Feb 08 2015 at 8:38:32 PM Dave Kuhlman <dkuhl...@davekuhlman.org>
wrote:

> On Fri, Feb 06, 2015 at 01:05:42AM +0000, George David wrote:
> > Hi Dave,
> >
> > I created a xsd that has an element called script. The intent is to allow
> > users to send us javascript that is encoded with CDATA tags.
> >
> > In the attached files you can see that I set the script variable as
> > follows:
> > cdataObj = Cdata()
> >
> > script='''<![CDATA[
> >     var x, text;
> >
> >     // Get the value of the input field with id="numb"
> >     x = document.getElementById("numb  one").value;
> >
> >     // If x is Not a Number or less than one or greater than 10
> >     if (isNaN(x) || x < 1 || x > 10) {
> >         text = "Input not valid";
> >     } else {
> >         text = "Input OK";
> >     }
> >     document.getElementById("demo").innerHTML = text;
> > ]]>'''
> > cdataObj.set_script(script)
> >
> > I exported it:
> >
> > cdataObj.export(sys.stdout, 0, name_='cdata')
> >
> > And got the following:
> >
> > <cdata:cdata xmlns:cdata="urn:cdata">
> >     <cdata:script>&lt;![CDATA[
> >     var x, text;
> >
> >     // Get the value of the input field with id="numb"
> >     x = document.getElementById("numb  one").value;
> >
> >     // If x is Not a Number or less than one or greater than 10
> >     if (isNaN(x) || x &lt; 1 || x &gt; 10) {
> >         text = "Input not valid";
> >     } else {
> >         text = "Input OK";
> >     }
> >     document.getElementById("demo").innerHTML = text;
> > ]]&gt;</cdata:script>
> > </cdata:cdata>
> >
> > Note that the CDATA wrappers have been encoded: <![CDATA[ has been changed
> > to &lt;![CDATA[ and ]]> has been changed to ]]&gt;
>
> George,
>
> Good to hear from you again.
>
> One solution to the above is to use a more intelligent replacement.
> The attached patch uses the re module and two regular expressions to
> replace (escape) "<" and ">" without replacing "<![CDATA[" and "]]>".
>
> >
> > Also notice that the < and > signs in the JavaScript have also been
> > encoded. I believe there should be code to check for the CDATA tags and
> > not XML-encode the text if they exist. I'll try to track this down in the
> > code but I wanted to make sure this wasn't done on purpose.
> >
> > There is another problem with CDATA. If I create an xml string with CDATA
> > and parse it like this:
>
> Re: the missing CDATA wrappers:
>
> The problem is that when the generated code uses lxml to parse an
> XML instance doc, lxml strips away the "<![CDATA[" and "]]>".  I
> don't believe that we can even tell that they were there in the
> first place.  The attached script (cdata_demo.py) attempts to
> demonstrate this.
>
> So, after that XML instance doc has been parsed, there is no way to
> tell that the CDATA tags were there in the first place.
>
> Wait ... I did one more Web search ...
>
> It's even the case that lxml has a special provision for this issue.
> I found this: http://lxml.de/api.html#cdata
>
> (It's incredible what kind of hidden information you can find with a
> Web search engine.  You should try one sometime.  But, seriously, ...)
>
> However, when you use ``element.text`` to capture the text data, the
> CDATA tags are still missing, even though when you use
> ``etree.tostring(some_element)`` they are there.
>
> I haven't figured out how to deal with this, yet.  I'll think a bit
> more on it.
>
> If you can think of a work-around for this, please let me know.
>
> On an unrelated subject -- generateDS.py does not handle multiple
> namespaces in the same XML schema, in particular when ``<xs:import
> ...>`` is used.  I've had several reports about this.  If I recall
> correctly, you contributed the code that implements
> --one-file-per-xsd.  I'm wondering if that might be helpful in some
> of these situations.  If you have any comments or suggestions about
> this, I'd be interested in hearing them.
>
> And, have you had any experience with lxml.objectify?
> (http://lxml.de/objectify.html)  I'm wondering whether it might
> solve some of these problems (in particular the namespaces and CDATA
> issues) better than generateDS.py does.  Maybe we can learn
> something from it.
>
> More later.
>
> Dave
>
> >
> > xml='''
> > <cdata:cdata xmlns:cdata="urn:cdata">
> >     <cdata:script><![CDATA[
> >     var x, text;
> >
> >     // Get the value of the input field with id="numb"
> >     x = document.getElementById("numb  one").value;
> >
> >     // If x is Not a Number or less than one or greater than 10
> >     if (isNaN(x) || x &lt; 1 || x &gt; 10) {
> >         text = "Input not valid";
> >     } else {
> >         text = "Input OK";
> >     }
> >     document.getElementById("demo").innerHTML = text;
> > ]]></cdata:script>
> > </cdata:cdata>
> > '''
> > cdata.parseString(xml)
> >
> > It incorrectly strips out the CDATA tags:
> >
> > parseString spits out xml with the CDATA tags removed.:
> >
> > <?xml version="1.0" ?>
> > <cdata:cdata xmlns:cdata="urn:cdata">
> >     <cdata:script>
> >     var x, text;
> >
> >     // Get the value of the input field with id="numb"
> >     x = document.getElementById("numb  one").value;
> >
> >     // If x is Not a Number or less than one or greater than 10
> >     if (isNaN(x) || x &amp;lt; 1 || x &amp;gt; 10) {
> >         text = "Input not valid";
> >     } else {
> >         text = "Input OK";
> >     }
> >     document.getElementById("demo").innerHTML = text;
> > </cdata:script>
> > </cdata:cdata>
> >
> >
> > And on printing the script specifically, I also don't have CDATA tags
> > anymore and the < and > are xml encoded.
> >
> > print cdataObj.get_script()
> >
> >     var x, text;
> >
> >     // Get the value of the input field with id="numb"
> >     x = document.getElementById("numb  one").value;
> >
> >     // If x is Not a Number or less than one or greater than 10
> >     if (isNaN(x) || x &lt; 1 || x &gt; 10) {
> >         text = "Input not valid";
> >     } else {
> >         text = "Input OK";
> >     }
> >     document.getElementById("demo").innerHTML = text;
> >
> >
> > I'll see if I can track this down also. If you could give me a hint of
> > where to look that would be helpful.
> >
> > Thanks,
> > George
>
>
> --
>
> Dave Kuhlman
> http://www.davekuhlman.org
>
