Re: [Generateds-users] improper CDATA handling.

George David Thu, 12 Feb 2015 07:08:22 -0800

Hi Dave,

I see on sourceforge that you responded to my last email, but I have not
received it yet. Would you resend it? There were some files you asked me to
look at.


Thanks,
George

On Mon Feb 09 2015 at 8:37:48 AM George David <[email protected]> wrote:

> Hi Dave,
>
> I was hoping to reply almost immediately to the email I sent, but it was
> blocked awaiting approval. Ah well...
>
> After I sent the email I tracked down the problem of encoding the actual
> CDATA tags to the quote_xml function. I added the following regex which
> allows for single and multiline text:
>
> regEx = re.compile(r"<!\[CDATA\[.*\]\]>", re.DOTALL)
> match = regEx.match(s1)
> if match:
>     # it's wrapped in CDATA tags, no need to encode it
>     return s1
>
> I'll submit that later on today unless you have any objections to it.
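
A minimal sketch of a more general variant of the check above, assuming a hypothetical helper named quote_xml_keep_cdata (this is not the actual generateDS quote_xml). Instead of returning the string untouched only when it starts with a CDATA section, it splits the text on CDATA sections and escapes just the parts outside them. The function name and the use of xml.sax.saxutils.escape are illustrative choices, not code from the patch.

```python
import re
from xml.sax.saxutils import escape

# Matches one complete CDATA section, non-greedily, across newlines.
# The capturing group makes re.split keep the matched sections.
CDATA_SECTION = re.compile(r"(<!\[CDATA\[.*?\]\]>)", re.DOTALL)


def quote_xml_keep_cdata(text):
    """Escape &, < and > outside CDATA sections; copy the sections verbatim."""
    parts = CDATA_SECTION.split(text)
    # With a single capturing group, the CDATA sections land at the odd
    # indexes and the surrounding text at the even indexes.
    return "".join(
        part if index % 2 else escape(part)
        for index, part in enumerate(parts)
    )
```

With this approach a string that mixes ordinary text and a CDATA section has only the ordinary text escaped, and a string with no CDATA section at all is escaped in full.
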
>
> I've done a lot of research regarding CDATA tags and even started a
> conversation on the LXML mailing list which can be seen here:
>
>
> https://mailman-mail5.webfaction.com/pipermail/lxml/2015-February/007409.html
>
> At any rate, one interesting tidbit of information was the reply to my
> query about how to determine if a node had CDATA tags:
>
> I wouldn't bother. CDATA[] is more of a convenience work-around when you
> are manually editing XML. In generated XML, it's not very useful.
>
>
> I have looked around but I haven't come up with anything definitive on
> this. I did notice that if I didn't use any CDATA tags, the JavaScript code
> arrived just fine on the other end once it was decoded. I found this
> FAQ that was originally started by w3 according to their home page. It says
> that you should "almost never" use CDATA.
>
> http://xml.silmaril.ie/cdata.html
>
> What are your thoughts on this? I'm leaning towards not worrying about the
> CDATA tags that lxml stripped out since in practice they really don't seem
> to matter.
>
> As far as the multiple namespaces are concerned, I think it was a
> combination of adding the xml catalog support and the --one-file-per-xsd
> support, though I guess the xml catalog is not strictly necessary since the
> schema locations may all be defined in the import/include statements. I
> only tested it using the xml catalog since that's what we use.
>
> Honestly I don't recall completely, but I believe that prior to my commits
> there was some support already in generateDS for multiple namespaces, but I
> found places where it wasn't correct. In the project I work on, we have
> over 400 unique xsd files each with their own namespace. Almost all of them
> import another XSD. In one case I see that we have a single XSD importing 9
> other ones. Most of our XSDs are handled properly by generateDS when using
> the --one-file-per-xsd option but we do encounter some problems like the
> one reported here:
>
> http://sourceforge.net/p/generateds/mailman/message/33012215/
>
> We probably have about 10 or so problems with the generated python code
> and I believe this is the only one that is related to namespaces.
>
> On Sun Feb 08 2015 at 8:38:32 PM Dave Kuhlman <[email protected]>
> wrote:
>
>> On Fri, Feb 06, 2015 at 01:05:42AM +0000, George David wrote:
>> > Hi Dave,
>> >
>> > I created an xsd that has an element called script. The intent is to allow
>> > users to send us JavaScript that is encoded with CDATA tags.
>> >
>> > In the attached files you can see that I set the script variable as follows:
>> > cdataObj = Cdata()
>> >
>> > script='''<![CDATA[
>> >     var x, text;
>> >
>> >     // Get the value of the input field with id="numb"
>> >     x = document.getElementById("numb  one").value;
>> >
>> >     // If x is Not a Number or less than one or greater than 10
>> >     if (isNaN(x) || x < 1 || x > 10) {
>> >         text = "Input not valid";
>> >     } else {
>> >         text = "Input OK";
>> >     }
>> >     document.getElementById("demo").innerHTML = text;
>> > ]]>'''
>> > cdataObj.set_script(script)
>> >
>> > I exported it:
>> >
>> > cdataObj.export(sys.stdout, 0, name_='cdata')
>> >
>> > And got the following:
>> >
>> > <cdata:cdata xmlns:cdata="urn:cdata">
>> >     <cdata:script>&lt;![CDATA[
>> >     var x, text;
>> >
>> >     // Get the value of the input field with id="numb"
>> >     x = document.getElementById("numb  one").value;
>> >
>> >     // If x is Not a Number or less than one or greater than 10
>> >     if (isNaN(x) || x &lt; 1 || x &gt; 10) {
>> >         text = "Input not valid";
>> >     } else {
>> >         text = "Input OK";
>> >     }
>> >     document.getElementById("demo").innerHTML = text;
>> > ]]&gt;</cdata:script>
>> > </cdata:cdata>
>> >
>> > Note that the CDATA wrappers have been encoded: <![CDATA[ has been changed
>> > to &lt;![CDATA[ and ]]> has been changed to ]]&gt;
>>
>> George,
>>
>> Good to hear from you again.
>>
>> One solution to the above is to use a more intelligent replacement.
>> The attached patch uses the re module and two regular expressions to
>> replace (escape) "<" and ">" without replacing "<![CDATA[" and "]]>".
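
The patch itself is not reproduced in this message. One way the two-regex idea might look, as a sketch with hypothetical names (LT_NOT_CDATA, GT_NOT_CDATA, escape_angle_brackets): a negative lookahead skips a "<" that opens a CDATA section, and a negative lookbehind skips a ">" that closes one. Note that this sketch protects only the markers; "<" and ">" inside the section are still escaped.

```python
import re

# "<" unless it is the start of the "<![CDATA[" marker.
LT_NOT_CDATA = re.compile(r"<(?!!\[CDATA\[)")
# ">" unless it is the end of the "]]>" marker.
GT_NOT_CDATA = re.compile(r"(?<!\]\])>")


def escape_angle_brackets(text):
    """Escape &, < and > while leaving the CDATA markers themselves alone."""
    # Escape "&" first so the entities added below are not escaped again.
    text = text.replace("&", "&amp;")
    text = LT_NOT_CDATA.sub("&lt;", text)
    return GT_NOT_CDATA.sub("&gt;", text)
```
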
>>
>> >
>> > Also notice that the < and > signs in the JavaScript have also been
>> > encoded. I believe there should be code to check for the CDATA tags and not
>> > XML-encode the text if they exist. I'll try to track this down in the code
>> > but I wanted to make sure this wasn't done on purpose.
>> >
>> > There is another problem with CDATA. If I create an xml string with CDATA
>> > and parse it like this:
>>
>> Re: the missing CDATA wrappers:
>>
>> The problem is that when the generated code uses lxml to parse an
>> XML instance doc, lxml strips away the "<![CDATA[" and "]]>".  I
>> don't believe that we can even tell that they were there in the
>> first place.  The attached script (cdata_demo.py) attempts to
>> demonstrate this.
>>
>> So, after that XML instance doc has been parsed, there is no way to
>> tell that the CDATA tags were there in the first place.
>>
>> Wait ... I did one more Web search ...
>>
>> It's even the case that lxml has a special provision for this issue.
>> I found this: http://lxml.de/api.html#cdata
>>
>> (It's incredible what kind of hidden information you can find with a
>> Web search engine.  You should try one sometime.  But, seriously, ...)
>>
>> However, when you use ``element.text`` to capture the text data, the
>> CDATA tags are still missing, even though when you use
>> ``etree.tostring(some_element)`` they are there.
>>
>> I haven't figured out how to deal with this, yet.  I'll think a bit
>> more on it.
>>
>> If you can think of a work-around for this, please let me know.
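
For comparison, a small sketch using the standard library's xml.dom.minidom rather than lxml. This is a different parser from the one the generated code uses, shown only to illustrate that some DOM APIs keep CDATA sections as distinct nodes, which makes it possible to tell afterwards that they were there. The helper name cdata_children and the sample document are hypothetical.

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def cdata_children(element):
    """Return the CDATA section nodes that are direct children of element."""
    return [
        child for child in element.childNodes
        if child.nodeType == Node.CDATA_SECTION_NODE
    ]


doc = parseString("<script><![CDATA[if (x < 1) { y = 2; }]]></script>")
root = doc.documentElement
sections = cdata_children(root)

# The section's text is available without the markers ...
script_text = sections[0].data
# ... while serializing the element writes the markers back out.
serialized = root.toxml()
```

Whether something similar is practical inside the generated parsing code is an open question; the sketch only shows that the information can survive parsing with a DOM-style API.
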
>>
>> On an unrelated subject -- generateDS.py does not handle multiple
>> namespaces in the same XML schema, in particular when ``<xs:import
>> ...>`` is used.  I've had several reports about this.  If I recall
>> correctly, you contributed the code that implements
>> --one-file-per-xsd.  I'm wondering if that might be helpful in some
>> of these situations.  If you have any comments or suggestions about
>> this, I'd be interested in hearing them.
>>
>> And, have you had any experience with lxml.objectify?
>> (http://lxml.de/objectify.html)  I'm wondering whether it might
>> solve some of these problems (in particular the namespaces and CDATA
>> issues) better than generateDS.py does.  Maybe we can learn
>> something from it.
>>
>> More later.
>>
>> Dave
>>
>> >
>> > xml='''
>> > <cdata:cdata xmlns:cdata="urn:cdata">
>> >     <cdata:script><![CDATA[
>> >     var x, text;
>> >
>> >     // Get the value of the input field with id="numb"
>> >     x = document.getElementById("numb  one").value;
>> >
>> >     // If x is Not a Number or less than one or greater than 10
>> >     if (isNaN(x) || x &lt; 1 || x &gt; 10) {
>> >         text = "Input not valid";
>> >     } else {
>> >         text = "Input OK";
>> >     }
>> >     document.getElementById("demo").innerHTML = text;
>> > ]]></cdata:script>
>> > </cdata:cdata>
>> > '''
>> > cdata.parseString(xml)
>> >
>> > It incorrectly strips out the CDATA tags:
>> >
>> > parseString spits out xml with the CDATA tags removed:
>> >
>> > <?xml version="1.0" ?>
>> > <cdata:cdata xmlns:cdata="urn:cdata">
>> >     <cdata:script>
>> >     var x, text;
>> >
>> >     // Get the value of the input field with id="numb"
>> >     x = document.getElementById("numb  one").value;
>> >
>> >     // If x is Not a Number or less than one or greater than 10
>> >     if (isNaN(x) || x &amp;lt; 1 || x &amp;gt; 10) {
>> >         text = "Input not valid";
>> >     } else {
>> >         text = "Input OK";
>> >     }
>> >     document.getElementById("demo").innerHTML = text;
>> > </cdata:script>
>> > </cdata:cdata>
>> >
>> >
>> > And on printing the script specifically, I also don't have CDATA tags
>> > anymore and the < and > are xml encoded.
>> >
>> > print cdataObj.get_script()
>> >
>> >     var x, text;
>> >
>> >     // Get the value of the input field with id="numb"
>> >     x = document.getElementById("numb  one").value;
>> >
>> >     // If x is Not a Number or less than one or greater than 10
>> >     if (isNaN(x) || x &lt; 1 || x &gt; 10) {
>> >         text = "Input not valid";
>> >     } else {
>> >         text = "Input OK";
>> >     }
>> >     document.getElementById("demo").innerHTML = text;
>> >
>> >
>> > I'll see if I can track this down also. If you could give me a hint of
>> > where to look that would be helpful.
>> >
>> > Thanks,
>> > George
>>
>>
>> --
>>
>> Dave Kuhlman
>> http://www.davekuhlman.org
>>
>

_______________________________________________
generateds-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/generateds-users
