On Fri, Apr 24, 2015 at 04:32:17PM +1000, Adrian Cook wrote:
> Hi Dave Kuhlman and the list,
> 
> Thanks so much for generateDS, I've been using a generateDS parser
> successfully on millions of XML files.
> 
> I have a question regarding CDATA.
> 
> I'm developing in an environment where I am processing thousands of XML
> files per minute and I've been using generateDS to create the parser for
> processing. These files all contain a lot of CDATA and are from third
> parties.
> 
> It's been fine up to now because I am only using generateDS to parse the
> xml and make decisions based on that. I now have a requirement to mutate
> the data loaded using parser and export new XML.
> 
> What I am finding is that the CDATA start and end markup is lost from the
> exported text.
> 
> I've pasted an example at the bottom. This is all pretty much vanilla use
> of generateDS, parse and export using some of the unit test XSD and XML
> files.
> 
> I've read through the list archive and noted a correspondence where this is
> mentioned.
> 
> Are there any plans or approaches to address this in generateDS? I can see
> some comments in the list archive where this issue is mentioned and that
> also indicate CDATA is a poor decision and should be avoided, unfortunately
> I cannot change the dependence of my system on CDATA.

Adrian,

Good to hear from you.  I'm glad that generateDS has been useful.

Short story first --

I've patched generateDS so that it supports this.  Specifically, if you
run generateDS.py with the new command line option
"--preserve-cdata-tags", then the generated code will preserve the
CDATA tags; that is, the resulting string values will contain
"<![CDATA[" and "]]>".  If you do *not* use the
"--preserve-cdata-tags" command line option, then the behavior is
unchanged.

I've attached a patched version of generateDS.py in a separate
email.  Please let me know if this does what you expect and need.

I'm going on vacation for 3 days next week.  I've got a chance to go
car camping on the northern California, USA coast near Ft. Bragg.
But, I'll look into this some more when I return.

In the meantime, thanks for reporting this.

And now, the long story -- You can ignore the following unless you
want to learn more (maybe) about CDATA and my thinking while trying
to work my way through this.

So, let's try to be (pedantically) specific about what the problem
is:

1. generateDS handles CDATA on import/parsing (actually lxml does
   this for us).  Good.

2. generateDS handles text on output/export even when there are
   special characters in CDATA sections by escaping those special
   characters as XML entities (e.g. "&lt;").  Good.

3. generateDS does *not* preserve CDATA sections on output/export.
   Bad for some applications.
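
The three points above can be seen with lxml alone, outside of
generateDS.  Here is a minimal sketch, assuming lxml is installed; the
input is the first <script> element from the sample data in this
thread, and the variable names are only illustrative:

```python
from lxml import etree

# First <script> element from the sample data in this thread.
source = b'<script><![CDATA[ccc < ddd & eee]]></script>'

# Item 1: the default parser reads the CDATA section as ordinary text.
element = etree.fromstring(source)
text = element.text

# Items 2 and 3: on output the special characters are escaped as
# entities, and the CDATA markers are gone.
exported = etree.tostring(element)

print(text)
print(exported)
```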

There are difficulties with handling item 3, above.  Lxml normally
throws away the CDATA tags when it parses a document.  I thought
there was no way around this.  However, while thinking about your
question this morning, I decided to do one more Web search.
Actually, George David, another list member who has done some work
on this, had earlier pointed me to this capability, but I did not
read carefully enough.  Anyway, I found that there is a way to
preserve those CDATA tags by creating a special parser:

    from lxml import etree

    def test():
        # strip_cdata=False tells lxml to keep CDATA sections.
        p = etree.XMLParser(strip_cdata=False)
        d1 = etree.parse('test01.xml', parser=p)
        r1 = d1.getroot()
        print(etree.tostring(r1))

    test()

That would seem to suggest that all we have to do in the generated
code is to create a special "strip_cdata=False" parser, which would
be a simple 1 or 2 line change.  But ...

There is still a problem.  The only way to get the text with the
CDATA tags included is to serialize the element, and when you do so,
you get the surrounding XML tags as well.  For example, with the
sample data that you include below, when you do this:

    etree.tostring(element)

we'd get something like:

    <script><![CDATA[ccc < ddd & eee]]></script>
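
To make the difference concrete, here is a small sketch (assuming
lxml is installed; the variable names are illustrative) showing that
even with "strip_cdata=False" the .text attribute carries no markers.
They show up only in the serialized form:

```python
from lxml import etree

source = b'<script><![CDATA[ccc < ddd & eee]]></script>'

# Parser that keeps CDATA sections instead of turning them into text.
parser = etree.XMLParser(strip_cdata=False)
element = etree.fromstring(source, parser)

# The text value is the plain content, without CDATA markers.
plain_text = element.text

# Serializing gives the markers back, wrapped in the element's own tags.
serialized = etree.tostring(element)

print(plain_text)
print(serialized)
```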

So, in order to capture the text *with* CDATA tags, we'd have to do
something like the following:

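    import re
    # Serialize as text, without the tail; DOTALL lets "." span newlines.
    value1 = etree.tostring(element, encoding='unicode', with_tail=False)
    mo = re.search(r'^<.+?>(.*)<.+>$', value1.strip(), re.DOTALL)
    value2 = mo.group(1)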

Ick.  Maybe even: Yuck.

OK.  I'm over-reacting.  And, it can be made prettier by
pre-compiling the regular expression.
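
A possible pre-compiled version, as a sketch only.  The names
OUTER_TAGS and strip_outer_tags are made up for illustration, and the
pattern is a slightly stricter variant of the one above (it insists
that the last tag is a closing tag).  It uses only the standard
library and works on an already serialized element:

```python
import re

# Matches one serialized element: start tag, content, end tag.
# re.DOTALL lets the content span several lines.
OUTER_TAGS = re.compile(r'^<[^>]+>(.*)</[^>]+>$', re.DOTALL)


def strip_outer_tags(serialized):
    """Return the content between the outermost start and end tags,
    or None if the string does not look like a single element."""
    mo = OUTER_TAGS.match(serialized.strip())
    return mo.group(1) if mo else None


value = strip_outer_tags('<script><![CDATA[ccc < ddd & eee]]></script>')
print(value)
```

The helper expects a text string, not bytes, and returns None for an
empty element such as "<script/>", so callers would need to handle
that case.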

I'd rather not make that change for everyone, since this feature is
seldom needed, I believe.  Most users will not want to deal with the
"<![CDATA[" and "]]>" markers.

We could add (yet) another command line option to turn on this special
behavior.

OK.  I gave it a shot.  Added the command line option
("--preserve-cdata-tags").  Seems to work, but definitely needs more
testing.  I've attached a patched version of generateDS.py (in a
separate email so as not to shove a large email into the list).

Memo to Dave -- From now on, do less whining, and write more code.
Although, it is a good idea to think these things through, first.

Let me know if you really do need this behavior.  Also, let me know
if I've really implemented the behavior that you need.  Then, I'll
work on it a bit more, do more testing, create a unit test, etc.

And, you can find more information about handling CDATA sections
here:

- http://lxml.de/api.html#cdata

- http://lxml.de/parsing.html#parser-options

- http://lxml.de/FAQ.html#parsing-and-serialisation

- https://mailman-mail5.webfaction.com/pipermail/lxml/2015-February/007409.html

  This comment at the end of the above email thread:

     "I wouldn't bother. CDATA[] is more of a convenience
      work-around when you are manually editing XML. In generated
      XML, it's not very useful."

  should be a caution to us not to get too enthusiastic about using
  CDATA sections, although it sounds like in your case, they're needed.
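
For completeness, the first link above describes the other direction:
lxml's etree.CDATA wrapper, which puts a CDATA section around text
when it is assigned to an element.  A minimal sketch, assuming lxml is
installed (this is plain lxml, and not necessarily what the generateDS
patch does):

```python
from lxml import etree

element = etree.Element('script')

# Wrapping the text in etree.CDATA makes lxml write it out as a
# CDATA section instead of escaping the special characters.
element.text = etree.CDATA('ccc < ddd & eee')

serialized = etree.tostring(element)
print(serialized)
```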

Sorry for being so wordy.  I needed to get myself to think this
through.

Dave

> 
> Thanks in advance for any pointers.
> 
> Adrian Cook
> 
> Source XML:
> 
> <cdataListType>
>     <cdatalist>
>         <script><![CDATA[ccc < ddd & eee]]></script>
>     </cdatalist>
>     <cdatalist>
>         <script>aaa &lt; bbb <![CDATA[ccc < ddd]]> eee &lt; &amp;
> fff&lt;<![CDATA[ggg < & hhh]]>&amp; iii &lt; jjj</script>
>     </cdatalist>
> </cdataListType>
> 
> After export:
> 
> <cdataListType>
>     <cdatalist>
>         <script>ccc &lt; ddd &amp; eee</script>
>     </cdatalist>
>     <cdatalist>
>         <script>aaa &lt; bbb ccc &lt; ddd eee &lt; &amp; fff&lt;ggg &lt;
> &amp; hhh&amp; iii &lt; jjj</script>
>     </cdatalist>
> </cdataListType>

-- 

Dave Kuhlman
http://www.davekuhlman.org

_______________________________________________
generateds-users mailing list
generateds-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/generateds-users
