Hi Dave,

Thanks so much for the update.

I've built a new parser using the supplied patch.

For the simple case it's working fine, but for more complex XSDs it seems to be
hit and miss as to whether the CDATA is carried over from parse to export.

I've put together a zip of a test case and emailed it to you separately.

There are simple scripts to create the parser and run it. The output of the
run is the result of an export, which you can then compare to the input.
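As a rough aid for that comparison, here is a minimal sketch that lists the CDATA payloads present in the input but missing from the export. It uses only the standard library and a regular expression on the raw text, so it will also pick up anything that merely looks like a CDATA section (inside a comment, say). The names `cdata_payloads` and `missing_cdata` are made up for this example, and the sample strings are cut down from the ones quoted at the bottom of this thread.

```python
import re
from collections import Counter

# Matches one CDATA section and captures its payload. A payload cannot
# contain "]]>", so a non-greedy match up to the first "]]>" is enough.
CDATA_RE = re.compile(r'<!\[CDATA\[(.*?)\]\]>', re.DOTALL)


def cdata_payloads(xml_text):
    """Return the payload of every CDATA section, in document order."""
    return CDATA_RE.findall(xml_text)


def missing_cdata(input_xml, exported_xml):
    """Return payloads that occur more often in the input than in the export."""
    remaining = Counter(cdata_payloads(exported_xml))
    missing = []
    for payload in cdata_payloads(input_xml):
        if remaining[payload] > 0:
            remaining[payload] -= 1
        else:
            missing.append(payload)
    return missing


source = (
    '<cdataListType><cdatalist>'
    '<script><![CDATA[ccc < ddd & eee]]></script>'
    '</cdatalist></cdataListType>'
)
exported = (
    '<cdataListType><cdatalist>'
    '<script>ccc &lt; ddd &amp; eee</script>'
    '</cdatalist></cdataListType>'
)

print(missing_cdata(source, exported))  # ['ccc < ddd & eee']
print(missing_cdata(source, source))    # []
```

An empty list means every CDATA section in the input also appears in the export.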

I've not been able to identify exactly what the issue is but I'll keep
looking at it.

Cheers

Adrian



On Sun, Apr 26, 2015 at 1:07 PM, Dave Kuhlman <dkuhl...@davekuhlman.org>
wrote:

> On Fri, Apr 24, 2015 at 04:32:17PM +1000, Adrian Cook wrote:
> > Hi Dave Kuhlman and the list,
> >
> > Thanks so much for generateDS, I've been using a generateDS parser
> > successfully on millions of XML files.
> >
> > I have a question regarding CDATA.
> >
> > I'm developing in an environment where I am processing thousands of XML
> > files per minute and I've been using generateDS to create the parser for
> > processing. These files all contain a lot of CDATA and are from third
> > parties.
> >
> > It's been fine up to now because I am only using generateDS to parse the
> > XML and make decisions based on that. I now have a requirement to mutate
> > the data loaded by the parser and export new XML.
> >
> > What I am finding is that the CDATA start and end markup is lost from the
> > exported text.
> >
> > I've pasted an example at the bottom. This is all pretty much vanilla use
> > of generateDS, parse and export using some of the unit test XSD and XML
> > files.
> >
> > I've read through the list archive and noted a correspondence where this
> > is mentioned.
> >
> > Are there any plans or approaches to address this in generateDS? I can see
> > some comments in the list archive where this issue is mentioned and that
> > also indicate CDATA is a poor decision and should be avoided. Unfortunately,
> > I cannot change the dependence of my system on CDATA.
>
> Adrian,
>
> Good to hear from you.  I'm glad that generateDS has been useful.
>
> Short story first --
>
> I've patched generateDS so that it supports this.  Specifically, if you
> run generateDS.py with the new command line option
> "--preserve-cdata-tags", then the generated code will preserve the
> CDATA tags; that is, the resulting string values will contain
> "<![CDATA[" and "]]>".  And, if you do *not* use the
> "--preserve-cdata-tags" command line option, then the behavior is
> unchanged.
>
> I've attached a patched version of generateDS.py in a separate
> email.  Please let me know if this does what you expect and need.
>
> I'm going on vacation for 3 days next week.  I've got a chance to go
> car camping on the northern California coast (USA) near Ft. Bragg.
> But, I'll look into this some more when I return.
>
> In the meantime, thanks for reporting this.
>
> And now, the long story -- You can ignore the following unless you
> want to learn more (maybe) about CDATA and my thinking while trying
> to work my way through this.
>
> So, let's try to be (pedantically) specific about what the problem
> is:
>
> 1. generateDS handles CDATA on import/parsing (actually lxml does
>    this for us).  Good.
>
> 2. generateDS handles text on output/export even when there are
>    special characters in CDATA sections by escaping those special
>    characters as XML entities (e.g. "&lt;").  Good.
>
> 3. generateDS does *not* preserve CDATA sections on output/export.
>    Bad for some applications.
>
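To illustrate items 1 and 2 with the sample data from this thread: a CDATA section and the entity-escaped form carry the same character data, which is why the export is still valid XML even though the markup differs. The sketch below uses only the Python standard library (`xml.etree.ElementTree` and `xml.sax.saxutils.escape`) as a stand-in. It is not the code that generateDS or lxml actually runs.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

with_cdata = '<script><![CDATA[ccc < ddd & eee]]></script>'
with_entities = '<script>ccc &lt; ddd &amp; eee</script>'

# Both forms parse to the same text; the parser does not report
# which spelling was used in the document.
text_a = ET.fromstring(with_cdata).text
text_b = ET.fromstring(with_entities).text
print(text_a == text_b)  # True
print(text_a)            # ccc < ddd & eee

# Escaping the parsed text reproduces the entity form seen in the export.
print('<script>%s</script>' % escape(text_a))
# <script>ccc &lt; ddd &amp; eee</script>
```

So item 3 is about preserving the original spelling of the document, not about losing data.
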
> There are difficulties with handling item 3, above.  Lxml normally
> throws away the CDATA tags when it parses a document.  I thought
> there was no way around this.  However, while thinking about your
> question this morning, I decided to do one more Web search.
> Actually, George David, another list member who has done some work
> on this, had earlier pointed me to this ability, but I did not read
> carefully enough.  Anyway, I found that there is a way to preserve
> those CDATA tags by creating a special parser:
>
>     from lxml import etree
>
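>     def test():
>         # strip_cdata=False keeps CDATA sections instead of merging
>         # them into ordinary text.
>         p = etree.XMLParser(strip_cdata=False)
>         d1 = etree.parse('test01.xml', parser=p)
>         r1 = d1.getroot()
>         # encoding='unicode' returns text, so this also prints cleanly
>         # under Python 3.
>         print(etree.tostring(r1, encoding='unicode'))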
>
>     test()
>
> That would seem to suggest that all we have to do in the generated
> code is to create a special "strip_cdata=False" parser, which would
> be a simple 1 or 2 line change.  But ...
>
> There is still a problem.  The only way to get the text with the
> CDATA tags included is to serialize the element, and when you do so,
> you get the surrounding XML tags as well.  For example, with the
> sample data that you include below, when you do this:
>
>     etree.tostring(element)
>
> we'd get something like:
>
>     <script><![CDATA[ccc < ddd & eee]]></script>
>
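For anyone following along, here is a small self-contained sketch of this behaviour. It assumes lxml is installed and parses the first `<script>` element from the sample data from a string instead of a file on disk. The variable names are mine.

```python
from lxml import etree

xml = b'<script><![CDATA[ccc < ddd & eee]]></script>'

# Default parser: the CDATA section is merged into ordinary text,
# so serialization escapes the special characters.
default_root = etree.fromstring(xml)
print(etree.tostring(default_root, encoding='unicode'))
# <script>ccc &lt; ddd &amp; eee</script>

# strip_cdata=False: the CDATA section is kept in the tree.
keep_parser = etree.XMLParser(strip_cdata=False)
kept_root = etree.fromstring(xml, parser=keep_parser)
print(etree.tostring(kept_root, encoding='unicode'))
# <script><![CDATA[ccc < ddd & eee]]></script>

# .text returns the character data without the CDATA markers in both
# cases, which is why serialization is needed to see them.
print(default_root.text == kept_root.text)  # True
```
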
> So, in order to capture the text *with* CDATA tags, we'd have to do
> something like the following:
>
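>     # Serialize to text (not bytes), then strip the outer start and end
>     # tags; re.DOTALL lets the captured text span several lines.
>     value1 = etree.tostring(element, encoding='unicode').strip()
>     mo = re.search(r'^<.+?>(.*)<.+>$', value1, re.DOTALL)
>     value2 = mo.group(1)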
>
> Ick.  Maybe even: Yuck.
>
> OK.  I'm over-reacting.  And, it can be made prettier by
> pre-compiling the regular expression.
>
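A minimal sketch of the pre-compiled variant, for illustration only. It works on an already serialized element, so it needs nothing beyond the standard library. The helper name `inner_text` and the slightly stricter pattern are my own choices, not what the patch does. It assumes the start tag has no ">" inside an attribute value and that the element is not self-closing.

```python
import re

# Compiled once at import time. Group 1 is everything between the
# element's own start tag and its end tag, CDATA markers included.
_INNER_RE = re.compile(r'^<[^>]+>(.*)</[^>]+>$', re.DOTALL)


def inner_text(serialized):
    """Return the content of a serialized element without its outer tags."""
    if isinstance(serialized, bytes):
        # etree.tostring() returns bytes unless asked for text.
        serialized = serialized.decode('utf-8')
    mo = _INNER_RE.search(serialized.strip())
    if mo is None:
        raise ValueError('not a serialized element with start and end tags')
    return mo.group(1)


sample = '<script><![CDATA[ccc < ddd & eee]]></script>'
print(inner_text(sample))
# <![CDATA[ccc < ddd & eee]]>
```

Raising on a failed match avoids the `AttributeError` that `mo.group(1)` would give when `mo` is `None`.
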
> I'd rather not make that general change, since this feature is
> seldom needed, I believe.  Most users will not want to deal with the
> "<![CDATA[" and "]]>".
>
> We could add (yet) another command line option to turn on this special
> behavior.
>
> OK.  I gave it a shot.  Added the command line option
> ("--preserve-cdata-tags").  Seems to work, but definitely needs more
> testing.  I've attached a patched version of generateDS.py (in a
> separate email so as not to shove a large email into the list).
>
> Memo to Dave -- From now on, do less whining, and write more code.
> Although, it is a good idea to think these things through, first.
>
> Let me know if you really do need this behavior.  Also, let me know
> if I've really implemented the behavior that you need.  Then, I'll
> work on it a bit more, do more testing, create a unit test, etc.
>
> And, you can find more information about handling CDATA sections
> here:
>
> - http://lxml.de/api.html#cdata
>
> - http://lxml.de/parsing.html#parser-options
>
> - http://lxml.de/FAQ.html#parsing-and-serialisation
>
> - https://mailman-mail5.webfaction.com/pipermail/lxml/2015-February/007409.html
>
>   This comment at the end of the above email thread:
>
>      "I wouldn't bother. CDATA[] is more of a convenience
>       work-around when you are manually editing XML. In generated
>       XML, it's not very useful."
>
>   should be a caution to us not to get too enthusiastic about using
>   CDATA sections, although it sounds like in your case, they're needed.
>
> Sorry for being so wordy.  I needed to get myself to think this
> through.
>
> Dave
>
> >
> > Thanks in advance for any pointers.
> >
> > Adrian Cook
> >
> > Source XML:
> >
> > <cdataListType>
> >     <cdatalist>
> >         <script><![CDATA[ccc < ddd & eee]]></script>
> >     </cdatalist>
> >     <cdatalist>
> >         <script>aaa &lt; bbb <![CDATA[ccc < ddd]]> eee &lt; &amp;
> > fff&lt;<![CDATA[ggg < & hhh]]>&amp; iii &lt; jjj</script>
> >     </cdatalist>
> > </cdataListType>
> >
> > After export:
> >
> > <cdataListType>
> >     <cdatalist>
> >         <script>ccc &lt; ddd &amp; eee</script>
> >     </cdatalist>
> >     <cdatalist>
> >         <script>aaa &lt; bbb ccc &lt; ddd eee &lt; &amp; fff&lt;ggg &lt;
> > &amp; hhh&amp; iii &lt; jjj</script>
> >     </cdatalist>
> > </cdataListType>
>
> --
>
> Dave Kuhlman
> http://www.davekuhlman.org
>
>
_______________________________________________
generateds-users mailing list
generateds-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/generateds-users
