On Fri, Apr 24, 2015 at 04:32:17PM +1000, Adrian Cook wrote:
> Hi Dave Kuhlman and the list,
>
> Thanks so much for generateDS, I've been using a generateDS parser
> successfully on millions of XML files.
>
> I have a question regarding CDATA.
>
> I'm developing in an environment where I am processing thousands of XML
> files per minute and I've been using generateDS to create the parser for
> processing. These files all contain a lot of CDATA and are from third
> parties.
>
> It's been fine up to now because I am only using generateDS to parse the
> xml and make decisions based on that. I now have a requirement to mutate
> the data loaded using parser and export new XML.
>
> What I am finding is that the CDATA start and end markup is lost from the
> exported text.
>
> I've pasted an example at the bottom. This is all pretty much vanilla use
> of generateDS, parse and export using some of the unit test XSD and XML
> files.
>
> I've read through the list archive and noted a correspondence where this is
> mentioned
>
> Are there any plans or approaches to address this in generateDS? I can see
> some comments in the list archive where this issue is mentioned and that
> also indicate CDATA is a poor decision and should be avoided, unfortunately
> I cannot change the dependence of my system on CDATA.
Adrian,

Good to hear from you.  I'm glad that generateDS has been useful.

Short story first -- I've patched generateDS so that it supports
this.  Specifically, if you run generateDS.py with the new command
line option "--preserve-cdata-tags", then the generated code will
preserve the CDATA tags, that is, the resulting string values will
contain "<![CDATA[" and "]]>".  And, if you do *not* use the
"--preserve-cdata-tags" command line option, then the behavior is
unchanged.

I've attached a patched version of generateDS.py in a separate
email.  Please let me know if this does what you expect and need.

I'm going on vacation for 3 days next week.  I've got a chance to go
car camping on the northern California, USA coast near Ft. Bragg.
But, I'll look into this some more when I return.

In the meantime, thanks for reporting this.

And now, the long story -- You can ignore the following unless you
want to learn more (maybe) about CDATA and my thinking while trying
to work my way through this.

So, let's try to be (pedantically) specific about what the problem
is:

1. generateDS handles CDATA on import/parsing (actually lxml does
   this for us).  Good.

2. generateDS handles text on output/export even when there are
   special characters in CDATA sections by escaping those special
   characters as XML entities (e.g. "&lt;").  Good.

3. generateDS does *not* preserve CDATA sections on output/export.
   Bad for some applications.

There are difficulties with handling item 3, above.  Lxml normally
throws away the CDATA tags when it parses a document.  I thought
there was no way around this.  However, while thinking about your
question this morning, I decided to do one more Web search.
Actually, George David, another list member who has done some work
on this, had earlier pointed me at this ability, but I did not read
carefully enough.
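To make items 1 to 3 above concrete, here is a minimal sketch of the
default round trip.  It uses the standard library's
xml.etree.ElementTree as a stand-in for lxml (an assumption made only
so the sketch runs without extra dependencies; generateDS itself uses
lxml), and the helper name round_trip is made up for illustration.

```python
import xml.etree.ElementTree as ET


def round_trip(xml_text):
    """Parse a small XML string, then serialize it again.

    Returns (text, exported): the element text as seen after parsing
    and the re-serialized element.
    """
    # Stand-in for lxml: the stdlib parser also folds CDATA sections
    # into plain text, so the CDATA markers are gone after parsing.
    element = ET.fromstring(xml_text)
    # On export, special characters are escaped as entities instead.
    exported = ET.tostring(element, encoding='unicode')
    return element.text, exported


text, exported = round_trip('<script><![CDATA[ccc < ddd & eee]]></script>')
print(text)      # ccc < ddd & eee
print(exported)  # <script>ccc &lt; ddd &amp; eee</script>
```

The exported document is still well formed and carries the same
character data; only the CDATA markup itself is lost, which is item 3.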
Anyway, I found that there is a way to preserve those CDATA tags by
creating a special parser:

    from lxml import etree

    def test():
        # strip_cdata=False keeps CDATA sections instead of folding
        # them into plain text.
        p = etree.XMLParser(strip_cdata=False)
        d1 = etree.parse('test01.xml', parser=p)
        r1 = d1.getroot()
        print(etree.tostring(r1, encoding='unicode'))

    test()

That would seem to suggest that all we have to do in the generated
code is to create a special "strip_cdata=False" parser, which would
be a simple 1 or 2 line change.

But ...  There is still a problem.  The only way to get the text
with the CDATA tags included is to serialize the element, and when
you do so, you get the surrounding XML tags as well.  For example,
with the sample data that you include below, when you do this:

    etree.tostring(element)

we'd get something like:

    <script><![CDATA[ccc < ddd & eee]]></script>

So, in order to capture the text *with* CDATA tags, we'd have to do
something like the following:

    import re

    # Serialize to text, then strip the outer start and end tags.
    value1 = etree.tostring(element, encoding='unicode').strip()
    mo = re.search(r'^<.+?>(.*)<.+>$', value1)
    value2 = mo.group(1)

Ick.  Maybe even: Yuck.

OK.  I'm over-reacting.  And, it can be made prettier by
pre-compiling the regular expression.

I'd rather not make that general change, since this feature is
seldom needed, I believe.  Most users will not want to deal with the
"<![CDATA[" and "]]>".  We could add (yet) another command line
option to turn on this special behavior.

OK.  I gave it a shot.  Added the command line option
("--preserve-cdata-tags").  Seems to work, but definitely needs more
testing.  I've attached a patched version of generateDS.py (in a
separate email so as not to shove a large email into the list).

Memo to Dave -- From now on, do less whining, and write more code.
Although, it is a good idea to think these things through, first.

Let me know if you really do need this behavior.  Also, let me know
if I've really implemented the behavior that you need.  Then, I'll
work on it a bit more, do more testing, create a unit test, etc.
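The tag-stripping step described above, with the regular expression
pre-compiled, might look roughly like the sketch below.  This is only
an illustration, not the code that went into the patched
generateDS.py: the names _INNER_XML and inner_xml are hypothetical,
the re.DOTALL flag is an addition so that multi-line text matches,
and the function works on an already serialized string so that it
runs without lxml.

```python
import re

# Pre-compiled once at import time.  re.DOTALL is an addition here
# (an assumption) so that text spanning several lines still matches.
_INNER_XML = re.compile(r'^<.+?>(.*)<.+>$', re.DOTALL)


def inner_xml(serialized):
    """Return what lies between the outer start and end tags.

    `serialized` is one serialized element as a str.  Returns None when
    there is no separate end tag (for example '<script/>').
    """
    mo = _INNER_XML.search(serialized.strip())
    return mo.group(1) if mo else None


print(inner_xml('<script><![CDATA[ccc < ddd & eee]]></script>'))
# <![CDATA[ccc < ddd & eee]]>
```

With lxml, the input string could come from something like
etree.tostring(element, encoding='unicode', with_tail=False) on an
element parsed with strip_cdata=False.  Note that the pattern assumes
no ">" character appears inside an attribute value of the start tag.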
And, you can find more information about handling CDATA sections
here:

- http://lxml.de/api.html#cdata

- http://lxml.de/parsing.html#parser-options

- http://lxml.de/FAQ.html#parsing-and-serialisation

- https://mailman-mail5.webfaction.com/pipermail/lxml/2015-February/007409.html

This comment at the end of the above email thread:

    "I wouldn't bother. CDATA[] is more of a convenience work-around
    when you are manually editing XML. In generated XML, it's not
    very useful."

should be a caution to us not to get too enthusiastic about using
CDATA sections, although it sounds like, in your case, they're
needed.

Sorry for being so wordy.  I needed to get myself to think this
through.

Dave

>
> Thanks in advance for any pointers.
>
> Adrian Cook
>
> Source XML:
>
> <cdataListType>
>     <cdatalist>
>         <script><![CDATA[ccc < ddd & eee]]></script>
>     </cdatalist>
>     <cdatalist>
>         <script>aaa &lt; bbb <![CDATA[ccc < ddd]]> eee &lt; &amp; fff&lt;<![CDATA[ggg < & hhh]]>&amp; iii &lt; jjj</script>
>     </cdatalist>
> </cdataListType>
>
> After export:
>
> <cdataListType>
>     <cdatalist>
>         <script>ccc &lt; ddd &amp; eee</script>
>     </cdatalist>
>     <cdatalist>
>         <script>aaa &lt; bbb ccc &lt; ddd eee &lt; &amp; fff&lt;ggg &lt; &amp; hhh&amp; iii &lt; jjj</script>
>     </cdatalist>
> </cdataListType>

-- 
Dave Kuhlman
http://www.davekuhlman.org

_______________________________________________
generateds-users mailing list
generateds-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/generateds-users