On Mon, Apr 27, 2015 at 01:10:03PM +1000, Adrian Cook wrote:
> Hi Dave,
>
> Thanks so much for the update.
>
> I've built a new parser using the supplied patch.
>
> For the simple case it's working fine, for more complex XSDs it seems
> to be hit and miss as to whether the CDATA is held over from parse to
> export.
>
> I've put together a zip of a test case and emailed it to you
> separately.
>
> There are simple scripts to create the parser and run it, the output of
> the run is the result of an export, you can then compare that to the
> input.
>
> I've not been able to identify exactly what the issue is but I'll keep
> looking at it.

Adrian,

Thanks for the test files that I can work with.

As I mentioned, I'll be out of town for most of the week. I'll get
on this when I return on Friday.

If you find something that you think is a clue, let me know. There
is a good chance that I have not covered all the possible cases,
yet. For example, there may be an additional simpleType that we
need to handle.

Dave

> Cheers
> Adrian
>
> On Sun, Apr 26, 2015 at 1:07 PM, Dave Kuhlman
> <[1]dkuhl...@davekuhlman.org> wrote:
>
> > On Fri, Apr 24, 2015 at 04:32:17PM +1000, Adrian Cook wrote:
> > Hi Dave Kuhlman and the list,
> >
> > Thanks so much for generateDS, I've been using a generateDS parser
> > successfully on millions of XML files.
> >
> > I have a question regarding CDATA.
> >
> > I'm developing in an environment where I am processing thousands
> > of XML files per minute and I've been using generateDS to create
> > the parser for processing. These files all contain a lot of CDATA
> > and are from third parties.
> >
> > It's been fine up to now because I am only using generateDS to
> > parse the xml and make decisions based on that. I now have a
> > requirement to mutate the data loaded using parser and export new
> > XML.
> >
> > What I am finding is that the CDATA start and end markup is lost
> > from the exported text.
> >
> > I've pasted an example at the bottom. This is all pretty much
> > vanilla use of generateDS, parse and export using some of the unit
> > test XSD and XML files.
> >
> > I've read through the list archive and noted a correspondence
> > where this is mentioned
> >
> > Are there any plans or approaches to address this in generateDS? I
> > can see some comments in the list archive where this issue is
> > mentioned and that also indicate CDATA is a poor decision and
> > should be avoided, unfortunately I cannot change the dependence of
> > my system on CDATA.
>
> Adrian,
>
> Good to hear from you. I'm glad that generateDS has been useful.
>
> Short story first --
>
> I've patched generateDS.py so that it has support for this.
> Specifically, if you run generateDS.py with the new command line
> option "--preserve-cdata-tags", then the generated code will
> preserve the CDATA tags, that is, the resulting string values will
> contain "<![CDATA[" and "]]>". And, if you do *not* use the
> "--preserve-cdata-tags" command line option, then the behavior is
> unchanged.
>
> I've attached a patched version of generateDS.py in a separate
> email. Please let me know if this does what you expect and need.
>
> I'm going on vacation for 3 days next week. I've got a chance to go
> car camping on the northern California, USA coast near Ft. Bragg.
> But, I'll look into this some more when I return.
>
> In the meantime, thanks for reporting this.
>
> And now, the long story -- You can ignore the following unless you
> want to learn more (maybe) about CDATA and my thinking while trying
> to work my way through this.
>
> So, let's try to be (pedantically) specific about what the problem
> is:
>
> 1. generateDS handles CDATA on import/parsing (actually lxml does
>    this for us). Good.
>
> 2. generateDS handles text on output/export even when there are
>    special characters in CDATA sections by escaping those special
>    characters as XML entities (e.g. "&lt;"). Good.
>
> 3. generateDS does *not* preserve CDATA sections on output/export.
>    Bad for some applications.
>
> There are difficulties with handling item 3, above. Lxml normally
> throws away the CDATA tags when it parses a document. I thought
> there was no way around this. However, while thinking about your
> question, this morning, I decided to do one more Web search.
> Actually, George David, another list member who has done some work
> on this, had earlier pointed me at this ability, but I did not read
> carefully enough. Anyway, I found that there is a way to preserve
> those CDATA tags by creating a special parser:
>
>     from lxml import etree
>
>     def test():
>         # strip_cdata=False keeps CDATA sections in the parsed tree
>         p = etree.XMLParser(strip_cdata=False)
>         d1 = etree.parse('test01.xml', parser=p)
>         r1 = d1.getroot()
>         print(etree.tostring(r1))
>
>     test()
>
> That would seem to suggest that all we have to do in the generated
> code is to create a special "strip_cdata=False" parser, which would
> be a simple 1 or 2 line change. But ...
>
> There is still a problem. The only way to get the text with the
> CDATA tags included is to serialize the element, and when you do so,
> you get the surrounding XML tags as well. For example, with the
> sample data that you include below, when you do this:
>
>     etree.tostring(element)
>
> we'd get something like:
>
>     <script><![CDATA[ccc < ddd & eee]]></script>
>
> So, in order to capture the text *with* CDATA tags, we'd have to do
> something like the following:
>
>     value1 = etree.tostring(element).strip()
>     mo = re.search(r'^<.+?>(.*)<.+>$', value1)
>     value2 = mo.group(1)
>
> Ick. Maybe even: Yuck.
>
> OK. I'm over-reacting. And, it can be made prettier by
> pre-compiling the regular expression.
>
> I'd rather not make that general change, since this feature is
> seldom needed, I believe. Most users will not want to deal with the
> "<![CDATA[" and "]]>".
>
> We could add (yet) another command line option to turn on this
> special behavior.
>
> OK. I gave it a shot. Added the command line option
> ("--preserve-cdata-tags"). Seems to work, but definitely needs more
> testing. I've attached a patched version of generateDS.py (in a
> separate email so as not to shove a large email into the list).
>
> Memo to Dave -- From now on, do less whining, and write more code.
> Although, it is a good idea to think these things through, first.
>
> Let me know if you really do need this behavior. Also, let me know
> if I've really implemented the behavior that you need. Then, I'll
> work on it a bit more, do more testing, create a unit test, etc.
>
> And, you can find more information about handling CDATA sections
> here:
>
> - [2]http://lxml.de/api.html#cdata
> - [3]http://lxml.de/parsing.html#parser-options
> - [4]http://lxml.de/FAQ.html#parsing-and-serialisation
> - [5]https://mailman-mail5.webfaction.com/pipermail/lxml/2015-February/007409.html
>
>   This comment at the end of the above email thread:
>
>       "I wouldn't bother. CDATA[] is more of a convenience
>       work-around when you are manually editing XML. In generated
>       XML, it's not very useful."
>
>   should be a caution to us not to get too enthusiastic about using
>   CDATA sections, although it sounds like in your case, it's needed.
>
> Sorry, for being so wordy. I needed to get myself to think this
> through.
>
> Dave
>
> > Thanks in advance for any pointers.
> >
> > Adrian Cook
> >
> > Source XML:
> >
> > <cdataListType>
> >     <cdatalist>
> >         <script><![CDATA[ccc < ddd & eee]]></script>
> >     </cdatalist>
> >     <cdatalist>
> >         <script>aaa &lt; bbb <![CDATA[ccc < ddd]]> eee &lt; &amp; fff&lt;<![CDATA[ggg < & hhh]]>&amp; iii &lt; jjj</script>
> >     </cdatalist>
> > </cdataListType>
> >
> > After export:
> >
> > <cdataListType>
> >     <cdatalist>
> >         <script>ccc &lt; ddd &amp; eee</script>
> >     </cdatalist>
> >     <cdatalist>
> >         <script>aaa &lt; bbb ccc &lt; ddd eee &lt; &amp; fff&lt;ggg &lt; &amp; hhh&amp; iii &lt; jjj</script>
> >     </cdatalist>
> > </cdataListType>
>
> --
> Dave Kuhlman
> [6]http://www.davekuhlman.org
>
> References
>
> 1. mailto:dkuhl...@davekuhlman.org
> 2. http://lxml.de/api.html#cdata
> 3. http://lxml.de/parsing.html#parser-options
> 4. http://lxml.de/FAQ.html#parsing-and-serialisation
> 5. https://mailman-mail5.webfaction.com/pipermail/lxml/2015-February/007409.html
> 6. http://www.davekuhlman.org/

--
Dave Kuhlman
http://www.davekuhlman.org

_______________________________________________
generateds-users mailing list
generateds-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/generateds-users
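
For anyone following the regular-expression step discussed in the quoted message, here is a minimal Python 3 sketch of it. This is not the code that generateDS.py generates. The function name inner_content and the exact pattern are assumptions made for illustration, and the input is assumed to be the decoded text of one serialized element, such as lxml's etree.tostring() might return for a tree parsed with strip_cdata=False.

```python
import re

# Compiled once and reused. DOTALL lets the captured content span
# several lines, which CDATA sections often do.
_INNER = re.compile(r'^<[^>]+>(.*)</[^>]+>$', re.DOTALL)


def inner_content(serialized):
    """Return the text between an element's start tag and end tag.

    `serialized` is assumed to hold exactly one serialized element
    (hypothetical helper, not part of generateDS or lxml).
    """
    match = _INNER.match(serialized.strip())
    # An empty element such as <script/> has no end tag, so nothing matches.
    return match.group(1) if match else ''


# CDATA markers survive because only the outer tags are removed.
simple = inner_content('<script><![CDATA[ccc < ddd & eee]]></script>')

# Mixed content: escaped text and a CDATA section, spanning two lines.
mixed = inner_content(
    '<script>aaa &lt; bbb <![CDATA[ccc < ddd]]>\n eee</script>')

print(simple)
print(mixed)
```

The pattern differs slightly from the one in the message: re.DOTALL is added so that multi-line content is captured, and the closing part must start with "</" so that only a real end tag is consumed. A start tag with a ">" inside an attribute value would still confuse it, so treat this as a sketch rather than a general XML tool.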