Hi Dave,

I'm having trouble with CDATA again: one of the CDATA entries in my XML is being truncated. I have identified where this happens and have modified the regex that causes the problem. I thought I'd pass it on in case you want to include it in your codebase.
In the get_all_text_(node) function generated by generateDS, you use a regex to match the start and end of the tag in order to preserve the CDATA:

    PRESERVE_CDATA_TAGS_PAT1 = re_.compile(r'^<.+?>(.*?)</?[a-zA-Z0-9\-]+>.*$')

However, if the CDATA contains HTML, the regex matches a closing tag inside the CDATA rather than the closing tag surrounding the CDATA. For example, given:

    <HTMLResource><![CDATA[<a href="http://google.com"/></a>]]></HTMLResource>

your regex extracts:

    <![CDATA[<a href="http://google.com"/></a>

instead of:

    <![CDATA[<a href="http://google.com"/></a>]]>

I have modified the regex to match up to the last closing tag:

    ^<.+?>(.*?)</?[a-zA-Z0-9\-]+>(?!.*</?[a-zA-Z0-9\-]+>)

and it matches correctly now. There is bound to be a more elegant way of writing the regex, but this worked for me.

Let me know if I've not been clear and/or you want more information.

Cheers
Adrian

On Sun, Apr 26, 2015 at 1:07 PM, Dave Kuhlman <dkuhl...@davekuhlman.org> wrote:

> On Fri, Apr 24, 2015 at 04:32:17PM +1000, Adrian Cook wrote:
> > Hi Dave Kuhlman and the list,
> >
> > Thanks so much for generateDS, I've been using a generateDS parser
> > successfully on millions of XML files.
> >
> > I have a question regarding CDATA.
> >
> > I'm developing in an environment where I am processing thousands of XML
> > files per minute and I've been using generateDS to create the parser for
> > processing. These files all contain a lot of CDATA and are from third
> > parties.
> >
> > It's been fine up to now because I am only using generateDS to parse the
> > XML and make decisions based on that. I now have a requirement to mutate
> > the data loaded using the parser and export new XML.
> >
> > What I am finding is that the CDATA start and end markup is lost from the
> > exported text.
> >
> > I've pasted an example at the bottom. This is all pretty much vanilla use
> > of generateDS, parse and export using some of the unit test XSD and XML
> > files.
> >
> > I've read through the list archive and noted a correspondence where this
> > is mentioned.
> >
> > Are there any plans or approaches to address this in generateDS? I can see
> > some comments in the list archive where this issue is mentioned and that
> > also indicate CDATA is a poor decision and should be avoided.
> > Unfortunately, I cannot change the dependence of my system on CDATA.
>
> Adrian,
>
> Good to hear from you. I'm glad that generateDS has been useful.
>
> Short story first --
>
> I've patched generateDS.py so that it has support for this.
> Specifically, if you run generateDS.py with the new command line option
> "--preserve-cdata-tags", then the generated code will preserve the
> CDATA tags, that is, the resulting string values will contain
> "<![CDATA[" and "]]>". And, if you do *not* use the
> "--preserve-cdata-tags" command line option, then the behavior is
> unchanged.
>
> I've attached a patched version of generateDS.py in a separate
> email. Please let me know if this does what you expect and need.
>
> I'm going on vacation for 3 days next week. I've got a chance to go
> car camping on the north California, USA coast near Ft. Bragg.
> But, I'll look into this some more when I return.
>
> In the meantime, thanks for reporting this.
>
> And now, the long story -- You can ignore the following unless you
> want to learn more (maybe) about CDATA and my thinking while trying
> to work my way through this.
>
> So, let's try to be (pedantically) specific about what the problem
> is:
>
> 1. generateDS handles CDATA on import/parsing (actually lxml does
>    this for us). Good.
>
> 2. generateDS handles text on output/export even when there are
>    special characters in CDATA sections by escaping those special
>    characters as XML entities (e.g. "&lt;"). Good.
>
> 3. generateDS does *not* preserve CDATA sections on output/export.
>    Bad for some applications.
>
> There are difficulties with handling item 3, above.
> Lxml normally throws away the CDATA tags when it parses a document.
> I thought there was no way around this. However, while thinking about
> your question this morning, I decided to do one more Web search.
> Actually, George David, another list member who has done some work
> on this, had earlier pointed me at this ability, but I did not read
> carefully enough. Anyway, I found that there is a way to preserve
> those CDATA tags by creating a special parser:
>
>     from lxml import etree
>
>     def test():
>         # Keep CDATA sections instead of converting them to plain text.
>         p = etree.XMLParser(strip_cdata=False)
>         d1 = etree.parse('test01.xml', parser=p)
>         r1 = d1.getroot()
>         print(etree.tostring(r1))
>
>     test()
>
> That would seem to suggest that all we have to do in the generated
> code is to create a special "strip_cdata=False" parser, which would
> be a simple 1 or 2 line change. But ...
>
> There is still a problem. The only way to get the text with the
> CDATA tags included is to serialize the element, and when you do so,
> you get the surrounding XML tags as well. For example, with the
> sample data that you include below, when you do this:
>
>     etree.tostring(element)
>
> we'd get something like:
>
>     <script><![CDATA[ccc < ddd & eee]]></script>
>
> So, in order to capture the text *with* CDATA tags, we'd have to do
> something like the following:
>
>     value1 = etree.tostring(element).strip()
>     mo = re.search(r'^<.+?>(.*)<.+>$', value1)
>     value2 = mo.group(1)
>
> Ick. Maybe even: Yuck.
>
> OK. I'm over-reacting. And, it can be made prettier by
> pre-compiling the regular expression.
>
> I'd rather not make that general change, since this feature is
> seldom needed, I believe. Most users will not want to deal with the
> "<![CDATA[" and "]]>".
>
> We could add (yet) another command line option to turn on this special
> behavior.
>
> OK. I gave it a shot. Added the command line option
> ("--preserve-cdata-tags"). Seems to work, but definitely needs more
> testing.
> I've attached a patched version of generateDS.py (in a
> separate email so as not to shove a large email into the list).
>
> Memo to Dave -- From now on, do less whining, and write more code.
> Although, it is a good idea to think these things through, first.
>
> Let me know if you really do need this behavior. Also, let me know
> if I've really implemented the behavior that you need. Then, I'll
> work on it a bit more, do more testing, create a unit test, etc.
>
> And, you can find more information about handling CDATA sections
> here:
>
> - http://lxml.de/api.html#cdata
>
> - http://lxml.de/parsing.html#parser-options
>
> - http://lxml.de/FAQ.html#parsing-and-serialisation
>
> - https://mailman-mail5.webfaction.com/pipermail/lxml/2015-February/007409.html
>
> This comment at the end of the above email thread:
>
>     "I wouldn't bother. CDATA[] is more of a convenience
>     work-around when you are manually editing XML. In generated
>     XML, it's not very useful."
>
> should be a caution to us not to get too enthusiastic about using
> CDATA sections, although it sounds like in your case, it's needed.
>
> Sorry for being so wordy. I needed to get myself to think this
> through.
>
> Dave
>
> >
> > Thanks in advance for any pointers.
> >
> > Adrian Cook
> >
> > Source XML:
> >
> > <cdataListType>
> >   <cdatalist>
> >     <script><![CDATA[ccc < ddd & eee]]></script>
> >   </cdatalist>
> >   <cdatalist>
> >     <script>aaa &lt; bbb <![CDATA[ccc < ddd]]> eee &lt; &amp; fff&lt;<![CDATA[ggg < & hhh]]>&amp; iii &lt; jjj</script>
> >   </cdatalist>
> > </cdataListType>
> >
> > After export:
> >
> > <cdataListType>
> >   <cdatalist>
> >     <script>ccc &lt; ddd &amp; eee</script>
> >   </cdatalist>
> >   <cdatalist>
> >     <script>aaa &lt; bbb ccc &lt; ddd eee &lt; &amp; fff&lt;ggg &lt; &amp; hhh&amp; iii &lt; jjj</script>
> >   </cdatalist>
> > </cdataListType>
>
> --
>
> Dave Kuhlman
> http://www.davekuhlman.org
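
For anyone who wants to compare the two patterns from the top of this message without installing lxml, here is a minimal sketch. It applies both regexes to a hand-written string that stands in for the output of etree.tostring(node). The names ORIGINAL_PAT, MODIFIED_PAT, HTML_SAMPLE, PLAIN_SAMPLE and inner_text are hypothetical and exist only for this illustration; they are not part of the generated code.

```python
import re

# Pattern quoted above from the generated get_all_text_ helper.
ORIGINAL_PAT = re.compile(r'^<.+?>(.*?)</?[a-zA-Z0-9\-]+>.*$')

# Modified pattern: the negative lookahead only accepts a closing tag
# that is not followed by any other tag, i.e. the last one in the string.
MODIFIED_PAT = re.compile(
    r'^<.+?>(.*?)</?[a-zA-Z0-9\-]+>(?!.*</?[a-zA-Z0-9\-]+>)')

# Hand-written stand-ins for a serialized element (assumption: this is
# roughly what serializing the node yields when CDATA is preserved).
HTML_SAMPLE = ('<HTMLResource><![CDATA[<a href="http://google.com"/></a>]]>'
               '</HTMLResource>')
PLAIN_SAMPLE = '<script><![CDATA[ccc < ddd & eee]]></script>'


def inner_text(pattern, serialized):
    """Return capture group 1 (the element content), or None if no match."""
    mo = pattern.search(serialized.strip())
    return mo.group(1) if mo is not None else None


# With HTML inside the CDATA, the original pattern stops at a tag inside
# the CDATA section, so the "]]>" terminator is lost.
original_html = inner_text(ORIGINAL_PAT, HTML_SAMPLE)
# The modified pattern keeps the whole CDATA section.
modified_html = inner_text(MODIFIED_PAT, HTML_SAMPLE)

# Without tag-like text inside the CDATA, both patterns agree.
original_plain = inner_text(ORIGINAL_PAT, PLAIN_SAMPLE)
modified_plain = inner_text(MODIFIED_PAT, PLAIN_SAMPLE)

print(original_html)
print(modified_html)
print(original_plain == modified_plain)
```

The lookahead is a blunt instrument: it assumes the element's own closing tag is the last tag-like text in the serialized string, so it is only a sketch of the idea and has not been checked against elements that have child elements or tail text.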
------------------------------------------------------------------------------
_______________________________________________
generateds-users mailing list
generateds-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/generateds-users