On Wed, Aug 05, 2015 at 02:25:28PM +1000, Adrian Cook wrote:
> Hi Dave,
>
> I'm having trouble with CDATA again: one of the CDATA entries in my
> XML is being truncated. I have identified where this is happening and have
> modified the regex that's causing the problem. I thought I'd pass it on to
> see if you wanted to include it in your codebase.
>
> In the function created by GenerateDS:
> def get_all_text_(node):
>
> you use a regex to match the start and end of the tag to preserve the CDATA:
> PRESERVE_CDATA_TAGS_PAT1 = re_.compile(r'^<.+?>(.*?)</?[a-zA-Z0-9\-]+>.*$')
>
> However, if the CDATA contains HTML, then the regex matches a closing tag
> inside the CDATA and not the closing tag surrounding the CDATA.
>
> For example:
> <HTMLResource><![CDATA[<a href="http://google.com"/></a>]]></HTMLResource>
> Your regex extracts:
> <![CDATA[<a href="http://google.com"/></a>
> instead of
> <![CDATA[<a href="http://google.com"/></a>]]>
>
> I have modified the regex to match up to the last closing tag:
> ^<.+?>(.*?)</?[a-zA-Z0-9\-]+>(?!.*</?[a-zA-Z0-9\-]+>)
>
> and it matches correctly now.

Adrian,

Good to hear from you again.

I've done a test with both the old regex pattern and your new one.
My test shows that you are right. The old one drops the ending
"]]>", whereas your new pattern successfully captures it.

So, I've updated the code in my version of generateDS.py with your
new pattern. Thanks for this fix.

> There is bound to be a more elegant way of doing the Regex but this worked
> for me.

I was unaware that there is such a thing as an elegant regular
expression.

Old regex joke: I have this problem. Maybe I can solve this problem
with a regular expression. Oops. Now, I have two problems.

Dave

--
Dave Kuhlman
http://www.davekuhlman.org
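
For anyone who wants to reproduce the comparison discussed in this thread, here is a minimal sketch. It is not the test Dave ran and not code taken from generateDS.py: the names OLD_PAT, NEW_PAT, SAMPLE and the extract() helper are made up for illustration, and it assumes the pattern is applied with match() and that capture group 1 is the text of interest. The two patterns and the sample element are copied from the message above.

```python
import re

# Old pattern, as quoted in the message above.
OLD_PAT = re.compile(r'^<.+?>(.*?)</?[a-zA-Z0-9\-]+>.*$')

# Proposed pattern: the trailing negative lookahead rejects any candidate
# end tag that is still followed by another tag, so the match can only
# end at the last tag in the string.
NEW_PAT = re.compile(
    r'^<.+?>(.*?)</?[a-zA-Z0-9\-]+>(?!.*</?[a-zA-Z0-9\-]+>)')

# Sample element from the message: the CDATA section contains HTML tags.
SAMPLE = ('<HTMLResource><![CDATA[<a href="http://google.com"/></a>]]>'
          '</HTMLResource>')
EXPECTED = '<![CDATA[<a href="http://google.com"/></a>]]>'


def extract(pattern, text):
    """Hypothetical helper: return capture group 1, or None on no match."""
    match = pattern.match(text)
    return match.group(1) if match else None


old_result = extract(OLD_PAT, SAMPLE)
new_result = extract(NEW_PAT, SAMPLE)

print('old:', old_result)
print('new:', new_result)
print('old keeps "]]>":', old_result.endswith(']]>'))
print('new matches expected:', new_result == EXPECTED)
```

With the old pattern, the lazy group stops at a closing tag inside the CDATA section, so the captured text loses the trailing "]]>". With the new pattern, the lookahead forces the group to run up to the outermost closing tag. Exactly where the old capture stops may depend on how the surrounding code uses the match, so the sketch only checks for the missing "]]>".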