On Wed, Aug 05, 2015 at 02:25:28PM +1000, Adrian Cook wrote:
> Hi Dave,
> 
> I'm having trouble with CDATA again: one of the CDATA entries in my XML is
> being truncated. I have identified where this happens and modified the
> regex that's causing the problem, so I thought I'd pass it on to see if
> you wanted to include it in your codebase.
> 
> In the function generated by generateDS:
> def get_all_text_(node):
> 
> you use a regex to match the start and end of the tag, to preserve the CDATA:
> PRESERVE_CDATA_TAGS_PAT1 = re_.compile(r'^<.+?>(.*?)</?[a-zA-Z0-9\-]+>.*$')
> 
> However, if the CDATA contains HTML, the regex matches a closing tag inside
> the CDATA rather than the closing tag surrounding the CDATA.
> 
> For example:
> <HTMLResource><![CDATA[<a href="http://google.com"/></a>]]></HTMLResource>
> Your regex extracts:
> <![CDATA[<a href="http://google.com"/></a>
> instead of
> <![CDATA[<a href="http://google.com"/></a>]]>
> 
> I have modified the regex so that it matches up to the last closing tag:
> ^<.+?>(.*?)</?[a-zA-Z0-9\-]+>(?!.*</?[a-zA-Z0-9\-]+>)
> 
> and it matches correctly now.

Adrian,

Good to hear from you again.

I've done a test with both the old regex pattern and your new one.
My test shows that you are right.  The old one drops the ending
"]]>", whereas your new pattern successfully captures it.
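
For anyone who wants to reproduce the difference, a check along the
following lines should do.  It is only a minimal sketch: it uses the
standard re module and the two patterns and the sample element quoted
above, and the names OLD_PAT, NEW_PAT, SAMPLE, and inner_text are made
up for this illustration (they are not part of generateDS.py).

```python
import re

# Both patterns are copied from the quoted message above.
OLD_PAT = re.compile(r'^<.+?>(.*?)</?[a-zA-Z0-9\-]+>.*$')
NEW_PAT = re.compile(r'^<.+?>(.*?)</?[a-zA-Z0-9\-]+>(?!.*</?[a-zA-Z0-9\-]+>)')

# The sample element from the quoted message, serialized on one line.
SAMPLE = '<HTMLResource><![CDATA[<a href="http://google.com"/></a>]]></HTMLResource>'


def inner_text(pattern, serialized):
    """Return group 1 of the first match, or None if nothing matches."""
    mo = pattern.search(serialized)
    return mo.group(1) if mo is not None else None


old_text = inner_text(OLD_PAT, SAMPLE)
new_text = inner_text(NEW_PAT, SAMPLE)

# The lazy group in the old pattern stops at the first closing tag it
# can find, which sits inside the CDATA section.
print(old_text.endswith(']]>'))

# The negative lookahead in the new pattern rejects any closing tag that
# is followed by another tag, so the group runs to the last one.
print(new_text.endswith(']]>'))
print(new_text)
```

The first print shows False for the old pattern and the second shows
True for the new one.  Neither pattern is compiled with re.DOTALL here,
so this sketch only covers an element serialized on a single line.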

So, I've updated the code in my version of generateDS.py with your
new pattern.

Thanks for this fix.

> 
> There is bound to be a more elegant way of doing the Regex but this worked
> for me.

I was unaware that there is such a thing as an elegant regular
expression.

Old regex joke:

    I have this problem.
    Maybe I can solve this problem with a regular expression.
    Oops.  Now, I have two problems.

Dave



-- 

Dave Kuhlman
http://www.davekuhlman.org
