Hi Dave,

I'm having trouble with CDATA again: one of the CDATA entries in my
XML is being truncated. I have identified where this happens and have
modified the regex that causes the problem. I thought I'd pass it on in
case you want to include it in your codebase.

In this function generated by generateDS:
def get_all_text_(node):

you use a regex to match the start and end of the tag, so as to preserve the
CDATA:
PRESERVE_CDATA_TAGS_PAT1 = re_.compile(r'^<.+?>(.*?)</?[a-zA-Z0-9\-]+>.*$')

However, if the CDATA contains HTML, the regex matches a closing tag inside
the CDATA rather than the closing tag surrounding the CDATA.

For example:
<HTMLResource><![CDATA[<a href="http://google.com"/></a>]]></HTMLResource>
Your regex extracts:
<![CDATA[<a href="http://google.com"/></a>
instead of
<![CDATA[<a href="http://google.com"/></a>]]>

I have modified the regex to match up to the last closing tag:
^<.+?>(.*?)</?[a-zA-Z0-9\-]+>(?!.*</?[a-zA-Z0-9\-]+>)

and it matches correctly now.
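
For anyone who wants to check the difference, here is a minimal sketch that runs both patterns over the example above using only the standard `re` module. The names ORIGINAL_PAT, MODIFIED_PAT, SAMPLE and extract() are made up for this illustration and are not part of generateDS; the sketch also assumes the serialized element sits on a single line, since neither pattern uses re.DOTALL.

```python
import re

# Pattern as quoted from the generated code (hypothetical name).
ORIGINAL_PAT = re.compile(r'^<.+?>(.*?)</?[a-zA-Z0-9\-]+>.*$')

# Modified pattern: the negative lookahead rejects any tag that is
# still followed by another tag, so only the last tag can match.
MODIFIED_PAT = re.compile(
    r'^<.+?>(.*?)</?[a-zA-Z0-9\-]+>(?!.*</?[a-zA-Z0-9\-]+>)')

SAMPLE = ('<HTMLResource><![CDATA[<a href="http://google.com"/></a>]]>'
          '</HTMLResource>')


def extract(pattern, serialized):
    """Return group 1 of the match, or None when nothing matches."""
    mo = pattern.match(serialized)
    return mo.group(1) if mo else None


original_value = extract(ORIGINAL_PAT, SAMPLE)
modified_value = extract(MODIFIED_PAT, SAMPLE)

print(original_value)   # cut short inside the CDATA section
print(modified_value)   # keeps the closing "]]>"
```

The lazy group in the original pattern stops at the first tag-like text it can find, which is why HTML inside the CDATA cuts the value short.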

There is bound to be a more elegant way of writing the regex, but this worked
for me.
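
One possible alternative, offered only as a sketch and not as what generateDS does, is to skip the regex and slice the serialized string between the end of the opening tag and the start of the last closing tag. The helper name inner_xml() is hypothetical. It assumes the string holds exactly one serialized element with no trailing tail text, and that the opening tag has no '>' inside an attribute value.

```python
def inner_xml(serialized):
    """Return the text between the outermost opening and closing tags.

    Hypothetical helper for illustration; assumes one serialized element,
    no tail text, and no '>' inside the opening tag's attribute values.
    """
    start = serialized.index('>') + 1   # just past the opening tag
    end = serialized.rindex('<')        # start of the final closing tag
    if end < start:
        return ''                       # self-closing element such as <a/>
    return serialized[start:end]


sample = ('<HTMLResource><![CDATA[<a href="http://google.com"/></a>]]>'
          '</HTMLResource>')
print(inner_xml(sample))
```

Because it does not rely on '.' matching, this also copes with content that spans several lines.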

Let me know if I've not been clear and/or you want more information.

Cheers

Adrian






On Sun, Apr 26, 2015 at 1:07 PM, Dave Kuhlman <dkuhl...@davekuhlman.org>
wrote:

> On Fri, Apr 24, 2015 at 04:32:17PM +1000, Adrian Cook wrote:
> > Hi Dave Kuhlman and the list,
> >
> > Thanks so much for generateDS, I've been using a generateDS parser
> > successfully on millions of XML files.
> >
> > I have a question regarding CDATA.
> >
> > I'm developing in an environment where I am processing thousands of XML
> > files per minute and I've been using generateDS to create the parser for
> > processing. These files all contain a lot of CDATA and are from third
> > parties.
> >
> > It's been fine up to now because I am only using generateDS to parse the
> > xml and make decisions based on that. I now have a requirement to mutate
> > the data loaded using parser and export new XML.
> >
> > What I am finding is that the CDATA start and end markup is lost from the
> > exported text.
> >
> > I've pasted an example at the bottom. This is all pretty much vanilla use
> > of generateDS, parse and export using some of the unit test XSD and XML
> > files.
> >
> > I've read through the list archive and noted a correspondence where this
> > is mentioned.
> >
> > Are there any plans or approaches to address this in generateDS? I can
> > see some comments in the list archive where this issue is mentioned and
> > that also indicate CDATA is a poor decision and should be avoided.
> > Unfortunately, I cannot change the dependence of my system on CDATA.
>
> Adrian,
>
> Good to hear from you.  I'm glad that generateDS has been useful.
>
> Short story first --
>
> I've patched generateDS.py so that it supports this.  Specifically, if you
> run generateDS.py with the new command line option
> "--preserve-cdata-tags", then the generated code will preserve the
> CDATA tags, that is the resulting string values will contain
> "<![CDATA[" and "]]>".  And, if you do *not* use the
> "--preserve-cdata-tags" command line option, then the behavior is
> unchanged.
>
> I've attached a patched version of generateDS.py in a separate
> email.  Please let me know if this does what you expect and need.
>
> I'm going on vacation for 3 days next week.  I've got a chance to go
> car camping on the north California, USA coast near Ft. Bragg.
> But, I'll look into this some more when I return.
>
> In the meantime, thanks for reporting this.
>
> And now, the long story -- You can ignore the following unless you
> want to learn more (maybe) about CDATA and my thinking while trying
> to work my way through this.
>
> So, let's try to be (pedantically) specific about what the problem
> is:
>
> 1. generateDS handles CDATA on import/parsing (actually lxml does
>    this for us).  Good.
>
> 2. generateDS handles text on output/export even when there are
>    special characters in CDATA sections by escaping those special
>    characters as XML entities (e.g. "&lt;").  Good.
>
> 3. generateDS does *not* preserve CDATA sections on output/export.
>    Bad for some applications.
>
> There are difficulties with handling item 3, above.  Lxml normally
> throws away the CDATA tags when it parses a document.  I thought
> there was no way around this.  However, while thinking about your
> question, this morning, I decided to do one more Web search.
> Actually, George David, another list member who has done some work
> on this, had earlier pointed me at this ability, but I did not read
> carefully enough.  Anyway, I found that there is a way to preserve
> those CDATA tags by creating a special parser:
>
>     from lxml import etree
>
>     def test():
>         # strip_cdata=False tells the parser to keep CDATA sections
>         p = etree.XMLParser(strip_cdata=False)
>         d1 = etree.parse('test01.xml', parser=p)
>         r1 = d1.getroot()
>         print(etree.tostring(r1, encoding='unicode'))
>
>     test()
>
> That would seem to suggest that all we have to do in the generated
> code is to create a special "strip_cdata=False" parser, which would
> be a simple 1 or 2 line change.  But ...
>
> There is still a problem.  The only way to get the text with the
> CDATA tags included is to serialize the element, and when you do so,
> you get the surrounding XML tags as well.  For example, with the
> sample data that you include below, when you do this:
>
>     etree.tostring(element)
>
> we'd get something like:
>
>     <script><![CDATA[ccc < ddd & eee]]></script>
>
> So, in order to capture the text *with* CDATA tags, we'd have to do
> something like the following:
>
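>     import re
>     # encoding='unicode' returns text, so the str pattern below applies
>     value1 = etree.tostring(element, encoding='unicode').strip()
>     mo = re.search(r'^<.+?>(.*)<.+>$', value1)
>     value2 = mo.group(1)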
>
> Ick.  Maybe even: Yuck.
>
> OK.  I'm over-reacting.  And, it can be made prettier by
> pre-compiling the regular expression.
>
> I'd rather not make that general change, since this feature is
> seldom needed, I believe.  Most users will not want to deal with the
> "<![CDATA[" and "]]>".
>
> We could add (yet) another command line option to turn on this special
> behavior.
>
> OK.  I gave it a shot.  Added the command line option
> ("--preserve-cdata-tags").  Seems to work, but definitely needs more
> testing.  I've attached a patched version of generateDS.py (in a
> separate email so as not to shove a large email into the list).
>
> Memo to Dave -- From now on, do less whining, and write more code.
> Although, it is a good idea to think these things through, first.
>
> Let me know if you really do need this behavior.  Also, let me know
> if I've really implemented the behavior that you need.  Then, I'll
> work on it a bit more, do more testing, create a unit test, etc.
>
> And, you can find more information about handling CDATA sections
> here:
>
> - http://lxml.de/api.html#cdata
>
> - http://lxml.de/parsing.html#parser-options
>
> - http://lxml.de/FAQ.html#parsing-and-serialisation
>
> - https://mailman-mail5.webfaction.com/pipermail/lxml/2015-February/007409.html
>
>   This comment at the end of the above email thread:
>
>      "I wouldn't bother. CDATA[] is more of a convenience
>       work-around when you are manually editing XML. In generated
>       XML, it's not very useful."
>
>   should be a caution to us not to get too enthusiastic about using
>   CDATA sections, although it sounds like in your case, they're needed.
>
> Sorry for being so wordy.  I needed to get myself to think this
> through.
>
> Dave
>
> >
> > Thanks in advance for any pointers.
> >
> > Adrian Cook
> >
> > Source XML:
> >
> > <cdataListType>
> >     <cdatalist>
> >         <script><![CDATA[ccc < ddd & eee]]></script>
> >     </cdatalist>
> >     <cdatalist>
> >         <script>aaa &lt; bbb <![CDATA[ccc < ddd]]> eee &lt; &amp;
> > fff&lt;<![CDATA[ggg < & hhh]]>&amp; iii &lt; jjj</script>
> >     </cdatalist>
> > </cdataListType>
> >
> > After export:
> >
> > <cdataListType>
> >     <cdatalist>
> >         <script>ccc &lt; ddd &amp; eee</script>
> >     </cdatalist>
> >     <cdatalist>
> >         <script>aaa &lt; bbb ccc &lt; ddd eee &lt; &amp; fff&lt;ggg &lt;
> > &amp; hhh&amp; iii &lt; jjj</script>
> >     </cdatalist>
> > </cdataListType>
>
> --
>
> Dave Kuhlman
> http://www.davekuhlman.org
>
>
------------------------------------------------------------------------------
_______________________________________________
generateds-users mailing list
generateds-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/generateds-users