On Mon, Apr 27, 2015 at 01:10:03PM +1000, Adrian Cook wrote:
> Hi Dave, attached is a test case that exhibits the missing CDATA.
> 
> Cheers
> 

Adrian,

Thanks for those test cases.  That was very helpful.

I believe I have a fix.  It seems to do what we want it to do with
vast3_draft.xsd and sample.xml.

I've attached a patch, which I believe you can apply against that
last fix I sent to you.  But, it might be easier to get the complete file
(generateDS.py) at Bitbucket:

    https://bitbucket.org/dkuhlman/generateds

There is still one, unrelated issue with this schema and the example
XML instance doc (sample.xml) -- sample.xml contains the following
(after a bit of pretty-printing with xmllint):

    <Extensions>
        <Extension type="LR-Pricing">
            <Price model="CPM" currency="USD" source="spotxchange"><![CDATA[1]]></Price>
        </Extension>
        <Extension type="SpotX-Count">
            <total_available><![CDATA[1]]></total_available>
        </Extension>
    </Extensions>

And, that element type is defined by this in vast3_draft.xsd:

    <xs:complexType name ="Extensions_type">
      <xs:sequence>
        <xs:element name="Extension" minOccurs="0" maxOccurs="unbounded">
          <xs:annotation>
            <xs:documentation>Any valid XML may be included in the Extensions node</xs:documentation>
          </xs:annotation>
          <xs:complexType>
            <xs:sequence>
              <xs:any minOccurs="0" maxOccurs="unbounded" processContents="lax" namespace="##any" />
            </xs:sequence>
            <xs:anyAttribute namespace="##any" processContents="lax" />
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>

But, generateDS.py cannot handle the xs:any.  It needs to know what
type of element it is so that it can build it using the Python class
that was generated from that type definition.  The following section
of the documentation might give you some help with handling that:

  http://www.davekuhlman.org/generateDS.html#support-for-xs-any
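
In the meantime, if you only need to read what is inside those
Extension elements, here is a minimal sketch of walking the untyped
children generically.  It is not what generateDS.py generates, and the
names SAMPLE and describe_extensions are made up for illustration.  It
uses only the standard library's ElementTree, which merges CDATA
sections into ordinary text, so it does not preserve the CDATA tags:

```python
import xml.etree.ElementTree as ET

# Fragment taken from sample.xml as quoted above.
SAMPLE = """\
<Extensions>
    <Extension type="LR-Pricing">
        <Price model="CPM" currency="USD" source="spotxchange"><![CDATA[1]]></Price>
    </Extension>
    <Extension type="SpotX-Count">
        <total_available><![CDATA[1]]></total_available>
    </Extension>
</Extensions>
"""


def describe_extensions(xml_text):
    """Return (extension type, child tag, child attributes, child text)
    for every child of every Extension element, without assuming any
    particular child element type."""
    root = ET.fromstring(xml_text)
    found = []
    for ext in root.findall("Extension"):
        for child in ext:
            found.append(
                (ext.get("type"), child.tag, dict(child.attrib), child.text))
    return found


for item in describe_extensions(SAMPLE):
    print(item)
```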

Let me know whether this patch works for you and whether you find
additional problems.

Dave

> 
> 
> On Mon, Apr 27, 2015 at 1:10 PM, Adrian Cook <adr...@wildfire.com.au> wrote:
> 
> > Hi Dave,
> >
> > Thanks so much for the update.
> >
> > I've built a new parser using the supplied patch.
> >
> > For the simple case it's working fine, but for more complex XSDs it seems to
> > be hit and miss as to whether the CDATA is carried over from parse to export.
> >
> > I've put together a zip of a test case and emailed it to you separately.
> >
> > There are simple scripts to create the parser and run it.  The output of
> > the run is the result of an export, which you can then compare to the input.
> >
> > I've not been able to identify exactly what the issue is but I'll keep
> > looking at it.
> >
> > Cheers
> >
> > Adrian
> >
> >
> >
> > On Sun, Apr 26, 2015 at 1:07 PM, Dave Kuhlman <dkuhl...@davekuhlman.org>
> > wrote:
> >
> >> On Fri, Apr 24, 2015 at 04:32:17PM +1000, Adrian Cook wrote:
> >> > Hi Dave Kuhlman and the list,
> >> >
> >> > Thanks so much for generateDS, I've been using a generateDS parser
> >> > successfully on millions of XML files.
> >> >
> >> > I have a question regarding CDATA.
> >> >
> >> > I'm developing in an environment where I am processing thousands of XML
> >> > files per minute and I've been using generateDS to create the parser for
> >> > processing. These files all contain a lot of CDATA and are from third
> >> > parties.
> >> >
> >> > It's been fine up to now because I am only using generateDS to parse the
> >> > XML and make decisions based on that. I now have a requirement to mutate
> >> > the data loaded using the parser and export new XML.
> >> >
> >> > What I am finding is that the CDATA start and end markup is lost from the
> >> > exported text.
> >> >
> >> > I've pasted an example at the bottom. This is all pretty much vanilla use
> >> > of generateDS, parse and export using some of the unit test XSD and XML
> >> > files.
> >> >
> >> > I've read through the list archive and noted a correspondence where this is
> >> > mentioned.
> >> >
> >> > Are there any plans or approaches to address this in generateDS? I can see
> >> > some comments in the list archive where this issue is mentioned and that
> >> > also indicate CDATA is a poor decision and should be avoided. Unfortunately,
> >> > I cannot change the dependence of my system on CDATA.
> >>
> >> Adrian,
> >>
> >> Good to hear from you.  I'm glad that generateDS has been useful.
> >>
> >> Short story first --
> >>
> >> I've patched generateDS.py so that it supports this.  Specifically, if you
> >> run generateDS.py with the new command line option
> >> "--preserve-cdata-tags", then the generated code will preserve the
> >> CDATA tags; that is, the resulting string values will contain
> >> "<![CDATA[" and "]]>".  And, if you do *not* use the
> >> "--preserve-cdata-tags" command line option, then the behavior is
> >> unchanged.
> >>
> >> I've attached a patched version of generateDS.py in a separate
> >> email.  Please let me know if this does what you expect and need.
> >>
> >> I'm going on vacation for 3 days next week.  I've got a chance to go
> >> car camping on the northern California coast near Ft. Bragg.
> >> But, I'll look into this some more when I return.
> >>
> >> In the meantime, thanks for reporting this.
> >>
> >> And now, the long story -- You can ignore the following unless you
> >> want to learn more (maybe) about CDATA and my thinking while trying
> >> to work my way through this.
> >>
> >> So, let's try to be (pedantically) specific about what the problem
> >> is:
> >>
> >> 1. generateDS handles CDATA on import/parsing (actually lxml does
> >>    this for us).  Good.
> >>
> >> 2. generateDS handles text on output/export even when there are
> >>    special characters in CDATA sections by escaping those special
> >>    characters as XML entities (e.g. "&lt;").  Good.
> >>
> >> 3. generateDS does *not* preserve CDATA sections on output/export.
> >>    Bad for some applications.
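
To make items 2 and 3 concrete, here is a minimal sketch.  It uses
only the Python standard library rather than generateDS or lxml, and
the variable names are just for illustration.  It shows that the
escaped form carries the same text as the CDATA form, and that the
CDATA markup itself does not survive the round trip:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# First <script> element from the sample data at the bottom of this message.
cdata_form = "<script><![CDATA[ccc < ddd & eee]]></script>"

# Parsing folds the CDATA section into ordinary text (item 1).
text = ET.fromstring(cdata_form).text

# Exporting by escaping the special characters (item 2) produces a
# different byte sequence with no CDATA markup (item 3).
escaped_form = "<script>%s</script>" % escape(text)

# Parsing the escaped form yields exactly the same text again.
round_trip = ET.fromstring(escaped_form).text

print(escaped_form)
print(round_trip == text)
```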
> >>
> >> There are difficulties with handling item 3, above.  Lxml normally
> >> throws away the CDATA tags when it parses a document.  I thought
> >> there was no way around this.  However, while thinking about your
> >> question, this morning, I decided to do one more Web search.
> >> Actually, George David, another list member who has done some work
> >> on this, had earlier pointed me at this ability, but I did not read
> >> carefully enough.  Anyway, I found that there is a way to preserve
> >> those CDATA tags by creating a special parser:
> >>
> >>     from lxml import etree
> >>
> >>     def test():
> >>         p = etree.XMLParser(strip_cdata=False)
> >>         d1 = etree.parse('test01.xml', parser=p)
> >>         r1 = d1.getroot()
> >>         print(etree.tostring(r1, encoding='unicode'))
> >>
> >>     test()
> >>
> >> That would seem to suggest that all we have to do in the generated
> >> code is to create a special "strip_cdata=False" parser, which would
> >> be a simple 1 or 2 line change.  But ...
> >>
> >> There is still a problem.  The only way to get the text with the
> >> CDATA tags included is to serialize the element, and when you do so,
> >> you get the surrounding XML tags as well.  For example, with the
> >> sample data that you include below, when you do this:
> >>
> >>     etree.tostring(element)
> >>
> >> we'd get something like:
> >>
> >>     <script><![CDATA[ccc < ddd & eee]]></script>
> >>
> >> So, in order to capture the text *with* CDATA tags, we'd have to do
> >> something like the following:
> >>
> >>     import re
> >>     value1 = etree.tostring(element, encoding='unicode').strip()
> >>     mo = re.search(r'^<.+?>(.*)<.+>$', value1)
> >>     value2 = mo.group(1)
> >>
> >> Ick.  Maybe even: Yuck.
> >>
> >> OK.  I'm over-reacting.  And, it can be made prettier by
> >> pre-compiling the regular expression.
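
A minimal sketch of that pre-compiled variant follows.  It works on an
already-serialized string, so it needs only the standard library.  The
names EXTRACT_PAT and inner_text are hypothetical, and the re.DOTALL
flag is an addition of this sketch (so that text spanning several lines
still matches), not something taken from the patch:

```python
import re

# Compiled once at import time instead of on every call.
# Group 1 is everything between the first start tag and the last tag.
EXTRACT_PAT = re.compile(r'^<.+?>(.*)<.+>$', re.DOTALL)


def inner_text(serialized):
    """Return the content of a serialized element, CDATA tags included.

    Returns an empty string when the input does not look like
    '<tag ...>content</tag>'.
    """
    mo = EXTRACT_PAT.search(serialized.strip())
    if mo is None:
        return ''
    return mo.group(1)


print(inner_text('<script><![CDATA[ccc < ddd & eee]]></script>'))
```

Like the inline version, this assumes that the serialized element has
no '>' inside an attribute value of its start tag.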
> >>
> >> I'd rather not make that general change, since this feature is
> >> seldom needed, I believe.  Most users will not want to deal with the
> >> "<![CDATA[" and "]]>".
> >>
> >> We could add (yet) another command line option to turn on this special
> >> behavior.
> >>
> >> OK.  I gave it a shot.  Added the command line option
> >> ("--preserve-cdata-tags").  Seems to work, but definitely needs more
> >> testing.  I've attached a patched version of generateDS.py (in a
> >> separate email so as not to shove a large email into the list).
> >>
> >> Memo to Dave -- From now on, do less whining, and write more code.
> >> Although, it is a good idea to think these things through, first.
> >>
> >> Let me know if you really do need this behavior.  Also, let me know
> >> if I've really implemented the behavior that you need.  Then, I'll
> >> work on it a bit more, do more testing, create a unit test, etc.
> >>
> >> And, you can find more information about handling CDATA sections
> >> here:
> >>
> >> - http://lxml.de/api.html#cdata
> >>
> >> - http://lxml.de/parsing.html#parser-options
> >>
> >> - http://lxml.de/FAQ.html#parsing-and-serialisation
> >>
> >> - https://mailman-mail5.webfaction.com/pipermail/lxml/2015-February/007409.html
> >>
> >>   This comment at the end of the above email thread:
> >>
> >>      "I wouldn't bother. CDATA[] is more of a convenience
> >>       work-around when you are manually editing XML. In generated
> >>       XML, it's not very useful."
> >>
> >>   should be a caution to us not to get too enthusiastic about using
> >>   CDATA sections, although it sounds like in your case, they're needed.
> >>
> >> Sorry, for being so wordy.  I needed to get myself to think this
> >> through.
> >>
> >> Dave
> >>
> >> >
> >> > Thanks in advance for any pointers.
> >> >
> >> > Adrian Cook
> >> >
> >> > Source XML:
> >> >
> >> > <cdataListType>
> >> >     <cdatalist>
> >> >         <script><![CDATA[ccc < ddd & eee]]></script>
> >> >     </cdatalist>
> >> >     <cdatalist>
> >> >         <script>aaa &lt; bbb <![CDATA[ccc < ddd]]> eee &lt; &amp;
> >> > fff&lt;<![CDATA[ggg < & hhh]]>&amp; iii &lt; jjj</script>
> >> >     </cdatalist>
> >> > </cdataListType>
> >> >
> >> > After export:
> >> >
> >> > <cdataListType>
> >> >     <cdatalist>
> >> >         <script>ccc &lt; ddd &amp; eee</script>
> >> >     </cdatalist>
> >> >     <cdatalist>
> >> >         <script>aaa &lt; bbb ccc &lt; ddd eee &lt; &amp; fff&lt;ggg &lt;
> >> > &amp; hhh&amp; iii &lt; jjj</script>
> >> >     </cdatalist>
> >> > </cdataListType>
> >>
> >> --
> >>
> >> Dave Kuhlman
> >> http://www.davekuhlman.org
> >>
> >>
> >
-- 

Dave Kuhlman
http://www.davekuhlman.org
diff -r 049779b81747 generateDS.py
--- a/generateDS.py     Sat Apr 25 08:37:06 2015 -0700
+++ b/generateDS.py     Tue May 05 21:19:00 2015 -0700
@@ -5207,16 +5207,7 @@
             return '\"\"\"%%s\"\"\"' %% s1
 
 
-def get_all_text_(node):
-    if node.text is not None:
-        text = node.text
-    else:
-        text = ''
-    for child in node:
-        if child.tail is not None:
-            text += child.tail
-    return text
-
+%s
 
 def find_attr_value_(attr_name, node):
     attrs = node.attrib
@@ -5409,6 +5400,40 @@
     return options1, args1, command_line
 
 
+Preserve_cdata_get_all_text1 = """\
+PRESERVE_CDATA_TAGS_PAT1 = re_.compile(r'^<.+?>(.*?)</?[a-zA-Z0-9\-]+>.*$')
+PRESERVE_CDATA_TAGS_PAT2 = re_.compile(r'^<.+?>.*?</.+?>(.*)$')
+
+
+def get_all_text_(node):
+    if node.text is not None:
+        mo_ = PRESERVE_CDATA_TAGS_PAT1.search(etree_.tostring(node).strip())
+        if mo_ is not None:
+            text = mo_.group(1)
+    else:
+        text = ''
+    for child in node:
+        if child.tail is not None:
+            mo_ = PRESERVE_CDATA_TAGS_PAT2.search(
+                etree_.tostring(child).strip())
+            if mo_ is not None:
+                text += mo_.group(1)
+    return text
+"""
+
+Preserve_cdata_get_all_text2 = """\
+def get_all_text_(node):
+    if node.text is not None:
+        text = node.text
+    else:
+        text = ''
+    for child in node:
+        if child.tail is not None:
+            text += child.tail
+    return text
+"""
+
+
 def generateHeader(wrt, prefix, options, args, externalImports):
     tstamp = (not NoDates and time.ctime()) or ''
     if NoVersion:
@@ -5419,15 +5444,18 @@
     current_working_directory = os.path.split(os.getcwd())[1]
     if PreserveCdataTags:
         preserve_cdata_tags_pat = \
-            "PRESERVE_CDATA_TAGS_PAT = re_.compile(r'^<.+?>(.*)<.+>$')\n"
+            "PRESERVE_CDATA_TAGS_PAT = re_.compile(r'^<.+?>(.*)<.+>$')\n\n"
+        preserve_cdata_get_text = Preserve_cdata_get_all_text1
     else:
         preserve_cdata_tags_pat = ""
+        preserve_cdata_get_text = Preserve_cdata_get_all_text2
     s1 = TEMPLATE_HEADER % (
         tstamp, version,
         options1, args1,
         command_line, current_working_directory,
         ExternalEncoding,
         preserve_cdata_tags_pat,
+        preserve_cdata_get_text,
     )
     wrt(s1)
     for externalImport in externalImports: