On Mon, Apr 27, 2015 at 01:10:03PM +1000, Adrian Cook wrote:
>    Hi Dave,
>    Thanks so much for the update.
>    I've built a new parser using the supplied patch.
>    For the simple case it's working fine, for more complex XSDs it seems
>    to be hit and miss as to whether the CDATA is held over from parse to
>    export.
>    I've put together a zip of a test case and emailed it to you
>    separately.
>    There are simple scripts to create the parser and run it, the output of
>    the run is the result of an export, you can then compare that to the
>    input.
>    I've not been able to identify exactly what the issue is but I'll keep
>    looking at it.

Adrian,

Thanks for the test files that I can work with.  As I mentioned,
I'll be out of town for most of the week.  I'll get on this when I
return on Friday.  If you find something that you think is a clue,
let me know.  There is a good chance that I have not covered all the
possible cases, yet.  For example, there may be an additional
simpleType that we need to handle.

Dave

>    Cheers
>    Adrian
> 
>    On Sun, Apr 26, 2015 at 1:07 PM, Dave Kuhlman
>    <[1]dkuhl...@davekuhlman.org> wrote:
> 
>      On Fri, Apr 24, 2015 at 04:32:17PM +1000, Adrian Cook wrote:
>      > Hi Dave Kuhlman and the list,
>      >
>      > Thanks so much for generateDS, I've been using a generateDS parser
>      > successfully on millions of XML files.
>      >
>      > I have a question regarding CDATA.
>      >
>      > I'm developing in an environment where I am processing thousands of
>      > XML files per minute and I've been using generateDS to create the
>      > parser for processing. These files all contain a lot of CDATA and
>      > are from third parties.
>      >
>      > It's been fine up to now because I am only using generateDS to parse
>      > the XML and make decisions based on that. I now have a requirement
>      > to mutate the data loaded using the parser and export new XML.
>      >
>      > What I am finding is that the CDATA start and end markup is lost
>      > from the exported text.
>      >
>      > I've pasted an example at the bottom. This is all pretty much
>      > vanilla use of generateDS, parse and export using some of the unit
>      > test XSD and XML files.
>      >
>      > I've read through the list archive and noted a correspondence where
>      > this is mentioned.
>      >
>      > Are there any plans or approaches to address this in generateDS? I
>      > can see some comments in the list archive where this issue is
>      > mentioned and that also indicate CDATA is a poor decision and should
>      > be avoided. Unfortunately, I cannot change the dependence of my
>      > system on CDATA.
>      Adrian,
>      Good to hear from you.  I'm glad that generateDS has been useful.
>      Short story first --
>      I've patched generateDS so that it has support for this.
>      Specifically, if you run generateDS.py with the new command line
>      option "--preserve-cdata-tags", then the generated code will preserve
>      the CDATA tags, that is, the resulting string values will contain
>      "<![CDATA[" and "]]>".  And, if you do *not* use the
>      "--preserve-cdata-tags" command line option, then the behavior is
>      unchanged.
>      I've attached a patched version of generateDS.py in a separate
>      email.  Please let me know if this does what you expect and need.
>      I'm going on vacation for 3 days next week.  I've got a chance to
>      go car camping on the northern California, USA coast near Ft. Bragg.
>      But, I'll look into this some more when I return.
>      In the meantime, thanks for reporting this.
>      And now, the long story -- You can ignore the following unless you
>      want to learn more (maybe) about CDATA and my thinking while trying
>      to work my way through this.
>      So, let's try to be (pedantically) specific about what the problem
>      is:
>      1. generateDS handles CDATA on import/parsing (actually lxml does
>         this for us).  Good.
>      2. generateDS handles text on output/export even when there are
>         special characters in CDATA sections by escaping those special
>         characters as XML entities (e.g. "&lt;").  Good.
>      3. generateDS does *not* preserve CDATA sections on output/export.
>         Bad for some applications.
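
Points 2 and 3 in the list above can be reproduced in a few lines.  The
following is a minimal sketch that uses the standard library's
xml.etree.ElementTree as a stand-in for lxml (an assumption made only so
that the example runs without extra packages; generateDS itself uses
lxml).  It parses one of the sample elements from this thread and shows
that the CDATA wrapper is gone after parsing and that the special
characters come back as entities on serialization.

```python
import xml.etree.ElementTree as ET

# Sample element taken from the test data quoted in this thread.
source = '<script><![CDATA[ccc < ddd & eee]]></script>'

# Parsing unwraps the CDATA section: only the plain text is kept.
element = ET.fromstring(source)
text = element.text

# Serializing escapes the special characters as XML entities; the
# CDATA start and end markers do not come back.
exported = ET.tostring(element, encoding='unicode')

print(text)      # ccc < ddd & eee
print(exported)  # <script>ccc &lt; ddd &amp; eee</script>
```

The exported form is still equivalent XML, which is why this only matters
to applications that need the CDATA markup itself to survive.
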
>      There are difficulties with handling item 3, above.  Lxml normally
>      throws away the CDATA tags when it parses a document.  I thought
>      there was no way around this.  However, while thinking about your
>      question this morning, I decided to do one more Web search.
>      Actually, George David, another list member who has done some work
>      on this, had earlier pointed me at this ability, but I did not read
>      carefully enough.  Anyway, I found that there is a way to preserve
>      those CDATA tags by creating a special parser:
>          from lxml import etree
>          def test():
>              # strip_cdata=False keeps CDATA sections instead of
>              # replacing them with plain text.
>              p = etree.XMLParser(strip_cdata=False)
>              d1 = etree.parse('test01.xml', parser=p)
>              r1 = d1.getroot()
>              print(etree.tostring(r1))
>          test()
>      That would seem to suggest that all we have to do in the generated
>      code is to create a special "strip_cdata=False" parser, which would
>      be a simple 1 or 2 line change.  But ...
>      There is still a problem.  The only way to get the text with the
>      CDATA tags included is to serialize the element, and when you do so,
>      you get the surrounding XML tags as well.  For example, with the
>      sample data that you include below, when you do this:
>          etree.tostring(element)
>      we'd get something like:
>          <script><![CDATA[ccc < ddd & eee]]></script>
>      So, in order to capture the text *with* CDATA tags, we'd have to do
>      something like the following:
>          value1 = etree.tostring(element).strip()
>          mo = re.search(r'^<.+?>(.*)<.+>$', value1)
>          value2 = mo.group(1)
>      Ick.  Maybe even: Yuck.
>      OK.  I'm over-reacting.  And, it can be made prettier by
>      pre-compiling the regular expression.
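
A minimal sketch of what the pre-compiled version might look like.  The
names _INNER_TEXT_RE and inner_text_with_cdata are hypothetical and are
not taken from generateDS.py, and the re.DOTALL flag is an addition so
that text spanning several lines still matches.  To keep the example free
of extra packages, it works on a string that has already been serialized
rather than calling etree.tostring() itself.

```python
import re

# Same pattern as in the snippet above, compiled once at import time.
# re.DOTALL (an addition, not in the original snippet) lets "." match
# newlines, so multi-line element text is captured too.
_INNER_TEXT_RE = re.compile(r'^<.+?>(.*)<.+>$', re.DOTALL)


def inner_text_with_cdata(serialized):
    """Return the text between the outermost start and end tags.

    Returns None when the pattern does not match, for example for an
    empty element such as <script/>.
    """
    mo = _INNER_TEXT_RE.search(serialized.strip())
    if mo is None:
        return None
    return mo.group(1)


sample = '<script><![CDATA[ccc < ddd & eee]]></script>'
print(inner_text_with_cdata(sample))  # <![CDATA[ccc < ddd & eee]]>
```

Checking for a failed match avoids an AttributeError on elements that
have no separate end tag.
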
>      I'd rather not make that general change, since this feature is
>      seldom needed, I believe.  Most users will not want to deal with
>      the "<![CDATA[" and "]]>".
>      We could add (yet) another command line option to turn on this
>      special behavior.
>      OK.  I gave it a shot.  Added the command line option
>      ("--preserve-cdata-tags").  Seems to work, but definitely needs
>      more testing.  I've attached a patched version of generateDS.py (in
>      a separate email so as not to shove a large email into the list).
>      Memo to Dave -- From now on, do less whining, and write more code.
>      Although, it is a good idea to think these things through, first.
>      Let me know if you really do need this behavior.  Also, let me know
>      if I've really implemented the behavior that you need.  Then, I'll
>      work on it a bit more, do more testing, create a unit test, etc.
>      And, you can find more information about handling CDATA sections
>      here:
>      - [2]http://lxml.de/api.html#cdata
>      - [3]http://lxml.de/parsing.html#parser-options
>      - [4]http://lxml.de/FAQ.html#parsing-and-serialisation
>      - [5]https://mailman-mail5.webfaction.com/pipermail/lxml/2015-February/007409.html
>        This comment at the end of the above email thread:
>            "I wouldn't bother. CDATA[] is more of a convenience
>            work-around when you are manually editing XML. In generated
>            XML, it's not very useful."
>        should be a caution to us not to get too enthusiastic about using
>        CDATA sections, although it sounds like in your case, it's needed.
>      Sorry for being so wordy.  I needed to get myself to think this
>      through.
>      Dave
> 
>    >
>    > Thanks in advance for any pointers.
>    >
>    > Adrian Cook
>    >
>    > Source XML:
>    >
>    > <cdataListType>
>    >     <cdatalist>
>    >         <script><![CDATA[ccc < ddd & eee]]></script>
>    >     </cdatalist>
>    >     <cdatalist>
>    >         <script>aaa &lt; bbb <![CDATA[ccc < ddd]]> eee &lt;
>    &amp;
>    > fff&lt;<![CDATA[ggg < & hhh]]>&amp; iii &lt; jjj</script>
>    >     </cdatalist>
>    > </cdataListType>
>    >
>    > After export:
>    >
>    > <cdataListType>
>    >     <cdatalist>
>    >         <script>ccc &lt; ddd &amp; eee</script>
>    >     </cdatalist>
>    >     <cdatalist>
>    >         <script>aaa &lt; bbb ccc &lt; ddd eee &lt; &amp;
>    fff&lt;ggg &lt;
>    > &amp; hhh&amp; iii &lt; jjj</script>
>    >     </cdatalist>
>    > </cdataListType>
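
The difference between the source and exported listings above can also
be checked mechanically.  The following is a rough sketch, not part of
generateDS: the helper names cdata_sections and lost_cdata are
hypothetical, and the check only looks at the text of each CDATA section
(it does not track how many times a section occurs).

```python
import re

# Matches one CDATA section and captures its content.  The non-greedy
# ".*?" stops at the first "]]>", and re.DOTALL allows multi-line content.
CDATA_RE = re.compile(r'<!\[CDATA\[(.*?)\]\]>', re.DOTALL)


def cdata_sections(xml_text):
    """Return the contents of all CDATA sections, in document order."""
    return CDATA_RE.findall(xml_text)


def lost_cdata(source_xml, exported_xml):
    """Return CDATA contents present in the source but not in the export."""
    exported = cdata_sections(exported_xml)
    return [s for s in cdata_sections(source_xml) if s not in exported]


source = '<script><![CDATA[ccc < ddd & eee]]></script>'
exported = '<script>ccc &lt; ddd &amp; eee</script>'
print(lost_cdata(source, exported))  # ['ccc < ddd & eee']
```

An empty list would mean that every CDATA section in the input also
appears in the exported document.
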
> 
>      --
>      Dave Kuhlman
>      [6]http://www.davekuhlman.org
> 
> References
> 
>    1. mailto:dkuhl...@davekuhlman.org
>    2. http://lxml.de/api.html#cdata
>    3. http://lxml.de/parsing.html#parser-options
>    4. http://lxml.de/FAQ.html#parsing-and-serialisation
>    5. https://mailman-mail5.webfaction.com/pipermail/lxml/2015-February/007409.html
>    6. http://www.davekuhlman.org/

-- 

Dave Kuhlman
http://www.davekuhlman.org

_______________________________________________
generateds-users mailing list
generateds-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/generateds-users
