Tibor Simko wrote:
On Thu, 22 Apr 2010, Victor Engmark wrote:
from invenio.textutils import encode_for_xml
encode_for_xml('<subfield code="a">Learn about &lt;![CDATA[ lalala ]]&gt; blocks. These are great!</subfield>')
'&lt;subfield code="a"&gt;Learn about &amp;lt;![CDATA[ lalala ]]&amp;gt; blocks. These are great!&lt;/subfield&gt;'
That should do it, no?
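
For comparison, the same escaping can be reproduced with the Python standard library alone. This is only a sketch, assuming Python 3 and xml.sax.saxutils.escape; it is not Invenio's encode_for_xml, whose exact behaviour may differ on other input.

```python
from xml.sax.saxutils import escape

raw = ('<subfield code="a">Learn about &lt;![CDATA[ lalala ]]&gt; blocks. '
       'These are great!</subfield>')

# escape() rewrites only &, < and >; double quotes are left untouched,
# and an already-escaped &lt; becomes &amp;lt; (no double-escape detection).
escaped = escape(raw)
print(escaped)
```

Note that already-escaped entities get escaped a second time, so the function should only be applied to raw text, never to text that has been through it once.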

Let's test the full upload-index-download cycle on a real-life ADS test
record.  Dunno what the CDATA contains in Benoit's use case, or whether
it should be stored `as is' in the Invenio DB tables.  I'd say the
contents should rather be stored dereferenced whenever possible, just
as we dereference \u03A3 into Σ during upload; this makes indexing
etc. simpler.  But maybe the real-life use case is different.
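
To illustrate what "dereferencing" means here, a minimal sketch follows. It assumes Python 3 and four-digit \uXXXX sequences appearing literally in the text; the function name is hypothetical and this is not the code Invenio actually runs during upload.

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_UNICODE_ESCAPE = re.compile(r'\\u([0-9A-Fa-f]{4})')

def dereference_unicode_escapes(text):
    """Replace literal \\uXXXX sequences with the characters they name."""
    return _UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

print(dereference_unicode_escapes(r'The \u03A3 operator'))
```

Storing the dereferenced character means the indexer sees Σ directly and does not need to know about the escape syntax.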

After browsing through the records that cause problems, I found 3 cases:
* full CDATA element (<![CDATA[ lalala ]]>)
* opening CDATA only (<![CDATA[)
* ending CDATA only (]]>)

Any of these cases can use either the ASCII or the Unicode angle-bracket characters (<, >, 〈 and 〉).

I found these cases in title, abstract and references tags.
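
The three cases listed above could be detected along these lines. This is a rough sketch in Python 3, not something taken from the ADS or Invenio code: the function name is hypothetical, and the Unicode brackets are assumed to be U+2329/U+232A or U+3008/U+3009. Entity-escaped markers such as &lt;![CDATA[ are not covered.

```python
import re

# Assumption: the "Unicode" brackets are U+2329/U+232A or U+3008/U+3009.
LT = '[<\u2329\u3008]'
GT = '[>\u232A\u3009]'

FULL = re.compile(LT + r'!\[CDATA\[.*?\]\]' + GT, re.DOTALL)
OPEN = re.compile(LT + r'!\[CDATA\[')
CLOSE = re.compile(r'\]\]' + GT)

def classify_cdata(text):
    """Return the set of CDATA cases ('full', 'open', 'close') found in text."""
    found = set()
    if FULL.search(text):
        found.add('full')
    # Remove complete sections first so that only stray markers remain.
    rest = FULL.sub('', text)
    if OPEN.search(rest):
        found.add('open')
    if CLOSE.search(rest):
        found.add('close')
    return found

print(classify_cdata('special <![CDATA[markups]]> and <![CDATA[malformed.'))
```

Running this over the title, abstract and reference subfields would give a per-record list of which cases need attention.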

Benoit, can you please send a concrete MARCXML snippet to be loaded, and
specify how you would like to see it stored/indexed/exported?
Alternatively, please just take the XML encoding bits from Victor's
public branch (vengmark/webtag), and test on your own end.

It seems that the desired behavior is to make sure that no XML error is produced, not necessarily to remove the CDATA syntax. Removing the CDATA elements themselves is a separate matter, on which Carolyn needs to reason on a per-case basis.
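
As a sketch of that behavior (keep the CDATA text, just avoid the XML error), the subfield value can be escaped before the record is assembled and then checked with a parser. This assumes Python 3 and the standard library only; safe_subfield is a hypothetical helper, not part of Invenio, and it does not escape the code attribute.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def safe_subfield(code, value):
    """Build a <subfield> whose text is escaped but otherwise unchanged."""
    return '<subfield code="%s">%s</subfield>' % (code, escape(value))

title = 'some special <![CDATA[markups]]> and some are <![CDATA[malformed.'
snippet = safe_subfield('a', title)

# The escaped element parses, and the parser hands back the original text,
# CDATA markers included.
element = ET.fromstring(snippet)
assert element.text == title
print(snippet)
```

The CDATA markers survive as literal text, so any cleanup can still be done later, record by record.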

Benoit.
<record>
  <datafield tag="980" ind1="" ind2="">
    <subfield code="a">ASTRONOMY</subfield>
  </datafield>
  <datafield tag="980" ind1="" ind2="">
    <subfield code="a">REFEREED</subfield>
  </datafield>
  <datafield tag="980" ind1="" ind2="">
    <subfield code="a">ARTICLE</subfield>
  </datafield>
  <datafield tag="690" ind1="C" ind2="">
    <subfield code="a">ARTICLE</subfield>
  </datafield>
  <datafield tag="970" ind1="" ind2="">
    <subfield code="a">2009A&amp;A...499....1F</subfield>
  </datafield>
  <datafield tag="037" ind1="" ind2="">
    <subfield code="a">2009A&amp;A...499....1F</subfield>
    <subfield code="9">ADS bibcode</subfield>
  </datafield>
  <datafield tag="037" ind1="" ind2="">
    <subfield code="a">10.1051/0004-6361/200811055</subfield>
    <subfield code="9">DOI</subfield>
  </datafield>
  <datafield tag="037" ind1="" ind2="">
    <subfield code="a">arXiv:0809.5129</subfield>
    <subfield code="9">arXiv</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
      <subfield code="a">The neutrino signal from protoneutron star accretion and black hole formation with some special <![CDATA[markups]]> and some of the are <![CDATA[malformed.</subfield>
  </datafield>
  <datafield tag="100" ind1="" ind2="">
    <subfield code="a">Fischer, T.</subfield>
    <subfield code="b">Fischer, T</subfield>
    <subfield code="t">regular</subfield>
    <subfield code="u">Department of Physics, University of Basel, Klingelbergstrasse 82, 4056 Basel, Switzerland</subfield>
  </datafield>
  <datafield tag="700" ind1="" ind2="">
    <subfield code="a">Whitehouse, S. C.</subfield>
    <subfield code="b">Whitehouse, S</subfield>
    <subfield code="t">regular</subfield>
    <subfield code="u">Department of Physics, University of Basel, Klingelbergstrasse 82, 4056 Basel, Switzerland</subfield>
  </datafield>
  <datafield tag="700" ind1="" ind2="">
    <subfield code="a">Mezzacappa, A.</subfield>
    <subfield code="b">Mezzacappa, A</subfield>
    <subfield code="t">regular</subfield>
    <subfield code="u">Physics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831-1200, USA</subfield>
  </datafield>
  <datafield tag="700" ind1="" ind2="">
    <subfield code="a">Thielemann, F.-K.</subfield>
    <subfield code="b">Thielemann, F</subfield>
    <subfield code="t">regular</subfield>
    <subfield code="u">Department of Physics, University of Basel, Klingelbergstrasse 82, 4056 Basel, Switzerland</subfield>
  </datafield>
  <datafield tag="700" ind1="" ind2="">
    <subfield code="a">Liebendörfer, M.</subfield>
    <subfield code="b">Liebendoerfer, M</subfield>
    <subfield code="t">regular</subfield>
    <subfield code="u">Department of Physics, University of Basel, Klingelbergstrasse 82, 4056 Basel, Switzerland</subfield>
  </datafield>
  <datafield tag="773" ind1="" ind2="">
    <subfield code="p">Astronomy and Astrophysics</subfield>
    <subfield code="v">499</subfield>
    <subfield code="c">1-15</subfield>
    <subfield code="y">2009</subfield>
    <subfield code="f">Astronomy and Astrophysics, Volume 499, Issue 1, 2009, pp.1-15</subfield>
  </datafield>
  <datafield tag="909" ind1="C" ind2="4">
    <subfield code="p">Astronomy and Astrophysics</subfield>
    <subfield code="v">499</subfield>
    <subfield code="c">1-15</subfield>
    <subfield code="y">2009</subfield>
  </datafield>
  <datafield tag="269" ind1="" ind2="">
    <subfield code="c">2009-05-00</subfield>
  </datafield>
  <datafield tag="260" ind1="" ind2="">
    <subfield code="c">2009</subfield>
  </datafield>
  <datafield tag="520" ind1="" ind2="">
      <subfield code="a">Context: We discuss the formation of stellar mass black holes via protoneutron star (PNS) collapse. In the absence of an earlier explosion, the PNS collapses to a black hole due to the continued mass accretion onto the PNS. We present an analysis of the emitted neutrino spectra of all three flavors during the PNS contraction.  &lt;BR /&gt;Aims: Special attention is given to the physical conditions which depend on the input physics, e.g. the equation of state (EoS) and the progenitor model. &lt;BR /&gt;Methods: The PNSs are modeled as the central object in core collapse simulations using general relativistic three-flavor Boltzmann neutrino transport in spherical symmetry. The simulations are launched from several massive progenitors of〈![CDATA[(50)]]〉 M〈![CDATA[(_⊙).</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2="">
    <subfield code="u">http://adsabs.harvard.edu/cgi-bin/nph-data_query?bibcode=2009A%26A...499....1F&amp;link_type=PREPRINT</subfield>
    <subfield code="y">arXiv e-print</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2="">
    <subfield code="u">http://adsabs.harvard.edu/cgi-bin/nph-data_query?bibcode=2009A%26A...499....1F&amp;link_type=EJOURNAL</subfield>
    <subfield code="y">Electronic On-line Article (HTML)</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2="">
    <subfield code="u">http://adsabs.harvard.edu/cgi-bin/nph-data_query?bibcode=2009A%26A...499....1F&amp;link_type=SIMBAD</subfield>
    <subfield code="y">SIMBAD Objects</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2="">
    <subfield code="u">http://adsabs.harvard.edu/cgi-bin/nph-data_query?bibcode=2009A%26A...499....1F&amp;link_type=ARTICLE</subfield>
    <subfield code="y">Full Printable Article (PDF/Postscript)</subfield>
  </datafield>
  <datafield tag="907" ind1="" ind2="">
    <subfield code="a">EDP Sciences</subfield>
  </datafield>
  <datafield tag="999" ind1="C" ind2="5">
    <subfield code="o">[1]</subfield>
    <subfield code="m">Messer, O. E. B., &amp; Bruenn, S. W. 2003, private communications</subfield>
  </datafield>
</record>
