Hi all (again),
I ran some tests and modified the code of "encode_for_xml" in the
following way, and it seems to work fine:

def encode_for_xml(text, wash=False, xml_version='1.0', quote=False):
    """Encodes special characters in a text so that it would be
    XML-compliant.
    @param text: text to encode
    @return: an encoded text"""
    text = text.replace('&', '&amp;')
    text = text.replace('<', '&lt;')
    text = text.replace('>', '&gt;')
    if quote:
        text = text.replace('"', '&quot;')
    if wash:
        text = wash_for_xml(text, xml_version=xml_version)
    return text

As I said before, I don't know why not all of the XML special
characters are escaped, and even this solution looks semantically
wrong to me, because it doesn't follow the W3C specification:
http://www.w3.org/TR/xml/#syntax

A correct function should escape in this way:
"   &quot;
'   &apos;
<   &lt;
>   &gt;
&   &amp;

while the content of a CDATA section should be left unescaped.
Still, at least the XML that is now generated (and stored in bibfmt) is valid.
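
To make the idea concrete, here is a minimal sketch of what I mean by
escaping all five characters while leaving CDATA sections alone. This is
not Invenio code: the name "escape_preserving_cdata" is hypothetical, it
relies only on the standard library, and it assumes that any well-formed
'<![CDATA[ ... ]]>' found in the text should be passed through untouched.

```python
import re
from xml.sax.saxutils import escape

# Matches one complete CDATA section; non-greedy so that two sections
# in the same text are not merged into a single match.
_CDATA_RE = re.compile(r'(<!\[CDATA\[.*?\]\]>)', re.DOTALL)

# escape() handles &, < and > by itself; quotes are passed as extras.
_QUOTES = {'"': '&quot;', "'": '&apos;'}


def escape_preserving_cdata(text):
    """Hypothetical helper: escape the five XML special characters
    everywhere except inside CDATA sections."""
    # Because the pattern has one capturing group, re.split() returns
    # plain text at even indexes and CDATA sections at odd indexes.
    parts = _CDATA_RE.split(text)
    encoded = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            encoded.append(part)  # CDATA section: keep as-is
        else:
            encoded.append(escape(part, _QUOTES))
    return ''.join(encoded)
```

With this, something like 'a < b <![CDATA[ foobar ]]>' would keep the
CDATA section intact and only escape the '<' before "b". An unterminated
'<![CDATA[' is not matched by the pattern, so it would be escaped like
any other text.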

Thanks for your help,
Giovanni


--------------------------------------------------------------
Giovanni Di Milia
IT Specialist at SAO/NASA ADS
Harvard-Smithsonian Center for Astrophysics
60 Garden Street, MS 83
Cambridge, MA 02138 USA
email: [email protected]
--------------------------------------------------------------

On Tue, Oct 30, 2012 at 2:44 PM, Giovanni Di Milia
<[email protected]> wrote:
> Hi all,
> here at ADS we have a problem with some metadata that contain CDATA elements.
> The problem is caused by the export procedure of Invenio that doesn't
> properly encode these elements.
>
> What happens is that all the elements like
> '<![CDATA[ foobar ]]>'
> are converted to
> '&lt;![CDATA[ foobar ]]>'
> and this in XML is an error.
>
> After reading a very similar discussion from 2010 (started by Benoit),
> I suppose that the problem is still in
> invenio.textutils.encode_for_xml()
> which is used in
> bibformat_utils.record_get_xml().
>
> I honestly don't understand why all the tags inside a subfield are
> not escaped (but I suppose there is a good reason) but in case of
> CDATA the tag should be completely escaped.
>
> Thanks for your help,
>
> Giovanni
>
>
>
>
> --------------------------------------------------------------
> Giovanni Di Milia
> IT Specialist at SAO/NASA ADS
> Harvard-Smithsonian Center for Astrophysics
> 60 Garden Street, MS 83
> Cambridge, MA 02138 USA
> email: [email protected]
> --------------------------------------------------------------
