Hi Benoit,
Benoit Thiell wrote:
> '<![CDATA[ lalala ]]>' is converted to '&lt;![CDATA[ lalala ]]>' and
> inserted in the MARCXML representation of the record. The XML parsers
> (all 3 supported by Invenio) then refuse to parse because ']]>' is not a
> valid string in the content.
Interesting. It would seem that the character sequence "]]>" needs to be
escaped, or else the parsers interpret it as the end of a CDATA section,
whatever the context (i.e. even if it has nothing to do with a CDATA
block, as in "[2*[5+2]]>10"). Until now we only escaped `&' and `<';
maybe the time has come to also escape `>'?
> I come here (in peace) because I am unsure about what encode_for_xml()
> should do. Escape the CDATA element completely (by transforming ']]>' to
> ']]&gt;') or leave the CDATA element alone.
For sure CDATA blocks (including their content) should not be escaped,
or else the "<![CDATA[" and "]]>" directives would not be interpreted by
parsers, but would just be seen as part of the text node. Alternatively,
the CDATA markup could be removed, and replaced by its escaped content.
This seems to be the usual behaviour of XML parsers when they encounter
CDATA blocks.
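As a quick illustration of that behaviour, here is a minimal sketch using Python's standard xml.etree.ElementTree. That choice is an assumption made for the example; it is not necessarily one of the parsers configured in Invenio. The same value, written once as a CDATA block and once with escaped content, ends up as the same text node:

```python
import xml.etree.ElementTree as ET

# The same value written once as a CDATA block, once with escaped content.
with_cdata = '<subfield code="a"><![CDATA[a < b & c]]></subfield>'
with_escapes = '<subfield code="a">a &lt; b &amp; c</subfield>'

text_from_cdata = ET.fromstring(with_cdata).text
text_from_escapes = ET.fromstring(with_escapes).text

# After parsing, the CDATA markup is gone and only its content remains,
# so both spellings compare equal.
print(text_from_cdata == text_from_escapes)
```

Once parsed, nothing tells the two spellings apart, which is why I would expect only the interpreted content to reach the tables.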
Both solutions are semantically equivalent, but replacing CDATA blocks
by their escaped contents can sometimes be annoying, especially since
these blocks are usually used because they are needed, for example to
avoid escaping all `<', `>', and `&' characters when writing XML "by
hand" that contains lots of such values.
Still I am quite surprised that there are references to CDATA blocks in
your bibXXX tables: the 3 XML parsers supported by default in
CDS Invenio seem to replace these blocks by their escaped contents.
You should then only see the "interpreted" content in the bibXXX tables,
and should have no problem with the formatting (This is just an
assumption. I have not checked the real behaviour of
bibupload/bibrecord/bibedit).
In your case, unless the values were inserted manually in the DB, these
are probably not real CDATA blocks, but just "mentions" of CDATA. Let me
explain: let's say I am writing a book about XML and CDATA blocks, and I
use a MARCXML file for that (!). I should escape CDATAs every time I
mention them, so that they are not interpreted by my parser.
For example:
<subfield code="a">Learn about &lt;![CDATA[ lalala ]]&gt; blocks. These
are great!</subfield>
Once inserted in the bibXXX tables, and formatted, the above sample
should still look escaped. This is unfortunately not the case with the
current version of CDS Invenio. The above value is correctly stored in a
bibXXX table as "Learn about <![CDATA[ lalala ]]> blocks. These are
great!". But when formatted, the value is:
<subfield code="a">Learn about &lt;![CDATA[ lalala ]]> blocks. These are
great!</subfield>
This raises the problem that you mentioned, leading me to think that
encode_for_xml() should also escape `>' chars.
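To make the difference concrete, here is a minimal sketch. The two helper functions are hypothetical stand-ins written for this example, not the actual encode_for_xml() code, and the parser is Python's standard xml.etree.ElementTree rather than any particular Invenio parser:

```python
import xml.etree.ElementTree as ET


def encode_amp_lt_only(text):
    """Hypothetical stand-in for the current behaviour: escape only & and <."""
    return text.replace("&", "&amp;").replace("<", "&lt;")


def encode_with_gt(text):
    """Hypothetical variant that additionally escapes >."""
    return encode_amp_lt_only(text).replace(">", "&gt;")


def parses(xml_text):
    """Return True if xml_text is accepted as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A value containing "]]>" that has nothing to do with a CDATA block.
value = "[2*[5+2]]>10"
template = '<subfield code="a">%s</subfield>'

old_style = template % encode_amp_lt_only(value)
new_style = template % encode_with_gt(value)

print(parses(old_style))  # the raw "]]>" is left in the text node
print(parses(new_style))  # "]]&gt;" is harmless
print(ET.fromstring(new_style).text == value)  # value round-trips unchanged
```

Escaping every `>' is more than strictly required (only the "]]>" sequence is forbidden in content), but it is simpler than detecting that sequence and the parser decodes it back transparently.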
This also shows that it would be difficult to store real CDATA blocks as
such in the bibXXX tables, since the special chars should be "decoded"
before they are inserted in the bibXXX tables. So if you have "manually"
inserted real CDATA blocks in the DB, you might then want to replace
them by their content. That would also solve the problem of indexing,
which I think does not strip "CDATA" directives, as it does with
other XML markup. Support for CDATA blocks inside the bibXXX tables is
not needed in other cases, as BibRecord (through the parsers it uses)
decodes these blocks.
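If real CDATA blocks did end up in the stored values, replacing them by their content could look roughly like the sketch below. The function name unwrap_cdata is made up for this example, and how the cleaned values would be written back to the bibXXX tables is deliberately left out:

```python
import re

# Matches a whole CDATA block and captures its content. DOTALL lets the
# content span several lines; the non-greedy match stops at the first "]]>".
CDATA_BLOCK = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def unwrap_cdata(value):
    """Replace each CDATA block in a stored (decoded) value by its content."""
    # A function is used as replacement so that backslashes in the content
    # are not interpreted as regex group references.
    return CDATA_BLOCK.sub(lambda match: match.group(1), value)


print(unwrap_cdata("Learn about <![CDATA[ lalala ]]> blocks."))
```

Note that this would also rewrite values that merely "mention" CDATA, as in my book example above, so it only makes sense if the blocks are known to be real ones.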
What do other people think? I have the feeling it is easy to escape too
much, or not enough...
Best regards
--
Jerome Caffaro ** CERN Document Server ** <http://cds.cern.ch/>