Re: CDATA in MarcXML

Tibor Simko Thu, 01 Nov 2012 11:46:22 -0700

On Thu, 01 Nov 2012, Giovanni Di Milia wrote:
> There are a bunch of regular expressions that try to deal with this
> problem, but none of them actually take care of the point 4.


I would not much like to be catching various CDATA peculiarities via
regexps either...  As the famous quote says:

  ``Some people, when confronted with a problem, think "I know, I'll use
    regular expressions." Now they have two problems.''
          -- Jamie Zawinski       

> So my solution is: we simply escape also the ">".

Indeed, if this covers your usage of CDATA, then it may well be the
easiest workaround to pursue.

> I'm not sure how adding a " text = text.replace('>', '&gt;') "  in the
> function "encode_for_xml" impacts the rest of the software

Mostly we should just beware of double-encoding and double-decoding
troubles.  You can try adding the line, then export and re-import a
given record a few times, and see whether its content stays unchanged.

Note that we already have `&gt;' in incoming XML files; see the Poetry
demo records.  An incoming `&gt;' gets stored as `>' in the record after
it is uploaded.  I guess this behaviour suits your needs as well.  If
so, the input direction should be OK.  What remains is the output
direction, which a few export-import-export cycles should reveal.
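
To illustrate the kind of check I mean, here is a minimal sketch.  It
does not use the real export and upload code: export_value(),
import_value() and round_trip() are hypothetical stand-ins built on the
standard library's xml.sax.saxutils, which escapes `&', `<' and `>'.
The real test should of course go through the actual record export and
upload.

```python
from xml.sax.saxutils import escape, unescape


def export_value(value):
    # Hypothetical stand-in for the export step: escape '&', '<' and '>'.
    return escape(value)


def import_value(xml_text):
    # Hypothetical stand-in for the upload step: decode the entities once.
    return unescape(xml_text)


def round_trip(value, cycles=3):
    # Run several export-import cycles; a healthy pipeline returns the
    # value unchanged however many cycles are run.
    for _ in range(cycles):
        value = import_value(export_value(value))
    return value


original = 'x > 1 & y < 2'
assert round_trip(original) == original

# The failure mode to watch for: encoding twice but decoding only once
# leaves a literal entity behind in the stored value.
double_encoded = export_value(export_value('a > b'))
assert double_encoded == 'a &amp;gt; b'
assert import_value(double_encoded) == 'a &gt; b'
```

If a record comes back with `&gt;' or `&amp;gt;' in its stored value
after a few cycles, some step is encoding or decoding once too often.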

Lastly, if the addition causes trouble in some places but not in
others, or if we want to be ultra safe, then we can change the signature
of the encode_for_xml() function to something like:

   def encode_for_xml(text, encode_gt=False):

and call the function with `encode_gt=True' only when needed.
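
A rough sketch of how such an opt-in flag could behave is below.  This
is not the existing function body: I am assuming here that the function
only escapes `&' and `<' today, and the real one may well do more
(washing of invalid characters, for instance).

```python
def encode_for_xml(text, encode_gt=False):
    # Sketch only; the body is an assumption, not the existing code.
    # '&' must be handled first, otherwise the '&' of the entities
    # produced below would itself get escaped.
    text = text.replace('&', '&amp;')
    text = text.replace('<', '&lt;')
    if encode_gt:
        # Opt-in: callers that do not pass the flag see no change.
        text = text.replace('>', '&gt;')
    return text


# Default behaviour leaves '>' alone, as before.
assert encode_for_xml('a > b') == 'a > b'
# With the flag, '>' is escaped too, so ']]>' can no longer appear.
assert encode_for_xml('x ]]> y', encode_gt=True) == 'x ]]&gt; y'
```

Keeping the default at False means that existing callers are
unaffected, and only the export code that needs it passes the flag.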

Best regards
-- 
Tibor Simko
