There are probably a couple of answers to that.

XML rules define what characterset is used. The "encoding" attribute on
the <?xml?> header is where you find out what characterset is being
used.

I've always gone under the assumption that if an encoding wasn't
specified, then UTF-8 is in effect and that has always worked for me.
It turns out the standard says US-ASCII is the default encoding.

But, ignoring the encoding, the original MarcXML rules were the same as
the MARC-21 rules for character repertoire and you were suppose to
restrict yourself to characters that could be mapped back into MARC-8.
I don't know if that rule is still in force, but everyone ignores it.

I hope that helps!

Ralph

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
Jonathan Rochkind
Sent: Tuesday, April 17, 2012 12:35 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: MarcXML and char encodings

I know how char encodings work in MARC ISO binary -- the encoding can 
legally be either Marc8 or UTF8 (nothing else).  The encoding of a 
record is specified in it's header. In the wild, specified encodings are

frequently wrong, or data includes weird mixed encodings. Okay!

But what's going on with MarcXML?  What are the legal encodings for 
MarcXML?  Only Marc8 and UTF8, or anything that can be expressed in 
XML?  The MARC header is (or can) be present in MarcXML -- trust the 
MARC header, or trust the XML doctype char encoding?

What's the legal thing  to do? What's actually found 'in the wild' with 
MarcXML?

Can anyone advise?

Jonathan

Reply via email to