[CODE4LIB] MarcXML and char encodings
I know how char encodings work in MARC ISO binary -- the encoding can legally be either Marc8 or UTF8 (nothing else). The encoding of a record is specified in its header. In the wild, specified encodings are frequently wrong, or data includes weird mixed encodings. Okay! But what's going on with MarcXML? What are the legal encodings for MarcXML? Only Marc8 and UTF8, or anything that can be expressed in XML? The MARC header is (or can be) present in MarcXML -- trust the MARC header, or trust the char encoding in the XML declaration? What's the legal thing to do? What's actually found 'in the wild' with MarcXML? Can anyone advise? Jonathan
Re: [CODE4LIB] MarcXML and char encodings
There are probably a couple of answers to that. XML rules define what character set is used. The encoding attribute on the <?xml ... ?> declaration is where you find out what character set is being used. I've always gone under the assumption that if an encoding wasn't specified, then UTF-8 is in effect, and that has always worked for me. It turns out the standard says US-ASCII is the default encoding.

But, ignoring the encoding, the original MarcXML rules were the same as the MARC-21 rules for character repertoire, and you were supposed to restrict yourself to characters that could be mapped back into MARC-8. I don't know if that rule is still in force, but everyone ignores it.

I hope that helps!

Ralph
Re: [CODE4LIB] MarcXML and char encodings
> What's the legal thing to do? What's actually found 'in the wild' with MarcXML?

In some cases, invalid XML. In an ideal world, the encoding should be included in the declaration. But I wouldn't trust it.

kyle -- -- Kyle Banerjee Digital Services Program Manager Orbis Cascade Alliance baner...@uoregon.edu / 503.999.9787
Re: [CODE4LIB] MarcXML and char encodings
So what if the XML declaration says one charset encoding, but the MARC header included in the MarcXML says a different encoding... which one is the 'legal' one to believe? Is it legal to have MarcXML that is not UTF-8 _or_ Marc8, that is, an entirely different charset that is legal in XML? If you did that, what should the MARC header included in the XML say?

I know how char encodings work in XML. I don't understand what the standards say about how that interacts with the MARC data in MarcXML.

Jonathan
Re: [CODE4LIB] MarcXML and char encodings
On 4/17/2012 1:57 PM, Kyle Banerjee wrote:
> In some cases, invalid XML. In an ideal world, the encoding should be included in the declaration. But I wouldn't trust it.

So would you use the Marc header payload instead? Or are you just saying you wouldn't trust _any_ encoding declarations you find anywhere?

When writing a library to handle marc, I think the baseline should be making it do the official, legal, standards-compliant right thing. Extra heuristics to deal with invalid data can be added on top.

But my trouble here is I can't even figure out what the official legal standards-compliant thing is. Maybe that's because the MarcXML standard simply doesn't address it, and it's all implementation dependent. sigh.

The problem is how the XML document's own char encoding is supposed to interact with the MARC header, especially because there's no way to name Marc8 as the char encoding in an XML declaration (is there?), and whether encodings other than Marc8 or UTF8 are legal in MarcXML, even though they aren't in MARC ISO binary.

I think the answer might be nobody knows, and there is no standard right way to do it. Which is unfortunate.
Re: [CODE4LIB] MarcXML and char encodings
Okay, maybe here's another way to approach the question.

If I want to have a MarcXML document encoded in Marc8 -- what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML? Or is it not in fact legal at all?

If I want to have a MarcXML document encoded in UTF8, what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML?

If I want to have a MarcXML document with a char encoding that is _neither_ Marc8 nor UTF8, but something else generally legal for XML -- is this legal at all? And if so, what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML?
Re: [CODE4LIB] MarcXML and char encodings
> If I want to have a MarcXML document encoded in Marc8 -- what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML? Or is it not in fact legal at all?

I'm going out on a limb here, but I don't think it is legal. There is no formal encoding that corresponds to MARC-8, so there's no way to tell XML tools how to interpret the bytes.

> If I want to have a MarcXML document encoded in UTF8, what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML?

<?xml version="1.0" encoding="UTF-8"?>

I suppose you'll want to set the leader to UTF-8 as well, but it doesn't really matter to any XML tools.

> If I want to have a MarcXML document with a char encoding that is _neither_ Marc8 nor UTF8, but something else generally legal for XML -- is this legal at all? And if so, what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML?

I'd claim this is legal, if it is legal XML. Set your encoding to anything that is valid.

As a Java programmer, using Java XML tools, the encoding is just a hint to the tools. I end up with Unicode strings after the XML is read. So I always ignore the encoding byte in the leader. Following that logic, that byte is about encoding. It has meaning when ISO 2709 is the transfer mechanism. But, in this case, XML is the transfer mechanism and its rules for identifying the encoding are what matter. I'm proposing that the encoding byte in the leader is meaningless.

Ralph
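Ralph's point about the parser handing back Unicode strings can be illustrated outside Java as well. The following is a minimal Python sketch, not anything posted in the thread: the sample record is invented, and the namespace URI is the one the MARCXML schema uses. The document is serialized as ISO-8859-1 while leader position 09 is left blank (which would mean MARC-8 in ISO 2709), and the standard-library parser follows the XML declaration regardless.

```python
import xml.etree.ElementTree as ET

# Namespace used by the MARCXML ("MARC21 slim") schema, in Clark notation.
MARC_NS = "{http://www.loc.gov/MARC21/slim}"

# Invented sample record: the declaration says ISO-8859-1, while leader
# position 09 is blank, which would mean MARC-8 in an ISO 2709 record.
xml_text = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    '<record xmlns="http://www.loc.gov/MARC21/slim">'
    '<leader>00000nam  2200000 a 4500</leader>'
    '<datafield tag="245" ind1="1" ind2="0">'
    '<subfield code="a">Caf\u00e9</subfield>'
    '</datafield></record>'
)
raw = xml_text.encode("iso-8859-1")  # bytes as they would arrive on the wire

record = ET.fromstring(raw)  # the parser decodes per the XML declaration
leader = record.findtext(MARC_NS + "leader")
title = record.findtext(f"{MARC_NS}datafield/{MARC_NS}subfield")

print(repr(leader[9]))  # ' '  (the leader still claims MARC-8)
print(title)            # Café (already a Unicode str)
```

By the time application code sees the title it is an ordinary Unicode string, so leader/09 has nothing left to describe.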
Re: [CODE4LIB] MarcXML and char encodings
Hi Ralph,

> But, ignoring the encoding, the original MarcXML rules were the same as the MARC-21 rules for character repertoire and you were supposed to restrict yourself to characters that could be mapped back into MARC-8. I don't know if that rule is still in force, but everyone ignores it.

That rule no longer applies per the December 2007 revision of the MARC 21 Specifications:

"To facilitate the movement of records between MARC-8 and Unicode environments, it was recommended for an initial period that the use of Unicode be restricted to a repertoire identical in extent to the MARC-8 repertoire. [...] however, such a restriction is no longer appropriate. The full UCS repertoire, as currently defined at the Unicode web site, is valid for encoding MARC 21 records subject only to the constraints described [in the current MARC 21 Specifications]."
-- from MARC 21 Specifications (revised December 2007) [1]

-- Michael

[1] http://www.loc.gov/marc/specifications/speccharucs.html
Re: [CODE4LIB] MarcXML and char encodings
Thanks, this is helpful feedback at least.

I think it's completely irrelevant, when determining what is legal under standards, to talk about what certain Java tools happen to do, though. I don't care too much what some tool you happen to use does. In this case, I'm _writing_ the tools. I want to make them do 'the right thing', with some mix of what's actually officially, legally correct and what's practically useful. What your Java tools do is more or less irrelevant to me.

I certainly _could_ make my tool respect the Marc leader encoded in MarcXML over the XML declaration if I wanted to. I could even make it assume the data is Marc8 in XML, even though there's no XML charset type for it, if the leader says it's Marc8.

But do others agree that there is in fact no legal way to have Marc8 in MarcXML? Do others agree that you can use non-UTF8 encodings in MarcXML, so long as they are legal XML?

I won't even ask someone to cite standards documents, because it's pretty clear that LC forgot to consider this when establishing MarcXML. (And I have no faith that one could get LC to make a call on this and publish it any time this century.)

Has anyone seen any Marc8-encoded MarcXML in the wild? Is it common? How is it represented with regard to the XML declaration and the Marc header?

Has anyone seen any MarcXML with char encodings that are neither Marc8 nor UTF8 in the wild? Are they common? How are they represented with regard to the XML declaration and the Marc header?
Re: [CODE4LIB] MarcXML and char encodings
> But do others agree that there is in fact no legal way to have Marc8 in MarcXML?

No -- it is perfectly legal -- but you MUST declare the encoding to BE Marc8 in the XML prolog, and you will want to be aware that XML processors are only REQUIRED to process UTF-8 and UTF-16. In practice many (including Java-based ones) can handle other encodings, but you will have to make sure whatever XML processor you use, in whatever language it is written, has a handy-dandy MARC8 coder/decoder ring.

Sheila
Re: [CODE4LIB] MarcXML and char encodings
On Tuesday, April 17, 2012 14:18, Jonathan Rochkind wrote:
> Okay, maybe here's another way to approach the question. If I want to have a MarcXML document encoded in Marc8 -- what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML? Or is it not in fact legal at all?
> If I want to have a MarcXML document encoded in UTF8, what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML?
> If I want to have a MarcXML document with a char encoding that is _neither_ Marc8 nor UTF8, but something else generally legal for XML -- is this legal at all? And if so, what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML?

You cannot have a MARC-XML document encoded in MARC-8. Well, sort of, but it's not standard. To answer your questions you have to refer to a variety of standards:

http://www.w3.org/TR/2008/REC-xml-20081126/#NT-EncodingDecl

"In an encoding declaration, the values UTF-8, UTF-16, ISO-10646-UCS-2, and ISO-10646-UCS-4 should be used for the various encodings and transformations of Unicode / ISO/IEC 10646, the values ISO-8859-1, ISO-8859-2, ... ISO-8859-n (where n is the part number) should be used for the parts of ISO 8859, and the values ISO-2022-JP, Shift_JIS, and EUC-JP should be used for the various encoded forms of JIS X-0208-1997. It is recommended that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just listed, be referred to using their registered names; other encodings should use names starting with an x- prefix. XML processors should match character encoding names in a case-insensitive way and should either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown (processors are, of course, not required to support all IANA-registered encodings).

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration."

1) The above says that <?xml version="1.0"?> means the same as <?xml version="1.0" encoding="utf-8"?>, and if you prefer you can omit the XML declaration, in which case UTF-8 is assumed unless there is a BOM (Byte Order Mark), which determines UTF-8 vs UTF-16BE vs UTF-16LE.

2) If you really wanted to encode the XML in MARC-8 you would need to specify an x- name, since if you refer to http://www.iana.org/assignments/character-sets you will see that MARC-8 isn't a registered character set, hence it cannot be specified in the encoding attribute unless the name is prefixed with x-. Which implies that no standard XML library will know how to convert the MARC-8 characters into Unicode so the XML DOM can be used. So unless you want to write your own MARC-8 <=> Unicode conversion routines and integrate them into your preferred XML library, it isn't going to work out of the box for anyone but yourself.

When dealing with MARC-XML you should ignore the values in LDR/00-04, LDR/10, LDR/11, LDR/12-16, and LDR/20-23. If you look at the MARC-XML schema you will note that the definition for leaderDataType specifies LDR/00-04 as [\d ]{5}, LDR/10 and LDR/11 as (2| ), LDR/12-16 as [\d ]{5}, and LDR/20-23 as (4500|    ). Note that the MARC-XML schema allows spaces in those positions because they are not relevant in the XML format, though very relevant in the binary format.

You probably should ignore LDR/09 as well, since most MARC to MARC-XML converters do not change this value to 'a', although many converters do change the value when converting MARC binary between MARC-8 and UTF-8.

The only valid character set for MARC-XML is Unicode, and it *should* be encoded in UTF-8 in Unicode normalization form D (NFD), although most XML libraries will not know the difference if it was encoded as UTF-16BE or UTF-16LE in Unicode normalization form D, since the XML libraries internally work with Unicode.

I could have sworn that this information was specified on LC's site at one point in time, but I'm having trouble finding the documentation.

Hope this helps,

Andy.
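Andy's advice about which leader positions to ignore, and about NFD, can be sketched in a few lines of Python. This is an illustration only: the helper names `mask_leader` and `to_nfd` and the sample leaders are hypothetical, and the masked positions are exactly the ones listed in the message above (00-04, 09, 10, 11, 12-16, 20-23).

```python
import unicodedata

# Half-open (start, end) ranges of leader positions that the message above
# says carry no meaning in MARC-XML: 00-04, 09, 10, 11, 12-16 and 20-23.
IGNORED_RANGES = [(0, 5), (9, 10), (10, 12), (12, 17), (20, 24)]


def mask_leader(leader: str) -> str:
    """Blank the ignorable positions so two leaders can be compared fairly."""
    chars = list(leader.ljust(24)[:24])
    for start, end in IGNORED_RANGES:
        for i in range(start, end):
            chars[i] = " "
    return "".join(chars)


def to_nfd(text: str) -> str:
    """Return text in Unicode normalization form D (decomposed)."""
    return unicodedata.normalize("NFD", text)


# Two invented leaders that differ only in positions the XML format ignores.
binary_style = "01234nam a2200289 a 4500"
xml_style = "     nam  22      a 4500"
print(mask_leader(binary_style) == mask_leader(xml_style))  # True
print(len(to_nfd("\u00e9")))  # 2: "e" followed by a combining acute accent
```

Positions such as LDR/06 (type of record) survive the masking, which matters because the leader is still the only place that information lives.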
Re: [CODE4LIB] MarcXML and char encodings
> So would you use the Marc header payload instead? Or you're just saying you wouldn't trust _any_ encoding declarations you find anywhere?

This. The short version is that too many vendors and systems just supply some value without making sure that's what they're spitting out. I haven't had to mess with this stuff for a few years, so I'm hoping Terry Reese weighs in on this conversation -- he has a lot of experience dealing with encoding headaches.

However, the bottom line is that the most reliable method is to use heuristics to detect what's going on. Yeah, that totally kills the point of listing encodings in the first place, but just as is the case with any unreliably used data point, it's all GIGO.

> When writing a library to handle marc, I think the base line should be making it do the official legal standards-compliant right thing. Extra heuristics to deal with invalid data can be added on top.

I'm hoping things have improved, but if heuristics are more reliable than reading the right areas of the record, you have to ignore what's there (which makes even reading it pointless). I do think there is value in encouraging vendors to actually pay attention to this stuff, as such basic screwups undermine both the credibility of the data source and the service that depends on the data.

> But my trouble here is I can't even figure out what the official legal standards-compliant thing is. Maybe that's because the MarcXML standard simply doesn't address it, and it's all implementation dependent. sigh. The problem is how the XML document's own char encoding is supposed to interact with the MARC header; especially because there's no way to put Marc8 in an XML char encoding declaration (is there?); and whether encodings other than Marc8 or UTF8 are legal in MarcXML, even though they aren't in MARC ISO binary. I think the answer might be nobody knows, and there is no standard right way to do it. Which is unfortunate.

A good summary of the situation as I understand it.
kyle -- -- Kyle Banerjee Digital Services Program Manager Orbis Cascade Alliance baner...@uoregon.edu / 503.999.9787
Re: [CODE4LIB] MarcXML and char encodings
The discussions at the MARC standards group relating to Unicode all had to do with using Unicode *within* ISO2709. I can't find any evidence that MARCXML ever went through the standards process. (This may not be a bad thing.) So none of what we know about the MARBI discussions and resulting standards can really help us here, except perhaps by analogy.

In LC's own example on the MARCXML page (the Sandburg example) the Leader is copied without change from the ISO2709/MARC-8 record to the MARCXML/Unicode record -- in other words, it still has a blank in offset 09, which means MARC-8. (The XML record is UTF-8.)

My gut feeling is that the Leader in MARCXML should be treated like the human appendix -- something that once had a use, but is now just being carried along for historical reasons. I would not expect it to reflect the XML record within which it is embedded. Unfortunately, it is the only source of some key information, like type of record.

The more I think about it, the more MARCXML strikes me as a really messed-up format.

kc

-- Karen Coyle kco...@kcoyle.net http://kcoyle.net ph: 1-510-540-7596 m: 1-510-435-8234 skype: kcoylenet
Re: [CODE4LIB] MarcXML and char encodings
On Tuesday, April 17, 2012 15:41, Karen Coyle wrote:
> The discussions at the MARC standards group relating to Unicode all had to do with using Unicode *within* ISO2709. I can't find any evidence that MARCXML ever went through the standards process. (This may not be a bad thing.) So none of what we know about the MARBI discussions and resulting standards can really help us here, except perhaps by analogy.

Well, I can confirm that MARCXML didn't go through MARBI, since I was one of OCLC's representatives who solidified MARCXML. MARCXML came out of a meeting at LC between the MARC Standards office, OCLC, RLG, and one or two other interested parties whom I cannot remember or find in my emails or notes about the meeting.

Andy.
Re: [CODE4LIB] MarcXML and char encodings
Let me make some recommendations. These are what I would consider best practices for interoperability.

1) Never put MARC-8 in XML. Just don't do it. No one expects it. Few will be willing to bother with it.
2) Always prefer UTF-8 for MARCXML. You can use any standard charset if you need to, but without special circumstances, use UTF-8.
3) Ignore leader 9 in MARCXML. Only consider the prolog. (Consider, not trust.) If you reasonably can, fail when the charset is wrong.

/dev
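The three recommendations above could be turned into a reader-side check along the following lines. This is a hedged Python sketch rather than code from any existing MARC library: `declared_encoding`, `check_marcxml_bytes`, and `UnsupportedEncoding` are hypothetical names, and the prolog sniffing only handles ASCII-compatible encodings (a UTF-16 document would need BOM handling that is omitted here).

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of an ASCII-compatible byte stream. \x27 is the single-quote character.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\x27]([A-Za-z][A-Za-z0-9._-]*)["\x27]'
)


class UnsupportedEncoding(Exception):
    """Raised for encodings this hypothetical reader refuses to handle."""


def declared_encoding(raw: bytes) -> str:
    """Return the encoding named in the prolog, or 'utf-8' if none is named."""
    match = _DECL.match(raw)
    return match.group(1).decode("ascii").lower() if match else "utf-8"


def check_marcxml_bytes(raw: bytes) -> str:
    """Consider only the prolog (never leader/09) and verify it strictly."""
    encoding = declared_encoding(raw)
    if "marc" in encoding:  # e.g. a made-up name such as "x-marc-8"
        raise UnsupportedEncoding(f"refusing MARC-8 in XML: {encoding}")
    # Strict decode: raises UnicodeDecodeError when the bytes do not match the
    # declared charset, and LookupError when Python has no such codec.
    raw.decode(encoding)
    return encoding


good = '<?xml version="1.0" encoding="UTF-8"?><r>\u00e9</r>'.encode("utf-8")
print(check_marcxml_bytes(good))  # utf-8
```

Failing loudly on a mismatch is the "consider, not trust" part: the prolog is read, and the bytes are then made to prove it.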
Re: [CODE4LIB] MarcXML and char encodings
On 4/17/2012 3:01 PM, Sheila M. Morrissey wrote:
> No -- it is perfectly legal -- but you MUST declare the encoding to BE Marc8 in the XML prolog,

Wait, how can you declare a Marc8 encoding in an XML declaration/prolog/whatever it's called? The things that appear there need to be from a specific list, and I didn't think Marc8 was on that list? Can you give me an example? And, if you happen to have it, a link to the XML standard that says this is legal?
Re: [CODE4LIB] MarcXML and char encodings
>> No -- it is perfectly legal -- but you MUST declare the encoding to BE Marc8 in the XML prolog,
> Wait, how can you declare a Marc8 encoding in an XML declaration/prolog/whatever it's called?

Nope, you can't do that. There is no approved name for the MARC-8 encoding. As Andy said, the closest you could get would be to make up an experimental name, like x-marc-8, but no tool in the world would recognize that.

Ralph
Re: [CODE4LIB] MarcXML and char encodings
In the XML standard:

"It is RECOMMENDED that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just listed, be referred to using their registered names; other encodings SHOULD use names starting with an x- prefix. XML processors SHOULD match character encoding names in a case-insensitive way and SHOULD either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown (processors are, of course, not required to support all IANA-registered encodings)."

As I suggested -- since MARC8 isn't (so far as I know) registered -- you won't get far with most standard tools, in whatever language. You'll have to extend them to first recognize the encoding name, and second, decode the content.

smm
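As a small illustration of how a stock tool behaves here, the Python sketch below feeds the standard-library parser an otherwise trivial document that declares a made-up name, x-marc-8. The helper `stock_parser_accepts` is hypothetical, and which exception is raised may differ between Python versions, so the sketch catches several types rather than promising one.

```python
import xml.etree.ElementTree as ET


def stock_parser_accepts(encoding_name: str) -> bool:
    """Report whether the standard-library parser accepts a declared encoding."""
    doc = (
        f'<?xml version="1.0" encoding="{encoding_name}"?><collection/>'
    ).encode("ascii")
    try:
        ET.fromstring(doc)
    except (ET.ParseError, LookupError, ValueError):
        # The parser has no decoder registered under this name.
        return False
    return True


for name in ("UTF-8", "ISO-8859-1", "x-marc-8"):
    print(name, stock_parser_accepts(name))
```

Making the last case succeed would mean doing what the message above describes: teaching the tool the name, then supplying the MARC-8 decoding itself.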
Re: [CODE4LIB] MarcXML and char encodings
MARC-8. Cool in its time. Dumb now. Typical. --ELM
Re: [CODE4LIB] MarcXML and char encodings
I think this is a case of being in violent agreement -- see some earlier replies in this thread. Pragmatically, if you are going to hew to MARC-8 encoding transported in XML, you are losing the usefulness of standard tools for XML.

smm