[CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Jonathan Rochkind
I know how char encodings work in MARC ISO binary -- the encoding can 
legally be either Marc8 or UTF8 (nothing else).  The encoding of a 
record is specified in its header. In the wild, specified encodings are 
frequently wrong, or data includes weird mixed encodings. Okay!


But what's going on with MarcXML?  What are the legal encodings for 
MarcXML?  Only Marc8 and UTF8, or anything that can be expressed in 
XML?  The MARC header is (or can be) present in MarcXML -- trust the 
MARC header, or trust the XML declaration's char encoding?


What's the legal thing  to do? What's actually found 'in the wild' with 
MarcXML?


Can anyone advise?

Jonathan


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread LeVan,Ralph
There are probably a couple of answers to that.

XML rules define what character set is used. The encoding attribute on
the <?xml ?> header is where you find out what character set is being
used.

I've always gone under the assumption that if an encoding wasn't
specified, then UTF-8 is in effect and that has always worked for me.
It turns out the standard says US-ASCII is the default encoding.

But, ignoring the encoding, the original MarcXML rules were the same as
the MARC-21 rules for character repertoire and you were supposed to
restrict yourself to characters that could be mapped back into MARC-8.
I don't know if that rule is still in force, but everyone ignores it.

I hope that helps!

Ralph



Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Kyle Banerjee
> What's the legal thing to do? What's actually found 'in the wild' with
> MarcXML?


In some cases, invalid XML.

In an ideal world, the encoding should be included in the declaration. But
I wouldn't trust it.

kyle


-- 
--
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
baner...@uoregon.edu / 503.999.9787


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Jonathan Rochkind
So what if the <?xml ?> declaration says one charset encoding, but the 
MARC header included in the MarcXML says a different encoding... which 
one is the 'legal' one to believe?


Is it legal to have MarcXML that is not UTF-8 _or_ Marc8, that is an 
entirely different charset that is legal in XML?  If you did that, what 
should the MARC header included in the XML say?


I know how char encodings work in XML.  I don't understand what the 
standards say about how that interacts with the MARC data in MarcXML.


Jonathan



Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Jonathan Rochkind

On 4/17/2012 1:57 PM, Kyle Banerjee wrote:

> In some cases, invalid XML. In an ideal world, the encoding should be
> included in the declaration. But I wouldn't trust it.
>
> kyle


So would you use the Marc header payload instead?

Or you're just saying you wouldn't trust _any_ encoding declarations you 
find anywhere?


When writing a library to handle marc, I think the base line should be 
making it do the official legal standards-compliant right thing.  Extra 
heuristics to deal with invalid data can be added on top.


But my trouble here is I can't even figure out what the official legal 
standards-compliant thing is.


Maybe that's because the MarcXML standard simply doesn't address it, and 
it's all implementation dependent. sigh.


The problem is how the XML document's own char encoding is supposed to 
interact with the MARC header; especially because there's no way to put 
Marc8 in an XML encoding declaration (is there?);  and whether 
encodings other than Marc8 or UTF8 are legal in MarcXML, even though 
they aren't in MARC ISO binary.


I think the answer might be nobody knows, and there is no standard 
right way to do it. Which is unfortunate.


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Jonathan Rochkind

Okay, maybe here's another way to approach the question.

If I want to have a MarcXML document encoded in Marc8 -- what should it 
look like?  What should be in the XML declaration? What should be in the 
MARC header embedded in the XML?  Or is it not in fact legal at all?


If I want to have a MarcXML document encoded in UTF8, what should it 
look like? What should be in the XML declaration? What should be in the 
MARC header embedded in the XML?


If I want to have a MarcXML document with a char encoding that is 
_neither_ Marc8 nor UTF8, but something else generally legal for XML -- 
is this legal at all? And if so, what should it look like? What should 
be in the XML declaration? What should be in the MARC header embedded in 
the XML?





Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread LeVan,Ralph
> If I want to have a MarcXML document encoded in Marc8 -- what should it
> look like?  What should be in the XML declaration? What should be in the
> MARC header embedded in the XML?  Or is it not in fact legal at all?

I'm going out on a limb here, but I don't think it is legal.  There is
no formal encoding that corresponds to MARC-8, so there's no way to tell
XML tools how to interpret the bytes.


> If I want to have a MarcXML document encoded in UTF8, what should it
> look like? What should be in the XML declaration? What should be in the
> MARC header embedded in the XML?

<?xml version="1.0" encoding="UTF-8"?>

I suppose you'll want to set the leader to UTF-8 as well, but it doesn't
really matter to any XML tools.


> If I want to have a MarcXML document with a char encoding that is
> _neither_ Marc8 nor UTF8, but something else generally legal for XML --
> is this legal at all? And if so, what should it look like? What should
> be in the XML declaration? What should be in the MARC header embedded in
> the XML?

I'd claim this is legal, if it is legal XML.  Set your encoding to
anything that is valid.

As a Java programmer, using java XML tools, the encoding is just a hint
to the tools.  I end up with Unicode strings after the XML is read.  So
I always ignore the encoding byte in the leader.

Following that logic, that byte is about encoding.  It has meaning when
ISO 2709 is the transfer mechanism.  But, in this case, XML is the
transfer mechanism and its rules for identifying the encoding are what
matter.  I'm proposing that the encoding byte in the leader is
meaningless.
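
A minimal sketch of that argument, assuming Python's standard xml.etree.ElementTree rather than the Java tools mentioned above. The record, its title text, and the helper name title_from_bytes are made up for illustration. The same record is serialized twice with different XML encodings; the parser honours the declaration each time and hands back identical Unicode strings, while the leader (LDR/09 blank, nominally "MARC-8") is never consulted.

```python
import xml.etree.ElementTree as ET

MARC_NS = "{http://www.loc.gov/MARC21/slim}"

# Hypothetical one-field record. LDR/09 is blank, which in ISO 2709
# would mean MARC-8, yet neither serialization below is MARC-8.
RECORD = (
    '<?xml version="1.0" encoding="{enc}"?>'
    '<record xmlns="http://www.loc.gov/MARC21/slim">'
    '<leader>00000nam  2200000 a 4500</leader>'
    '<datafield tag="245" ind1="0" ind2="0">'
    '<subfield code="a">Caf\u00e9 society</subfield>'
    '</datafield>'
    '</record>'
)


def title_from_bytes(data: bytes) -> str:
    """Parse MARCXML bytes and return 245$a as a Unicode string."""
    root = ET.fromstring(data)  # the parser reads the XML declaration itself
    path = f"{MARC_NS}datafield[@tag='245']/{MARC_NS}subfield[@code='a']"
    return root.find(path).text


# Same characters, two different byte streams.
latin1_bytes = RECORD.format(enc="ISO-8859-1").encode("iso-8859-1")
utf8_bytes = RECORD.format(enc="UTF-8").encode("utf-8")

print(title_from_bytes(latin1_bytes) == title_from_bytes(utf8_bytes))
```

Nothing in this sketch looks at the leader, which is the behaviour being proposed: once XML is the transfer mechanism, the declaration alone decides how bytes become characters.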

Ralph


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Doran, Michael D
Hi Ralph,

> But, ignoring the encoding, the original MarcXML rules were the same as
> the MARC-21 rules for character repertoire and you were supposed to
> restrict yourself to characters that could be mapped back into MARC-8.
> I don't know if that rule is still in force, but everyone ignores it.

That rule no longer applies per the December 2007 revision of the MARC 21 
Specifications:

To facilitate the movement of records between MARC-8 
and Unicode environments, it was recommended for an 
initial period that the use of Unicode be restricted 
to a repertoire identical in extent to the MARC-8 
repertoire. [...] however, such a restriction is no 
longer appropriate. The full UCS repertoire, as currently 
defined at the Unicode web site, is valid for encoding 
MARC 21 records subject only to the constraints described 
[in the current MARC 21 Specifications].

-- from MARC 21 Specifications (revised December 2007) [1]

-- Michael

[1] http://www.loc.gov/marc/specifications/speccharucs.html



Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Jonathan Rochkind

Thanks, this is helpful feedback at least.

I think it's completely irrelevant, when determining what is legal under 
standards, to talk about what certain Java tools happen to do though, I 
don't care too much what some tool you happen to use does.


In this case, I'm _writing_ the tools. I want to make them do 'the right 
thing', with some mix of what's actually official legally correct and 
what's practically useful.  What your Java tools do is more or less 
irrelevant to me. I certainly _could_ make my tool respect the Marc 
leader encoded in MarcXML over the XML declaration if I wanted to. I 
could even make it assume the data is Marc8 in XML, even though there's 
no XML charset type for it, if the leader says it's Marc8.


But do others agree that there is in fact no legal way to have Marc8 in 
MarcXML?


Do others agree that you can use non-UTF8 encodings in MarcXML, so long 
as they are legal XML?


I won't even ask someone to cite standards documents, because it's 
pretty clear that LC forgot to consider this when establishing MarcXML.  
(And I have no faith that one could get LC to make a call on this and 
publish it any time this century).


Has anyone seen any Marc8-encoded MarcXML in the wild? Is it common? How 
is it represented with regard to the XML declaration and the MARC leader?


Has anyone seen any MarcXML with char encodings that are neither Marc8 
nor UTF8 in the wild? Are they common? How are they represented with 
regard to the XML declaration and the MARC leader?




Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Sheila M. Morrissey
Re: But do others agree that there is in fact no legal way to have Marc8 in 
MarcXML?

No -- it is perfectly legal -- but you MUST declare the encoding to BE Marc8 in 
the XML prolog, and you will want to be aware that XML processors are only 
REQUIRED to process UTF-8 and UTF-16 -- in practice many (including Java-based 
ones) can handle other encodings -- but you will have to make sure whatever XML 
processor you use, in whatever language it is written, has a handy-dandy MARC8 
coder/decoder ring.

Sheila



Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Houghton,Andrew
> Jonathan Rochkind
> Sent: Tuesday, April 17, 2012 14:18
> Subject: Re: [CODE4LIB] MarcXML and char encodings
>
> Okay, maybe here's another way to approach the question.
>
> If I want to have a MarcXML document encoded in Marc8 -- what should it
> look like?  What should be in the XML declaration? What should be in the
> MARC header embedded in the XML?  Or is it not in fact legal at all?
>
> If I want to have a MarcXML document encoded in UTF8, what should it
> look like? What should be in the XML declaration? What should be in the
> MARC header embedded in the XML?
>
> If I want to have a MarcXML document with a char encoding that is
> _neither_ Marc8 nor UTF8, but something else generally legal for XML --
> is this legal at all? And if so, what should it look like? What should
> be in the XML declaration? What should be in the MARC header embedded in
> the XML?

You cannot have a MARC-XML document encoded in MARC-8, well sort of, but it's 
not standard. To answer your questions you have to refer to a variety of 
standards:

http://www.w3.org/TR/2008/REC-xml-20081126/#NT-EncodingDecl
In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", 
and "ISO-10646-UCS-4" should be used for the various encodings and 
transformations of Unicode / ISO/IEC 10646, the values "ISO-8859-1", 
"ISO-8859-2", ... "ISO-8859-n" (where n is the part number) should be used 
for the parts of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", and 
"EUC-JP" should be used for the various encoded forms of JIS X-0208-1997. It is 
recommended that character encodings registered (as charsets) with the Internet 
Assigned Numbers Authority [IANA-CHARSETS], other than those just listed, be 
referred to using their registered names; other encodings should use names 
starting with an "x-" prefix. XML processors should match character encoding 
names in a case-insensitive way and should either interpret an IANA-registered 
name as the encoding registered at IANA for that name or treat it as unknown 
(processors are, of course, not required to support all IANA-registered 
encodings).

In the absence of information provided by an external transport protocol (e.g. 
HTTP or MIME), it is a fatal error for an entity including an encoding 
declaration to be presented to the XML processor in an encoding other than that 
named in the declaration, or for an entity which begins with neither a Byte 
Order Mark nor an encoding declaration to use an encoding other than UTF-8. 
Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not 
strictly need an encoding declaration.


1) The above says that <?xml version="1.0"?> means the same as 
<?xml version="1.0" encoding="utf-8"?> and if you prefer you can omit the XML 
declaration and that is assumed to be UTF-8 unless there is a BOM (Byte Order 
Mark) which determines UTF-8 vs UTF-16BE vs UTF-16LE.
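
Those rules can be sketched in a few lines of Python. This is a simplified illustration, not the full detection procedure from the XML specification: it only handles a BOM or an ASCII-compatible declaration, and the function name sniff_xml_encoding is made up here.

```python
import codecs
import re

# Matches an encoding pseudo-attribute inside a leading XML declaration.
# Assumes the declaration itself is written in an ASCII-compatible encoding.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML byte stream: BOM, then declaration, else UTF-8."""
    # 1. A byte order mark wins.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"  # note: a UTF-32LE BOM would also land here
    # 2. Otherwise use whatever the declaration names.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. No BOM and no encoding declaration: the default is UTF-8.
    return "utf-8"


print(sniff_xml_encoding(b'<?xml version="1.0"?><record/>'))
print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><record/>'))
```

In practice a conforming XML parser does all of this internally; the sketch is only useful for logging or for comparing the declared encoding against what the bytes actually look like.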

2) If you really wanted to encode the XML in MARC-8 you need to specify an "x-" 
name, since if you refer to http://www.iana.org/assignments/character-sets MARC-8 
isn't a registered character set, hence cannot be specified in the encoding 
attribute unless the name was prefixed with "x-". Which implies that no 
standard XML library will know how to convert the MARC-8 characters into 
Unicode so the XML DOM can be used. So unless you want to write your own MARC-8 
<=> Unicode conversion routines and integrate them into your preferred XML 
library, it isn't going to work out of the box for anyone else but yourself.
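
A quick way to see this, assuming a stock Python install with no third-party codecs registered. The name "x-marc-8" is only an example of an experimental name, not an agreed-upon label, and the helper names are made up for this sketch.

```python
import codecs
import xml.etree.ElementTree as ET


def codec_is_known(name: str) -> bool:
    """True if Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


def parses_ok(data: bytes) -> bool:
    """True if the standard-library parser accepts the document."""
    try:
        ET.fromstring(data)
        return True
    except (LookupError, ValueError, ET.ParseError):
        # The exact exception type for an unknown encoding is an
        # implementation detail, so several are caught here.
        return False


marc8_doc = b'<?xml version="1.0" encoding="x-marc-8"?><record/>'
utf8_doc = b'<?xml version="1.0" encoding="UTF-8"?><record/>'

print(codec_is_known("x-marc-8"), parses_ok(marc8_doc))
print(codec_is_known("utf-8"), parses_ok(utf8_doc))
```

The document with the made-up encoding name is rejected before any element is read, so every consumer would need the same custom decoder installed.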

When dealing with MARC-XML you should ignore the values in LDR/00-04, LDR/10, 
LDR/11, LDR/12-16, LDR/20-23. If you look at the MARC-XML schema you will note 
that the definition for leaderDataType specifies LDR/00-04 "[\d ]{5}", LDR/10 
and LDR/11 "(2| )", LDR/12-16 "[\d ]{5}", LDR/20-23 "(4500|    )". Note the 
MARC-XML schema allows spaces in those positions because they are not relevant 
in the XML format, though very relevant in the binary format.
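
As a sketch of what that means for a tool, the fragments quoted above can be assembled into one 24-character pattern. The positions not mentioned above (LDR/05-09 and LDR/17-19) are left unconstrained here, which is an assumption of this sketch rather than what the schema says; the sample leader and function names are also just for illustration.

```python
import re

# Built only from the pieces quoted above; "." stands in for the
# positions this sketch does not constrain.
LEADER_XML_PATTERN = re.compile(
    r"[\d ]{5}"     # LDR/00-04 record length, digits or blanks
    r".{5}"         # LDR/05-09 left unconstrained in this sketch
    r"(2| )"        # LDR/10 indicator count
    r"(2| )"        # LDR/11 subfield code count
    r"[\d ]{5}"     # LDR/12-16 base address of data
    r".{3}"         # LDR/17-19 left unconstrained in this sketch
    r"(4500|    )"  # LDR/20-23 entry map
)


def leader_ok_for_marcxml(leader: str) -> bool:
    """True if the leader fits the assembled pattern (exactly 24 characters)."""
    return LEADER_XML_PATTERN.fullmatch(leader) is not None


def blank_binary_only_positions(leader: str) -> str:
    """Blank the positions that only matter for ISO 2709 framing."""
    chars = list(leader)
    for start, end in ((0, 5), (10, 12), (12, 17), (20, 24)):
        chars[start:end] = " " * (end - start)
    return "".join(chars)


sample = "01142cam  2200301 a 4500"  # illustrative leader
print(leader_ok_for_marcxml(sample))
print(leader_ok_for_marcxml(blank_binary_only_positions(sample)))
```

Both the original and the blanked leader pass, which is the point: in the XML form those positions carry no information a reader should rely on.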

You probably should ignore LDR/09 since most MARC to MARC-XML converters do not 
change this value to 'a' although many converters do change the value when 
converting MARC binary between MARC-8 and UTF-8. The only valid character set 
for MARC-XML is Unicode and it *should* be encoded in UTF-8 in Unicode 
normalization form D (NFD) although most XML libraries will not know the 
difference if it was encoded as UTF-16BE or UTF-16LE in Unicode normalization 
form D since the XML libraries internally work with Unicode.
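
For anyone unsure what normalization form D looks like in practice, here is a small sketch using Python's standard unicodedata module. The helper name to_marcxml_bytes is hypothetical.

```python
import unicodedata


def to_marcxml_bytes(text: str) -> bytes:
    """Normalize to NFD, then encode as UTF-8, as recommended above."""
    return unicodedata.normalize("NFD", text).encode("utf-8")


precomposed = "caf\u00e9"  # e-acute as one code point (NFC style)
decomposed = unicodedata.normalize("NFD", precomposed)

# NFD splits the accented letter into a base letter plus a combining mark,
# which mirrors how MARC-8 stores diacritics separately from letters.
print(len(precomposed), len(decomposed))
print(to_marcxml_bytes(precomposed))
```

The two strings render identically but compare unequal, so a tool that matches or deduplicates field values should normalize both sides first.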

I could have sworn that this information was specified on LC's site at one 
point in time, but I'm having trouble finding the documentation.


Hope this helps, Andy.


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Kyle Banerjee

> So would you use the Marc header payload instead?
>
> Or you're just saying you wouldn't trust _any_ encoding declarations you
> find anywhere?


This.

The short version is that too many vendors and systems just supply some
value without making sure that's what they're spitting out. I haven't had
to mess with this stuff for a few years, so I'm hoping Terry Reese weighs
in on this conversation -- he has a lot of experience dealing with encoding
headaches. However, the bottom line is that the most reliable method is to
use heuristics to detect what's going on. Yeah, that totally kills the
point of listing encodings in first place, but just as is the case with any
unreliably used data point, it's all GIGO.
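
One very basic heuristic of this kind, sketched in Python: check whether the bytes actually decode under the encoding they claim. This is an illustration only, with made-up function names. It cannot confirm a single-byte encoding such as ISO-8859-1 (every byte sequence decodes), and it says nothing about MARC-8, which Python has no codec for.

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """True if the bytes decode cleanly under the named codec."""
    try:
        data.decode(encoding)  # errors="strict" is the default
    except (UnicodeDecodeError, LookupError):
        return False
    return True


def check_declared_encoding(data: bytes, declared: str = "utf-8") -> str:
    """Compare what a record claims against what its bytes look like."""
    if not decodes_as(data, declared):
        return "declaration is wrong"
    if all(byte < 0x80 for byte in data):
        # Pure ASCII is valid under almost any declaration.
        return "ascii only, declaration untestable"
    return "consistent with declaration"


print(check_declared_encoding("Caf\u00e9".encode("utf-8"), "utf-8"))
print(check_declared_encoding("Caf\u00e9".encode("latin-1"), "utf-8"))
```

A strict UTF-8 decode is a fairly strong signal because random high-bit bytes rarely form valid UTF-8 sequences; anything beyond that needs record-level knowledge of the data.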

> When writing a library to handle marc, I think the base line should be
> making it do the official legal standards-compliant right thing.  Extra
> heuristics to deal with invalid data can be added on top.


I'm hoping things have improved, but if heuristics are more reliable than
reading the right areas of the record, you have to ignore what's there
(which makes even reading it pointless). I do think there is value in
encouraging vendors to actually pay attention to this stuff as such basic
screwups undermine both the credibility of the data source and the
service that depends on the data.


> But my trouble here is I can't even figure out what the official legal
> standards-compliant thing is.
>
> Maybe that's because the MarcXML standard simply doesn't address it, and
> it's all implementation dependent. sigh.
>
> The problem is how the XML document's own char encoding is supposed to
> interact with the MARC header; especially because there's no way to put
> Marc8 in an XML encoding declaration (is there?);  and whether encodings
> other than Marc8 or UTF8 are legal in MarcXML, even though they aren't in
> MARC ISO binary.
>
> I think the answer might be nobody knows, and there is no standard right
> way to do it. Which is unfortunate.


A good summary of the situation as I understand it.

kyle

-- 
--
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
baner...@uoregon.edu / 503.999.9787


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Karen Coyle
The discussions at the MARC standards group relating to Unicode all had 
to do with using Unicode *within* ISO2709. I can't find any evidence 
that MARCXML ever went through the standards process. (This may not be a 
bad thing.) So none of what we know about the MARBI discussions and 
resulting standards can really help us here, except perhaps by analogy.


In LC's own example on the MARCXML page (the Sandburg example) the 
Leader is copied without change from the ISO2709/MARC-8 record to the 
MARCXML/Unicode record -- in other words, it still has a blank in offset 
09, which means MARC-8. (The XML record is UTF-8.) My gut feeling is 
that the Leader in MARCXML should be treated like the human appendix -- 
something that once had a use, but is now just being carried along for 
historical reasons. I would not expect it to reflect the XML record 
within which it is embedded. Unfortunately, it is the only source of 
some key information, like type of record. The more I think about it, 
the more MARCXML strikes me as a really messed-up format.
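
In code, that split between useful and vestigial leader positions might look like the following Python sketch. It assumes a bare <record> root element in the MARC21 slim namespace; the sample record and the function name leader_facts are made up for illustration.

```python
import xml.etree.ElementTree as ET

MARC_NS = "{http://www.loc.gov/MARC21/slim}"

# Illustrative record: the document is UTF-8, but LDR/09 is still blank.
SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<record xmlns="http://www.loc.gov/MARC21/slim">'
    b'<leader>01142cam  2200301 a 4500</leader>'
    b'</record>'
)


def leader_facts(marcxml: bytes) -> dict:
    """Pull out the leader positions worth reading in MARCXML."""
    root = ET.fromstring(marcxml)
    leader = root.findtext(f"{MARC_NS}leader")
    return {
        "record_type": leader[6],          # LDR/06, still meaningful
        "bibliographic_level": leader[7],  # LDR/07, still meaningful
        "ldr09_raw": leader[9],            # LDR/09, reported but not trusted
    }


facts = leader_facts(SAMPLE)
print(facts["record_type"], repr(facts["ldr09_raw"]))
```

The blank in LDR/09 is returned only so a caller can log the mismatch; nothing in the sketch uses it to decide how to decode the record.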


kc




--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Houghton,Andrew
> Karen Coyle
> Sent: Tuesday, April 17, 2012 15:41
> Subject: Re: [CODE4LIB] MarcXML and char encodings
>
> The discussions at the MARC standards group relating to Unicode all had
> to do with using Unicode *within* ISO2709. I can't find any evidence
> that MARCXML ever went through the standards process. (This may not be
> a bad thing.) So none of what we know about the MARBI discussions and
> resulting standards can really help us here, except perhaps by analogy.

Well, I can confirm that MARCXML didn't go through MARBI, since I was
one of OCLC's representatives who solidified MARCXML. MARCXML came out
of a meeting at LC between the MARC Standards office, OCLC, RLG, and 
one or two other interested parties whom I cannot remember or find in
my emails or notes about the meeting.


Andy.


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Decasm
Let me make some recommendations. These are what I would consider best 
practices for interoperability.

1) Never put marc8 in xml. Just don't do it. No one expects it. Few will be 
willing to bother with it.

2) Always prefer utf8 for marcxml. You can use any standard charset if you need 
to, but without special circumstances, use utf8.

3) Ignore leader 9 in marcxml. Only consider the prolog. (Consider, not trust.)
If you reasonably can, fail when the charset is wrong.
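
A rough Python sketch of recommendations 2 and 3, using only the standard library. The function names and the sample record are hypothetical. The writer always emits UTF-8 with an explicit declaration; the reader leaves decoding to the XML parser and lets it fail when the bytes don't match the prolog.

```python
import io
import xml.etree.ElementTree as ET

NS = "{http://www.loc.gov/MARC21/slim}"


def write_marcxml_utf8(record: ET.Element) -> bytes:
    """Serialize a record as UTF-8 with an explicit XML declaration."""
    buf = io.BytesIO()
    ET.ElementTree(record).write(buf, encoding="UTF-8", xml_declaration=True)
    return buf.getvalue()


def read_marcxml_strict(data: bytes) -> ET.Element:
    """Parse using only the prolog; raises ET.ParseError on a bad charset."""
    return ET.fromstring(data)  # leader/09 is never consulted


# Build a tiny illustrative record.
record = ET.Element(NS + "record")
field = ET.SubElement(record, NS + "datafield", {"tag": "245", "ind1": "0", "ind2": "0"})
sub = ET.SubElement(field, NS + "subfield", {"code": "a"})
sub.text = "Caf\u00e9 society"

out = write_marcxml_utf8(record)
roundtrip = read_marcxml_strict(out)
print(roundtrip.find(f"{NS}datafield/{NS}subfield").text)

# Bytes that claim UTF-8 but are really Latin-1 are rejected outright.
bad = b'<?xml version="1.0" encoding="UTF-8"?><a>\xe9</a>'
try:
    read_marcxml_strict(bad)
except ET.ParseError as err:
    print("rejected:", err)
```

Failing loudly on the mismatched document is the "fail when the charset is wrong" behaviour; whether to fall back to heuristics afterwards is a separate decision for the caller.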

/dev



Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Jonathan Rochkind

On 4/17/2012 3:01 PM, Sheila M. Morrissey wrote:

> No -- it is perfectly legal -- but you MUST declare the encoding to BE Marc8 in
> the XML prolog,


Wait, how can you declare a Marc8 encoding in an XML 
declaration/prolog/whatever it's called?


The things that appear there need to be from a specific list, and I 
didn't think Marc8 was on that list?


Can you give me an example?  And, if you happen to have it, link to XML 
standard that says this is legal?


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread LeVan,Ralph
>> No -- it is perfectly legal -- but you MUST declare the encoding to
>> BE Marc8 in the XML prolog,
>
> Wait, how can you declare a Marc8 encoding in an XML
> declaration/prolog/whatever it's called?

Nope, you can't do that.  There is no approved name for the MARC-8
encoding.  As Andy said, the closest you could get would be to make up
an experimental name, like x-marc-8, but no tool in the world would
recognize that.

Ralph


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Sheila M. Morrissey
In XML standard:

It is RECOMMENDED that character encodings registered (as charsets) 
with the Internet Assigned Numbers Authority [IANA-CHARSETS], other than those 
just listed, be referred to using their registered names; other encodings 
SHOULD use names starting with an "x-" prefix. XML processors SHOULD match 
character encoding names in a case-insensitive way and SHOULD either 
interpret an IANA-registered name as the encoding registered at IANA for that 
name or treat it as unknown (processors are, of course, not required to support 
all IANA-registered encodings).


As I suggested -- since MARC8 isn't (so far as I know) registered -- you won't 
get far with most standard tools, in whatever language -- you'll have to extend 
them to first recognize the encoding name, and second, decode the content.

smm



Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Eric Lease Morgan
MARC-8. Cool in its time. Dumb now. Typical. --ELM


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Sheila M. Morrissey
I think this is a case of being in violent agreement -- see some earlier 
replies in this thread -- 
Pragmatically, if you are going to hew to marc-8 encoding transported in XML -- 
you are losing the usefulness of standard tools for xml --
smm
