On 4/18/2012 12:08 PM, Jonathan Rochkind wrote:
On 4/18/2012 11:09 AM, Doran, Michael D wrote:
I don't believe that is the case. Take UTF-8 out of the picture, and
consider the MARC-8 character set with its escape sequences and
combining characters. A character such as an "n" with a tilde would
consist of two bytes. The Greek small letter alpha, if invoked in
accordance with ANSI X3.41, would consist of five bytes (two bytes
for the initial escape sequence, a byte for the character, and then
two bytes for the escape sequence returning to the default character
set).
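To make the byte counts concrete, here is a small sketch of those two sequences written out as raw MARC-8 bytes. The specific code points (0xE4 for the ANSEL combining tilde, ESC g / ESC s for invoking and leaving the Greek symbol set, 0x61 for alpha) are my reading of the MARC-8 code tables, so treat them as illustrative rather than authoritative:

```java
public class Marc8Bytes {
    public static void main(String[] args) {
        // "n with tilde": in MARC-8 the combining diacritic (ANSEL 0xE4)
        // precedes the base letter, giving two bytes for one character
        byte[] nTilde = { (byte) 0xE4, 'n' };

        // Greek small alpha: ESC g invokes the Greek symbol set,
        // 0x61 is alpha within it, ESC s returns to the default set
        byte[] alpha = { 0x1B, 'g', 0x61, 0x1B, 's' };

        System.out.println(nTilde.length); // 2
        System.out.println(alpha.length);  // 5
    }
}
```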
ISO 2709 doesn't care how many bytes your characters are. The
directory and offsets and other things count bytes, not characters.
(which was, in my opinion, the _right_ decision, for once with marc!)
How bytes translate into characters is not a concern of ISO 2709.
Most non-7-bit-ASCII encodings have characters that occupy more than
one byte, either sometimes or always. This is true of MARC-8 (some
characters), UTF-8 (some characters), and UTF-16 (all characters).
(It is not true of Latin-1, though, which is one byte per character.)
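A quick way to see the per-encoding difference is to count the bytes a single character takes under each encoding. This is just Java's standard charset machinery, not anything MARC-specific:

```java
import java.nio.charset.StandardCharsets;

public class CharWidths {
    public static void main(String[] args) {
        String s = "\u00F1"; // LATIN SMALL LETTER N WITH TILDE
        System.out.println(s.getBytes(StandardCharsets.ISO_8859_1).length); // 1
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);      // 2
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length);   // 2
    }
}
```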
ISO 2709 doesn't care what char encodings you use, and there's no
standard ISO 2709 way to determine what char encodings are used for
_data_ in the MARC record. ISO 2709 does say that _structural_
elements like field names, subfield names, the directory itself,
separator chars, etc., all need to be (essentially, over-simplifying)
7-bit-ASCII. The actual data itself is application dependent, 2709
doesn't care, and 2709 doesn't give any standard cross-2709 way to
determine it.
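The byte-counting point is worth seeing in code: the "length of field" that goes into an ISO 2709 directory entry is a byte count, so the same field data yields different directory values under different encodings. The field value here is hypothetical; the four-digit zero-padded length is how directory entries are conventionally written:

```java
import java.nio.charset.StandardCharsets;

public class DirectoryLength {
    public static void main(String[] args) {
        String field = "Mu\u00F1oz"; // "Muñoz", hypothetical field data

        int chars = field.length(); // 5 characters
        // Directory counts bytes: UTF-8 data plus the 1-byte field terminator (0x1E)
        int bytes = field.getBytes(StandardCharsets.UTF_8).length + 1; // 7

        String lengthOfField = String.format("%04d", bytes);
        System.out.println(chars + " chars, directory says " + lengthOfField);
    }
}
```

Under MARC-8 the same five characters would be six data bytes as well (0xE4 precedes the n), which is exactly why counting bytes rather than characters keeps ISO 2709 encoding-agnostic.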
That is my conclusion at the moment, helped by all of you all in this
thread, thanks!
The conclusion that I came to in the work I have done on marc4j (which
is used heavily by SolrMarc) is that for any significant processing of
MARC records, the only solution that makes sense is to translate the
record data into Unicode characters as it is being read in. Of course,
as you and others have stated, determining what the data actually is, in
order to correctly translate it to Unicode, is no easy task. The leader
byte that merely indicates "is UTF-8" or "is not UTF-8" is wrong often
enough in the real world that it is of little value when it indicates
"is UTF-8", and of even less value when it indicates "is not UTF-8".
Significant portions of the code I've added to marc4j deal with trying
to determine what the encoding of that data actually is and trying to
translate the data correctly into Unicode even when the data is incorrect.
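One simple heuristic of the kind that detection code can use (this is my sketch, not marc4j's actual code) is to check whether the bytes even decode cleanly as UTF-8 before trusting the leader. A failed decode proves the data is not UTF-8; a successful decode is only suggestive, since pure-ASCII MARC-8 data also passes:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingGuess {
    /** True if the bytes form a valid UTF-8 sequence. */
    static boolean decodesAsUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8  = { 'M', 'u', (byte) 0xC3, (byte) 0xB1, 'o', 'z' }; // "Muñoz" in UTF-8
        byte[] marc8 = { (byte) 0xE4, 'n' }; // MARC-8 combining tilde + n: malformed as UTF-8
        System.out.println(decodesAsUtf8(utf8) + " " + decodesAsUtf8(marc8)); // true false
    }
}
```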
You also argued in another message that cataloger entry tools should
give feedback to help the cataloger not create errors. I agree. I
think one possible step toward this would be to require that the editor
work in Unicode, irrespective of the format in which the underlying
system expects the data. If the underlying system expects MARC8, then
the "save as" process should translate the data into MARC8 on
output.
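A minimal sketch of that "edit in Unicode, translate on save" idea, with a deliberately tiny mapping table (only the combining-tilde case from earlier in the thread; a real converter such as marc4j's UnicodeToAnsel covers the full MARC-8 repertoire):

```java
import java.io.ByteArrayOutputStream;

public class SaveAsMarc8 {
    // Toy Unicode-to-MARC-8 translation: the ANSEL combining tilde (0xE4)
    // is emitted before the base letter; everything else is assumed ASCII.
    static byte[] toMarc8(String unicodeText) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (char c : unicodeText.toCharArray()) {
            if (c == '\u00F1') {        // ñ
                out.write(0xE4);        // combining tilde first
                out.write('n');         // then the base letter
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // The editor holds "año" as Unicode; "save as" produces 4 MARC-8 bytes
        byte[] marc8 = toMarc8("a\u00F1o");
        System.out.println(marc8.length); // 4
    }
}
```

The point of the design is that the cataloger only ever sees and types real characters; the lossy, error-prone encoding decisions happen once, at the output boundary, where they can be validated.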
-Robert Haschart