On 4/18/2012 12:08 PM, Jonathan Rochkind wrote:
On 4/18/2012 11:09 AM, Doran, Michael D wrote:
I don't believe that is the case. Take UTF-8 out of the picture, and
consider the MARC-8 character set with its escape sequences and
combining characters. A character such as an "n" with a tilde would
consist of two bytes. The Greek small letter alpha, if invoked in
accordance with ANSI X3.41, would consist of five bytes (two bytes
for the initial escape sequence, a byte for the character, and then
two bytes for the escape sequence returning to the default character
set).
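To make the byte counts concrete, here is a small sketch of those two sequences written out as raw MARC-8 bytes. The specific code points (0xE4 for the ANSEL combining tilde, ESC g / ESC s for invoking and leaving the Greek symbol set, 0x61 for alpha) are my reading of the MARC-8 code tables, so treat them as illustrative rather than authoritative:

```java
public class Marc8Bytes {
    public static void main(String[] args) {
        // "n with tilde": in MARC-8 the combining diacritic (ANSEL 0xE4)
        // precedes the base letter, giving two bytes for one character
        byte[] nTilde = { (byte) 0xE4, 'n' };

        // Greek small alpha: ESC g invokes the Greek symbol set,
        // 0x61 is alpha within it, ESC s returns to the default set
        byte[] alpha = { 0x1B, 'g', 0x61, 0x1B, 's' };

        System.out.println(nTilde.length); // 2
        System.out.println(alpha.length);  // 5
    }
}
```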
ISO 2709 doesn't care how many bytes your characters are. The
directory and offsets and other things count bytes, not characters.
(which was, in my opinion, the _right_ decision, for once with marc!)
How bytes translate into characters is not a concern of ISO 2709.
Most non-7-bit-ASCII encodings have characters that occupy more than
one byte, either sometimes or always. This is true of MARC-8 (some
characters), UTF-8 (some characters), and UTF-16 (all characters).
(It is not true of Latin-1, though, which is one byte per character.)
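A quick way to see the per-encoding difference is to count the bytes a single character takes under each encoding. This is just Java's standard charset machinery, not anything MARC-specific:

```java
import java.nio.charset.StandardCharsets;

public class CharWidths {
    public static void main(String[] args) {
        String s = "\u00F1"; // LATIN SMALL LETTER N WITH TILDE
        System.out.println(s.getBytes(StandardCharsets.ISO_8859_1).length); // 1
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);      // 2
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length);   // 2
    }
}
```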
ISO 2709 doesn't care what char encodings you use, and there's no
standard ISO 2709 way to determine what char encodings are used for
_data_ in the MARC record. ISO 2709 does say that _structural_
elements like field names, subfield names, the directory itself,
separator chars, etc., all need to be (essentially, over-simplifying)
7-bit-ASCII. The actual data itself is application dependent, 2709
doesn't care, and 2709 doesn't give any standard cross-2709 way to
determine it.
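The byte-counting point is worth seeing in code: the "length of field" that goes into an ISO 2709 directory entry is a byte count, so the same field data yields different directory values under different encodings. The field value here is hypothetical; the four-digit zero-padded length is how directory entries are conventionally written:

```java
import java.nio.charset.StandardCharsets;

public class DirectoryLength {
    public static void main(String[] args) {
        String field = "Mu\u00F1oz"; // "Muñoz", hypothetical field data

        int chars = field.length(); // 5 characters
        // Directory counts bytes: UTF-8 data plus the 1-byte field terminator (0x1E)
        int bytes = field.getBytes(StandardCharsets.UTF_8).length + 1; // 7

        String lengthOfField = String.format("%04d", bytes);
        System.out.println(chars + " chars, directory says " + lengthOfField);
    }
}
```

Under MARC-8 the same five characters would be six data bytes as well (0xE4 precedes the n), which is exactly why counting bytes rather than characters keeps ISO 2709 encoding-agnostic.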
That is my conclusion at the moment, helped by all of you all in this
thread, thanks!
The conclusion that I came to in the work I have done on marc4j (which
is used heavily by SolrMarc) is that for any significant processing of
MARC records, the only solution that makes sense is to translate the
record data into Unicode characters as it is being read in. Of course,
as you and others have stated, determining what the data actually is, in
order to correctly translate it to Unicode, is no easy task. The leader
byte that merely indicates "is UTF-8" or "is not UTF-8" is wrong often
enough in the real world that it is of little value when it indicates
"is UTF-8", and of even less value when it indicates "is not UTF-8".
Significant portions of the code I've added to marc4j deal with trying
to determine what the encoding of that data actually is and trying to
translate the data correctly into Unicode even when the data is incorrect.
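One simple heuristic of the kind that detection code can use (this is my sketch, not marc4j's actual code) is to check whether the bytes even decode cleanly as UTF-8 before trusting the leader. A failed decode proves the data is not UTF-8; a successful decode is only suggestive, since pure-ASCII MARC-8 data also passes:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingGuess {
    /** True if the bytes form a valid UTF-8 sequence. */
    static boolean decodesAsUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8  = { 'M', 'u', (byte) 0xC3, (byte) 0xB1, 'o', 'z' }; // "Muñoz" in UTF-8
        byte[] marc8 = { (byte) 0xE4, 'n' }; // MARC-8 combining tilde + n: malformed as UTF-8
        System.out.println(decodesAsUtf8(utf8) + " " + decodesAsUtf8(marc8)); // true false
    }
}
```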
You also argued in another message that cataloger entry tools should
give feedback to help the cataloger not create errors. I agree. I
think one possible step toward this would be to require that the editor
work in Unicode, irrespective of the format in which the underlying
system expects the data. If the underlying system expects MARC8, then
the "save as" process should translate the data into MARC8 on
output.
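A minimal sketch of that "edit in Unicode, translate on save" idea, with a deliberately tiny mapping table (only the combining-tilde case from earlier in the thread; a real converter such as marc4j's UnicodeToAnsel covers the full MARC-8 repertoire):

```java
import java.io.ByteArrayOutputStream;

public class SaveAsMarc8 {
    // Toy Unicode-to-MARC-8 translation: the ANSEL combining tilde (0xE4)
    // is emitted before the base letter; everything else is assumed ASCII.
    static byte[] toMarc8(String unicodeText) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (char c : unicodeText.toCharArray()) {
            if (c == '\u00F1') {        // ñ
                out.write(0xE4);        // combining tilde first
                out.write('n');         // then the base letter
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // The editor holds "año" as Unicode; "save as" produces 4 MARC-8 bytes
        byte[] marc8 = toMarc8("a\u00F1o");
        System.out.println(marc8.length); // 4
    }
}
```

The point of the design is that the cataloger only ever sees and types real characters; the lossy, error-prone encoding decisions happen once, at the output boundary, where they can be validated.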
-Robert Haschart