Sorry for the delayed reply.

On Fri, Aug 30, 2013 at 12:12 PM, Karen Coyle <[email protected]> wrote:

> I believe this thread started on ol-discuss, but it's now "technical." I
> tried running the test set of 100 records through marcedit, and got an
> error. I suspect that the problem is with the character set because I
> was able to validate the records (which I believe just looks at
> structure) with that same program. Looking at the raw data, it looks to
> me like the records are using the "non-filing" elements that were added
> to the MARC standard but were never implemented in the US. So this (in
> hex):
>
> 0x1f 0x98 0x61 0x44 0x61 0x73 0x9c
>
> is the first part of
>
>  a˜Dasœ Imiut
>
> Where the "a" and "s" are printing out as the non-filing characters.
> (The records claim to be in utf-8)
>
> Because this never was implemented in the US it isn't documented in the
> MARC documentation. The latest info I can find is a 1998 proposal [1]
>

It looks like the 1998 proposal was approved according to these guidelines
from June:
http://www.loc.gov/marc/nonsorting.html


> that the control characters are:
>
> Hex 'X88' nonsorting character, begin
> Hex 'X89' nonsorting character, end
>
> (I believe those are ASCII characters, not Unicode.)
>

I don't think they're ASCII because they'd be é and ë, which would conflict
with normal characters.  The proposal says that they're drawn from the ISO 6630
bibliographic control characters, but it'd take CHF 50 to find out what that
spec says or what character set it's based on.

OK, after a maze of documents all pointing at each other, I found a place
that defines this in a useful fashion:
http://lcweb2.loc.gov/diglib/codetables/45.html

MARC-8  UCS   UTF-8  CHAR  NAME
88      0098  C298   ˜     NON-SORT BEGIN / START OF STRING
89      009C  C29C   œ     NON-SORT END / STRING TERMINATOR

which explains the oe ligature in your data, although the graphic
representation doesn't mean it's the same as the real tilde and oe
ligature.  The real tilde has a UTF-8 representation of 0x7E, not
0xC2 0x98.

The weird thing is that your data seems to have the raw 0x98 and 0x9C
without the 0xC2 byte introducing them.  That doesn't seem correct on the
surface, but I'm not sure where you cut & pasted your data from.
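
To make the byte-level difference concrete, here is a small Python sketch
(standard library only). The names NSB, NSE and describe are mine, and
treating the stray bytes as Windows-1252 is only my guess at why they
display as a tilde and an oe ligature:

```python
# C1 control characters from the LoC code table (names are illustrative).
NSB = "\u0098"  # NON-SORT BEGIN / START OF STRING
NSE = "\u009c"  # NON-SORT END / STRING TERMINATOR


def describe(raw: bytes) -> str:
    """Report whether raw is valid UTF-8 and list its code points."""
    try:
        text = raw.decode("utf-8")
        label = "valid UTF-8"
    except UnicodeDecodeError:
        # Assumption: a viewer falling back to Windows-1252 is what turns
        # the lone bytes 0x98 and 0x9C into a small tilde and an oe ligature.
        text = raw.decode("cp1252")
        label = "not UTF-8, shown as cp1252"
    return label + ": " + " ".join(f"U+{ord(c):04X}" for c in text)


# How the two markers should look on disk in a UTF-8 record.
print(NSB.encode("utf-8").hex(), NSE.encode("utf-8").hex())

# Markers with the 0xC2 lead byte, then without it.
print(describe(b"\xc2\x98Das\xc2\x9c"))
print(describe(b"\x98Das\x9c"))
```

If the records on disk really contain the second form, they aren't valid
UTF-8 as they stand, whatever the leader claims.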


> For OL (which doesn't really need non-filing characters, I believe) we
> could just strip these characters out. If someone could strip them out
> of the current set I could run marcedit again. I'm just trying to get a
> good look at the records to see if they'll translate well to OL fields.
>

Rather than futzing around with the closed-source MarcEdit, could I just use
pymarc to make a formatted dump of a few records for you?
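
Either way, stripping the markers for OL is a small job once the field
text has been decoded. A rough sketch, assuming the values are already
Unicode strings; the function names are made up for illustration and this
is plain Python, not anything from pymarc:

```python
import re

NSB = "\u0098"  # NON-SORT BEGIN / START OF STRING
NSE = "\u009c"  # NON-SORT END / STRING TERMINATOR

# A non-sorting span: begin marker, enclosed text, end marker, and any
# whitespace that follows it.
_NONSORT_SPAN = re.compile(re.escape(NSB) + ".*?" + re.escape(NSE) + r"\s*")


def strip_nonsort_markers(text: str) -> str:
    """Drop the two control characters but keep the text between them."""
    return text.replace(NSB, "").replace(NSE, "")


def sort_key(text: str) -> str:
    """Drop the whole non-sorting span, then any unpaired markers."""
    return strip_nonsort_markers(_NONSORT_SPAN.sub("", text))


title = NSB + "Das" + NSE + " Imiut"
print(strip_nonsort_markers(title))
print(sort_key(title))
```

The first function is probably all OL needs for display; the second shows
what the markers were meant for, in case a sort form ever becomes useful.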

Tom


> I'm heading off for 10 days to the Dublin Core conference in Lisbon. If
> anyone else has time to do analysis on this, please feel free:
>
> http://archive.org/details/marc21_records_german_national_library
>
> kc
>
> [1] http://www.loc.gov/marc/marbi/1998/98-16r.html
>
>
_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
To unsubscribe from this mailing list, send email to 
[email protected]
