Sorry for the delayed reply.

On Fri, Aug 30, 2013 at 12:12 PM, Karen Coyle <[email protected]> wrote:
> I believe this thread started on ol-discuss, but it's now "technical." I
> tried running the test set of 100 records through marcedit, and got an
> error. I suspect that the problem is with the character set because I
> was able to validate the records (which I believe just looks at
> structure) with that same program. Looking at the raw data, it looks to
> me like the records are using the "non-filing" elements that were added
> to the MARC standard but were never implemented in the US. So this (in
> hex):
>
> 0x1f 0x98 0x61 0x44 0x61 0x73 0x9c
>
> is the first part of
>
> a˜Dasœ Imiut
>
> Where the "a" and "s" are printing out as the non-filing characters.
> (The records claim to be in utf-8)
>
> Because this never was implemented in the US it isn't documented in the
> MARC documentation. The latest info I can find is a 1998 proposal [1]
>

It looks like the 1998 proposal was approved, according to these guidelines from June:

http://www.loc.gov/marc/nonsorting.html

> that the control characters are:
>
> Hex 'X88' nonsorting character, begin
> Hex 'X89' nonsorting character, end
>
> (I believe those are ASCII characters, not Unicode.)
>

I don't think they're ASCII, because they'd be é and ë, which would conflict with normal characters. The proposal says that they're drawn from ISO 6630 Bibliographic control characters, but it'd take CHF 50 to find out what that spec says or what character set it's based on.

OK, after a maze of documents all pointing at each other, I found a place that defines this in a useful fashion:

http://lcweb2.loc.gov/diglib/codetables/45.html

MARC-8  UCS   UTF-8  CHAR  NAME
88      0098  C298   ˜     NON-SORT BEGIN / START OF STRING
89      009C  C29C   œ     NON-SORT END / STRING TERMINATOR

which explains the oe ligature in your data, although the graphic representation doesn't mean it's the same as the real tilde and oe ligature. The real tilde has a UTF-8 representation of 0x7E instead of 0xC298.
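
To make the byte values concrete, here is a small Python sketch that encodes the two code points from the table and compares them with a plain tilde. The constant names and the helper `utf8_hex` are mine, purely for illustration. The last step rests on an assumption: that whatever displayed the data interpreted the bare bytes as Windows-1252, which would account for the "˜" and "œ" glyphs.

```python
# Code points taken from the LoC code table quoted above.
NON_SORT_BEGIN = "\u0098"  # START OF STRING
NON_SORT_END = "\u009c"    # STRING TERMINATOR


def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"{byte:02X}" for byte in text.encode("utf-8"))


print(utf8_hex(NON_SORT_BEGIN))  # C2 98
print(utf8_hex(NON_SORT_END))    # C2 9C
print(utf8_hex("~"))             # 7E

# Assumption: the viewer treated the bare bytes as Windows-1252, where
# 0x98 is SMALL TILDE (U+02DC) and 0x9C is the oe ligature (U+0153).
as_cp1252 = b"\x98\x9c".decode("cp1252")
print([f"U+{ord(ch):04X}" for ch in as_cp1252])  # ['U+02DC', 'U+0153']
```
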
The weird thing is that your data seems to have the raw 0x98 and 0x9C bytes without the 0xC2 byte introducing them. That doesn't seem correct on the surface, but I'm not sure where you cut & pasted your data from.

> For OL (which doesn't really need non-filing characters, I believe) we
> could just strip these characters out. If someone could strip them out
> of the current set I could run marcedit again. I'm just trying to get a
> good look at the records to see if they'll translate well to OL fields.
>

Rather than futzing around with closed-source MarcEdit, could I just use pymarc to make a formatted dump of a few records for you?

Tom

> I'm heading off for 10 days to the Dublin Core conference in Lisbon. If
> anyone else has time to do analysis on this, please feel free:
>
> http://archive.org/details/marc21_records_german_national_library
>
> kc
>
> [1] http://www.loc.gov/marc/marbi/1998/98-16r.html
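
If stripping turns out to be the way to go, something like the following rough Python sketch might do for a single field's data. The function name `strip_non_sort` is hypothetical, and the sketch assumes the data is otherwise UTF-8. It decodes with the `surrogateescape` error handler so that bare 0x98/0x9C bytes (as in the hex above) and properly encoded C2 98 / C2 9C sequences are both removed, while legitimate multi-byte characters that merely contain a 0x98 byte are left alone.

```python
# Non-sort markers in two forms: the real C1 code points (decoded from
# valid C2 98 / C2 9C sequences) and the lone surrogates that the
# "surrogateescape" handler produces for bare, invalid 0x98 / 0x9C bytes.
_NON_SORT = dict.fromkeys(map(ord, "\u0098\u009c\udc98\udc9c"))


def strip_non_sort(field_data: bytes) -> bytes:
    """Remove non-sort begin/end markers from (nominally) UTF-8 field data."""
    text = field_data.decode("utf-8", errors="surrogateescape")
    return text.translate(_NON_SORT).encode("utf-8", errors="surrogateescape")


# The bytes quoted earlier in this thread, followed by " Imiut".
sample = bytes([0x1F, 0x98, 0x61, 0x44, 0x61, 0x73, 0x9C]) + b" Imiut"
print(strip_non_sort(sample))  # b'\x1faDas Imiut'
```

Note that a MARC record's leader and directory store byte lengths and offsets, so this would have to be applied per field and the record re-serialized afterwards, rather than run over the raw file.
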
