The only language that I know of with a library for reading Marc8 and
converting to another encoding (such as UTF-8) is Java. The Marc4J
package will do it.
I suppose there may be C libraries too; is yaz written in C?
As Michael suggests, the easiest thing to do (if you're not in Java) is
probably to use the 'yaz' tools to convert to UTF-8 before anything else
touches it.
If you do end up writing a Marc8 handling library in another language
like Perl (presumably you could use the Java code in Marc4J as a guide),
please do share! Heh.
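For what it's worth, a quick way to see from inside Perl that there is nothing to put in place of FOO is to ask the Encode module whether it recognizes the name. This is only a sketch: encode_knows is a helper name made up for illustration, and it assumes a stock Encode install with no third-party encoding modules added.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if Perl's Encode module recognizes
# the given encoding name, false otherwise.
sub encode_knows {
    my ($name) = @_;
    # find_encoding returns undef when the name is unknown
    return defined Encode::find_encoding($name) ? 1 : 0;
}

# Report on one encoding Encode ships with and on MARC-8
for my $name ( 'UTF-8', 'MARC-8' ) {
    printf "%-7s %s\n", $name,
        encode_knows($name) ? 'supported' : 'not supported';
}
```

On a stock install this should report UTF-8 as supported and MARC-8 as not, which is why converting the records first looks like the practical route.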
On 10/24/2011 2:34 PM, Doran, Michael D wrote:
Hi Eric,
In Perl, how do I specify MARC-8 when reading (decoding) and writing
(encoding) data?
You can't. MARC-8 is a character set that is unknown to the operating system.
Your best bet is to convert MARC-8-encoded records into UTF-8.
...it is converted to Perl's
internal encoding (UTF-8)
As an FYI, UTF-8 is *not* Perl's internal encoding.
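A small sketch of what that distinction looks like in practice, using UTF-8 input since Encode does support it. The sample string and variable names are made up for illustration. Whatever Perl does internally, the program sees characters after decode and bytes after encode:

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Five bytes: "caf" plus the two-byte UTF-8 sequence for e-acute
my $bytes = "caf\xC3\xA9";

# After decoding, the string is four characters, not five bytes
my $chars = decode( 'UTF-8', $bytes );

print 'bytes: ', length($bytes), "\n";
print 'chars: ', length($chars), "\n";

# Encoding on the way out gives back the original five bytes
my $out = encode( 'UTF-8', $chars );
print $out eq $bytes ? "round trip ok\n" : "round trip mismatch\n";
```

The point is that decode and encode are the boundary between bytes on disk and characters in the program; the internal representation is Perl's business and not something the script should rely on.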
-- Michael
# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/
-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Eric
Lease Morgan
Sent: Monday, October 24, 2011 1:18 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] marc-8
In Perl, how do I specify MARC-8 when reading (decoding) and writing
(encoding) data?
Character encoding is the bane of my existence. I have learned that when
reading from a file I ought to specify the type of encoding the file is in
and decode accordingly, or else. Once read, it is converted to Perl's
internal encoding (UTF-8) and can be manipulated. Similarly, when writing I
am expected to specify the encoding. Both the reading (decoding) and the
writing (encoding) can be done with the Encode module. Here is some code
illustrating what I'm trying to do with MARC records which are apparently in
MARC-8:
  # require
  use strict;
  use warnings;
  use Encode qw( encode decode );
  use MARC::Batch;

  # initialize
  my $batch = MARC::Batch->new( 'USMARC', './records.mrc' );
  open my $out, '>', 'updated.mrc' or die "Can't open updated.mrc: $!";

  # process each record
  while ( my $marc = $batch->next ) {

    # get the title; 'FOO' is a placeholder for the unknown encoding name
    my $_245 = decode( 'FOO', $marc->title );

    # do cool stuff with the title here

    # output the cool stuff
    print $out encode( 'FOO', $_245 );

  }

  # done
  close $out;
  exit;
My problem is, I don't know what to put in place of FOO. What is the official
name of MARC-8's encoding scheme?
--
Eric "The Ugly American" Morgan
University of Notre Dame
(574) 631-8604