-----Original Message-----
From: Shelley Doljack [mailto:sdolj...@stanford.edu]
Sent: 31 July 2012 20:18
The problem was I wasn't telling perl to output UTF-8. Now that I added
binmode(FILE, ':utf8') to my script, the problem is fixed. However, it sounds
like once I set binmode to UTF-8 everything will be interpreted as such, even
when the record is in MARC-8. Is that right? So this means that I can only use
my script with a file of records where all of them are encoded in UTF-8. If I
want to run the script against a file with all MARC-8 encoding, then I'd need
to remove the binmode line.
It depends how much manipulation of the records you are doing in the script.
One approach is to use
binmode(FILE, ':raw');
for both input and output. Perl will then keep the bytes of the records
exactly as they are. You won't be able to test for exotic characters so
easily, and amending field content would be inadvisable, but if all you are
doing is something like reading in the records and filtering out any that have
no 245 field, or something fairly basic like that, this could be the best
approach.
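
As a rough sketch of that raw-bytes approach (an illustration, not code from the original script): the example below copies only the records whose directory contains a 245 entry, and never decodes any field content. It deliberately avoids MARC::Record and reads the ISO 2709 structure directly (24-byte leader, 12-byte directory entries, records ending in \x1D). The sub names has_tag and filter_records and the file names are hypothetical.

```perl
use strict;
use warnings;

# Return 1 if the raw ISO 2709 record has a directory entry for $tag.
sub has_tag {
    my ($raw, $tag) = @_;
    return 0 if length($raw) < 25;

    # Leader positions 12-16 hold the base address of the data.
    my $base = substr($raw, 12, 5);
    return 0 unless $base =~ /^\d{5}$/;

    # The directory sits between the 24-byte leader and the field
    # terminator that comes just before the base address.
    my $dir_len = $base - 24 - 1;
    return 0 if $dir_len < 0 || 24 + $dir_len > length($raw);

    my $dir = substr($raw, 24, $dir_len);
    for (my $i = 0; $i + 12 <= length($dir); $i += 12) {
        return 1 if substr($dir, $i, 3) eq $tag;
    }
    return 0;
}

# Copy records that have a 245 field from $in_path to $out_path,
# byte for byte. Returns the number of records kept.
sub filter_records {
    my ($in_path, $out_path) = @_;
    open(my $in,  '<', $in_path)  or die "Cannot open $in_path: $!";
    open(my $out, '>', $out_path) or die "Cannot open $out_path: $!";
    binmode($in,  ':raw');
    binmode($out, ':raw');

    local $/ = "\x1D";    # MARC record terminator
    my $kept = 0;
    while (my $raw = <$in>) {
        next unless has_tag($raw, '245');
        print {$out} $raw;
        $kept++;
    }
    close($in);
    close($out) or die "Cannot close $out_path: $!";
    return $kept;
}

# Hypothetical usage:
#   my $kept = filter_records('input.mrc', 'with_245.mrc');
```

Because nothing is decoded, this should behave the same whether the records are MARC-8 or UTF-8.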
The MARC::Record module does not seem to care how the records are encoded.
It's only once you start altering field content, testing field content, or
adding fields that the character set being used becomes an issue. Removing
fields would be fine too.
MARC-8 can be very complex, particularly if other code tables like CJK are
invoked, or even just Greek or Cyrillic. If you were manipulating field
content in that kind of way then converting everything to UTF-8 would make
things very much easier.
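
Before converting, it may help to know which records actually need it. In MARC 21, leader position 09 is 'a' for UCS/Unicode records and blank for MARC-8, so a minimal sketch of a per-record check could look like the following. This assumes the leader has been set correctly, which is not guaranteed in practice, and the sub name record_encoding is hypothetical. The conversion itself is not shown; the MARC::Charset module provides marc8_to_utf8 for that.

```perl
use strict;
use warnings;

# Guess a record's character encoding from leader position 09.
# Takes the raw record (or at least its 24-byte leader) as bytes.
sub record_encoding {
    my ($raw) = @_;
    return 'unknown' if length($raw) < 24;

    my $flag = substr($raw, 9, 1);
    return 'UTF-8'  if $flag eq 'a';    # UCS/Unicode
    return 'MARC-8' if $flag eq ' ';    # blank means MARC-8
    return 'unknown';
}
```

A script could call this on each record read with :raw and only convert those reported as MARC-8, leaving the UTF-8 records untouched.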
Matthew
--
Matthew Phillips
Electronic Systems Librarian, Durham University
Durham University Library, Stockton Road, Durham, DH1 3LY
+44 (0)191 334 2941