RE: printing UTF-8 encoded MARC records with as_usmarc

PHILLIPS M.E. Wed, 01 Aug 2012 01:57:06 -0700

> -----Original Message-----
> From: Shelley Doljack [mailto:sdolj...@stanford.edu]
> Sent: 31 July 2012 20:18
>
> The problem was I wasn't telling perl to output UTF-8. Now that I added
> binmode(FILE, ':utf8') to my script, the problem is fixed. However, it sounds
> like once I set binmode to UTF-8 everything will be interpreted as such, even
> when the record is in MARC-8. Is that right? So this means that I can only use
> my script with a file of records where all of them are encoded in UTF-8. If I
> want to run the script against a file with all MARC-8 encoding, then I'd need
> to remove the binmode line.


It depends on how much manipulation of the records you are doing in the script.
One approach is to use

binmode(FILE, ':raw');

for both input and output.  Perl will then keep the bytes of the records 
exactly as they are.  You won't be able to test for exotic characters so 
easily, and amending field content would be inadvisable, but if all you are 
doing is something basic, like reading in the records and filtering out any 
that have no 245 field, this could be the best approach.
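
To make the raw-bytes idea concrete, here is a minimal sketch of that kind of 
filter using core Perl only.  It is an assumption-laden illustration rather 
than the script under discussion: it does not use MARC::Record at all, the 
names has_tag and filter_records are made up for the example, and it relies 
only on the standard MARC layout (24-byte leader, 12-byte directory entries, 
0x1D as record terminator).

```perl
use strict;
use warnings;

# Hypothetical helper: true if the raw record's directory lists $tag.
sub has_tag {
    my ($record, $tag) = @_;
    return 0 if length($record) < 24;
    my $base = substr($record, 12, 5);    # leader 12-16: base address of data
    return 0 unless $base =~ /\A\d{5}\z/;
    return 0 if $base < 25 || $base > length $record;
    # The directory sits between the 24-byte leader and the field
    # terminator that precedes the data; each entry is 12 bytes.
    my $directory = substr($record, 24, $base - 25);
    for (my $i = 0; $i + 12 <= length $directory; $i += 12) {
        return 1 if substr($directory, $i, 3) eq $tag;
    }
    return 0;
}

# Hypothetical helper: copy every record that has $tag from $in to $out,
# byte for byte, and return how many records were kept.
sub filter_records {
    my ($in, $out, $tag) = @_;
    binmode($in,  ':raw');
    binmode($out, ':raw');
    local $/ = "\x1D";                    # MARC record terminator
    my $kept = 0;
    while (defined(my $record = <$in>)) {
        next if length($record) < 25;     # skip stray trailing bytes
        next unless has_tag($record, $tag);
        print {$out} $record;
        $kept++;
    }
    return $kept;
}
```

Called as filter_records($in, $out, '245') on two open filehandles, this 
never decodes the field data, so MARC-8 and UTF-8 records pass through 
unchanged.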

The MARC::Record module does not seem to care how the records are encoded.  
It's only once you start altering field content, testing field content, or 
adding fields that the character set being used becomes an issue.  Removing 
fields would be fine too.

MARC-8 can be very complex, particularly if other code tables like CJK are 
invoked, or even just Greek or Cyrillic.  If you were manipulating field 
content in that kind of way, then converting everything to UTF-8 would make 
things very much easier.
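
For a file that may mix encodings, one way to decide which records need 
converting is to look at leader position 09, which in MARC 21 is "a" for 
UCS/Unicode and blank for MARC-8.  The sketch below is only an illustration: 
declared_encoding is a made-up name, and the leader records what the record 
claims to be, which may not match the bytes it actually contains.

```perl
use strict;
use warnings;

# Hypothetical helper: report the character coding scheme that a raw
# MARC 21 record declares in leader position 09.
sub declared_encoding {
    my ($record) = @_;
    return 'unknown' if length($record) < 24;   # no complete leader
    my $flag = substr($record, 9, 1);
    return 'UTF-8'  if $flag eq 'a';            # "a" = UCS/Unicode
    return 'MARC-8' if $flag eq ' ';            # blank = MARC-8
    return 'unknown';
}
```

Records reported as MARC-8 could then be handed to a converter (the 
MARC::Charset module is one candidate) before any field content is touched; 
the conversion itself is not shown here.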

Matthew

-- 
Matthew Phillips
Electronic Systems Librarian, Durham University
Durham University Library, Stockton Road, Durham, DH1 3LY
+44 (0)191 334 2941
