RE: printing UTF-8 encoded MARC records with as_usmarc

Doran, Michael D Wed, 15 Aug 2012 08:59:37 -0700

Hi Devon,

> I just recently came across this presentation which lays out pretty much
> all the issues with Unicode in perl, and makes some recommendations for
> best practices.


While Nick Patch's presentation is excellent, I'm not sure that it "lays out 
pretty much all the issues with Unicode in perl".  ;-)

To fit that bill, I highly recommend this series of talks given by Tom 
Christiansen at OSCON 2011:

 1. Perl Unicode Essentials
 2. Unicode in Perl Regexes
 3. Unicode Support Shootout: The Good, The Bad, & the (mostly) Ugly

http://training.perl.com/OSCON2011/index.html
(resolves to http://98.245.80.27/tcpc/OSCON2011/index.html)

If you read through those presentations and disagree, I promise to buy you a 
beer at the next conference (code4lib?) we both attend.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/

> -----Original Message-----
> From: Smith,Devon [mailto:smit...@oclc.org]
> Sent: Tuesday, July 31, 2012 8:26 AM
> To: William Dueber; Shelley Doljack
> Cc: perl4lib@perl.org
> Subject: RE: printing UTF-8 encoded MARC records with as_usmarc
> 
> I just recently came across this presentation which lays out pretty much
> all the issues with Unicode in perl, and makes some recommendations for
> best practices. You may find some general insight into the whole
> situation by going over it.
> 
> http://www.slideshare.net/nickpatch/fundamental-unicode-at-dcbaltimore-perl-workshop-2012
> 
> /dev
> --
> Devon Smith
> Consulting Software Engineer
> OCLC Research
> http://www.oclc.org/research/people/smith.htm
> 
> 
> -----Original Message-----
> From: William Dueber [mailto:dueb...@umich.edu]
> Sent: Monday, July 30, 2012 8:14 PM
> To: Shelley Doljack
> Cc: perl4lib@perl.org
> Subject: Re: printing UTF-8 encoded MARC records with as_usmarc
> 
> First off, it's entirely possible that you have bad UTF-8 (perhaps rogue
> MARC-8, perhaps just lousy characters) in your MARC. I know we have plenty
> of that crap.
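
One way to check for bad UTF-8 before blaming the output side is a strict decode. The following is only a minimal sketch using the core Encode module; is_valid_utf8 is a hypothetical helper name (it is not part of MARC::Record), and the two sample byte strings are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns 1 when $bytes is well-formed UTF-8, else 0.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # decode() may modify its input when a CHECK value is given,
    # so work on a copy.
    my $copy = $bytes;
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Made-up sample data (assumptions, not taken from a real record).
my $good = "caf\xC3\xA9 au lait";   # "é" encoded as the two bytes C3 A9
my $bad  = "caf\xF4 au lait";       # a stray 0xF4 byte, not valid UTF-8

printf "good: %s\n", is_valid_utf8($good) ? "valid" : "invalid";
printf "bad:  %s\n", is_valid_utf8($bad)  ? "valid" : "invalid";
```

Running a check like this over the raw bytes of a suspect field should show whether the input is the problem, independent of how the output file is opened.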
> 
> You need to tell perl that you'll be outputting UTF-8 using 'binmode':
> 
>   binmode(FILE, ':utf8');
> 
> In general, you'll want to do this to basically every file you open for
> reading or writing.
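
As a minimal sketch of the same idea in a self-contained form: the example below writes a character string through a UTF-8 output layer and counts the resulting bytes. The sample string and the in-memory file are assumptions for illustration only. It uses ':encoding(UTF-8)', the stricter spelling of the ':utf8' layer, and a lexical filehandle rather than the bareword FILE.

```perl
use strict;
use warnings;

# Made-up sample data: "Dvořák" as a Perl character string.
# U+0159 is above 0xFF, so printing it without an encoding layer
# triggers the "Wide character in print" warning.
my $text = "Dvo\x{159}\x{E1}k";

# Write to an in-memory file so the sketch needs no real file on disk.
my $buffer = '';
open(my $out, '>', \$buffer) or die "Cannot open in-memory file: $!";
binmode($out, ':encoding(UTF-8)');   # declare the output encoding
print {$out} $text;
close($out) or die "Cannot close in-memory file: $!";

# $buffer now holds the UTF-8 encoded bytes of $text.
printf "%d characters written as %d bytes\n",
    length($text), length($buffer);
```

In the extraction script, the equivalent step would be a binmode call on the output handle right after it is opened and before the loop that prints each record.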
> 
> A great overview of Perl and UTF-8 can be found at:
> 
> http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default
> 
> 
> 
> 
> 
> On Mon, Jul 30, 2012 at 6:51 PM, Shelley Doljack
> <sdolj...@stanford.edu> wrote:
> 
> > Hi,
> >
> > I wrote a script that extracts MARC records from a file given certain
> > conditions and puts them in a new file. When my input record is correctly
> > encoded in UTF-8 and I run my script from the Windows command prompt, this
> > warning message appears: "Wide character in print at
> > record_extraction.pl line 99" (the line in my script where I print to a
> > new file using as_usmarc). I compared the extracted record before and
> > after in MarcEdit and the diacritic was changed. I tried marcdump
> > newfile.mrc to see what happens and I get this error: "utf8 \xF4 does not
> > map to Unicode at C:/Perl64/lib/Encode.pm line 176." When I run my
> > extraction script again with MARC-8 encoded data then I don't have the
> > same problem.
> >
> > The basic outline of my script is:
> >
> > my $batch = MARC::Batch->new('USMARC', $input_file);
> >
> > while (my $record = $batch->next()) {
> >      #do some checks
> >      #if checks ok then
> >      print FILE $record->as_usmarc();
> > }
> >
> > Do I need to add something that specifies to interpret the data as
> > UTF-8? Does MARC::Record not handle UTF-8 at all?
> >
> > Thanks,
> > Shelley
> >
> > ----
> > Shelley Doljack
> > E-Resources Metadata Librarian
> > Metadata and Library Systems
> > Stanford University Libraries
> > sdolj...@stanford.edu
> > 650-725-0167
> >
> 
> 
> 
> --
> 
> Bill Dueber
> Programmer -- Library Systems
> University of Michigan
