First off, it's entirely possible that you have bad UTF-8 (perhaps rogue
MARC-8, perhaps just lousy characters) in your MARC. I know we have plenty
of that crap.
You need to tell Perl that you'll be outputting UTF-8, using 'binmode':
binmode(FILE, ':utf8');
In general, you'll want to do this to basically every file you open for
reading or writing.
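
As a minimal sketch of that advice using only core Perl (no MARC modules), the snippet below writes a string containing a wide character and reads it back. The scratch file and the sample string are made up for illustration, and I'm using lexical filehandles rather than a bareword FILE. The ':encoding(UTF-8)' layer is the stricter cousin of ':utf8'; either one should make the "Wide character in print" warning go away for ordinary text output.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file, used only for this illustration.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# A character string with code points above 0xFF (U+014D, o with macron).
my $title = "T\x{14D}ky\x{14D}";

# Writing: set the layer on an already-open handle with binmode().
open(my $out, '>', $path) or die "Cannot write $path: $!";
binmode($out, ':encoding(UTF-8)');
print {$out} $title, "\n";
close($out) or die "Cannot close $path: $!";

# Reading: the layer can also be given directly in the open() mode.
open(my $in, '<:encoding(UTF-8)', $path) or die "Cannot read $path: $!";
my $round_trip = <$in>;
close($in);
chomp $round_trip;
```

If the layer on the read side is left off, $round_trip comes back as raw bytes rather than characters, which is one common way to end up with double-encoded output later on.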
A great overview of Perl and UTF-8 can be found at:
http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default
On Mon, Jul 30, 2012 at 6:51 PM, Shelley Doljack sdolj...@stanford.edu wrote:
Hi,
I wrote a script that extracts marc records from a file given certain
conditions and puts them in a new file. When my input record is correctly
encoded in UTF-8 and I run my script from windows command prompt, this
warning message appears: Wide character in print at record_extraction.pl line
99 (the line in my script where I print to a new file using
as_usmarc). I compared the extracted record before and after in MarcEdit
and the diacritic was changed. I tried marcdump newfile.mrc to see what
happens and I get this error: utf8 \xF4 does not map to Unicode at
C:/Perl64/lib/Encode.pm line 176. When I run my extraction script again
with MARC-8 encoded data then I don't have the same problem.
The basic outline of my script is:
my $batch = MARC::Batch->new('USMARC', $input_file);
while (my $record = $batch->next()) {
    #do some checks
    #if checks ok then
    print FILE $record->as_usmarc();
}
Do I need to add something that specifies to interpret the data as UTF-8?
Does MARC::Record not handle UTF-8 at all?
Thanks,
Shelley
Shelley Doljack
E-Resources Metadata Librarian
Metadata and Library Systems
Stanford University Libraries
sdolj...@stanford.edu
650-725-0167
--
Bill Dueber
Programmer -- Library Systems
University of Michigan