printing UTF-8 encoded MARC records with as_usmarc

2012-07-30 Thread Shelley Doljack
Hi,

I wrote a script that extracts MARC records from a file given certain 
conditions and puts them in a new file. When my input record is correctly 
encoded in UTF-8 and I run my script from the Windows command prompt, this 
warning message appears: "Wide character in print at record_extraction.pl 
line 99" (the line in my script where I print to a new file using as_usmarc). 
I compared the extracted record before and after in MarcEdit, and the 
diacritic had been changed. I tried marcdump newfile.mrc to see what happens, 
and I get this error: "utf8 \xF4 does not map to Unicode at 
C:/Perl64/lib/Encode.pm line 176". When I run my extraction script again with 
MARC-8 encoded data, I don't have the same problem. 

The basic outline of my script is:

my $batch = MARC::Batch->new('USMARC', $input_file);

while (my $record = $batch->next()) {
    # do some checks
    # if checks ok then
    print FILE $record->as_usmarc();
}

Do I need to add something that specifies to interpret the data as UTF-8? Does 
MARC::Record not handle UTF-8 at all? 

Thanks,
Shelley


Shelley Doljack  
E-Resources Metadata Librarian 
Metadata and Library Systems
Stanford University Libraries
sdolj...@stanford.edu
650-725-0167


Re: printing UTF-8 encoded MARC records with as_usmarc

2012-07-30 Thread William Dueber
First off, it's entirely possible that you have bad UTF-8 (perhaps rogue
MARC-8, perhaps just lousy characters) in your MARC. I know we have plenty
of that crap.

You need to tell Perl that you'll be outputting UTF-8, using 'binmode':

  binmode(FILE, ':utf8');

In general, you'll want to do this to basically every file you open for
reading or writing.
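
To make the effect of the layer concrete, here is a minimal sketch using only
core Perl. It does not use MARC::Record at all: $text is a hypothetical
stand-in for the character string you would get from as_usmarc, and
bytes_written is a helper name made up for this example. It also uses
':encoding(UTF-8)' instead of ':utf8'; that is a swapped-in, stricter
spelling of the same idea, not what the script above does.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for the output of $record->as_usmarc():
# a decoded character string containing U+00F4 (o with circumflex).
my $text = "C\x{F4}te";

# Write $text to a temp file through the given I/O layer and
# return the raw bytes that ended up on disk.
sub bytes_written {
    my ($layer) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    binmode($fh, $layer);
    print {$fh} $text;
    close($fh) or die "close failed: $!";

    open(my $in, '<:raw', $path) or die "cannot reopen $path: $!";
    local $/;               # slurp mode: read the whole file at once
    my $bytes = <$in>;
    close($in);
    return $bytes;
}

# No encoding layer: the character is written as a single Latin-1 byte.
my $no_layer = bytes_written(':raw');

# UTF-8 layer: the character is encoded as a two-byte UTF-8 sequence.
my $utf8_layer = bytes_written(':encoding(UTF-8)');

printf "no layer:   %s\n", join ' ', map { sprintf '%02X', ord } split //, $no_layer;
printf "utf8 layer: %s\n", join ' ', map { sprintf '%02X', ord } split //, $utf8_layer;
```

Without an encoding layer the character goes out as the lone byte F4, which
is not valid UTF-8 on its own; with the layer it goes out as C3 B4. Whether
that is exactly how the \xF4 ended up in the file in question is a guess,
but it is the kind of output a missing layer produces.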

A great overview of Perl and UTF-8 can be found at:

http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default









-- 

Bill Dueber
Programmer -- Library Systems
University of Michigan