RE: printing UTF-8 encoded MARC records with as_usmarc
Hi Devon,

> I just recently came across this presentation which lays out pretty much
> all the issues with Unicode in perl, and makes some recommendations for
> best practices.

While Nick Patch's presentation is excellent, I'm not sure that it "lays out pretty much all the issues with Unicode in perl". ;-)

To fit that bill, I highly recommend this series of talks given by Tom Christiansen at OSCON 2011:

1. Perl Unicode Essentials
2. Unicode in Perl Regexes
3. Unicode Support Shootout: The Good, The Bad, & the (mostly) Ugly

http://training.perl.com/OSCON2011/index.html
(resolves to http://98.245.80.27/tcpc/OSCON2011/index.html)

If you read through those presentations and disagree, I promise to buy you a beer at the next conference (code4lib?) we both attend.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/
Re: printing UTF-8 encoded MARC records with as_usmarc
Hi Matthew,

Thanks for the advice. For this particular script, I'm not doing any data manipulation, so using :raw is probably the approach I want to take. I'm just feeding my script a list of record IDs and a MARC file in order to pull out the records whose IDs are on the list.

Thanks,
Shelley
Re: printing UTF-8 encoded MARC records with as_usmarc
On Tue, Jul 31, 2012 at 09:25:55AM -0400, Smith,Devon wrote:
> I just recently came across this presentation which lays out pretty much all
> the issues with Unicode in perl, and makes some recommendations for best
> practices. You may find some general insight into the whole situation by
> going over it.

In the course of preparing the latest edition of the Camel book, Tom Christiansen created a Perl Unicode Cookbook; see
http://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html

It's available in a few different places on the web.

C.

--
Colin Campbell
Chief Software Engineer, PTFS Europe Limited
Content Management and Library Solutions
+44 (0) 800 756 6803 (phone)
+44 (0) 7759 633626 (mobile)
colin.campb...@ptfs-europe.com
skype: colin_campbell2
http://www.ptfs-europe.com
RE: printing UTF-8 encoded MARC records with as_usmarc
> -Original Message-
> From: Shelley Doljack [mailto:sdolj...@stanford.edu]
> Sent: 31 July 2012 20:18
>
> The problem was I wasn't telling perl to output UTF-8. Now that I added
> binmode(FILE, ':utf8') to my script, the problem is fixed. However, it sounds
> like once I set binmode to UTF-8 everything will be interpreted as such, even
> when the record is in MARC-8. Is that right? So this means that I can only use
> my script with a file of records where all of them are encoded in UTF-8. If I
> want to run the script against a file with all MARC-8 encoding, then I'd need
> to remove the binmode line.

It depends how much manipulation of the records you are doing in the script. One approach is to use

    binmode(FILE, ':raw');

for both input and output. Perl will then keep the bytes of the records exactly as they are. You won't be able to test for exotic characters so easily, and amending field content would be inadvisable, but if all you are doing is something like reading in the records and filtering out any that have no 245 field, or something fairly basic like that, this could be the best approach.

The MARC::Record module does not seem to care how the records are encoded. It's only once you start altering field content, testing field content, or adding fields that the character set being used becomes an issue. Removing fields would be fine too.

MARC-8 can be very complex, particularly if other code tables like CJK are invoked, or even just Greek or Cyrillic. If you were manipulating field content in that kind of way, then converting everything to UTF-8 would make things very much easier.

Matthew

--
Matthew Phillips
Electronic Systems Librarian, Durham University
Durham University Library, Stockton Road, Durham, DH1 3LY
+44 (0)191 334 2941
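[Editorial sketch] The :raw approach described above can be illustrated in core Perl alone, without MARC::Record. This is only a sketch under stated assumptions: it assumes well-formed MARC 21 records (24-byte leader, 12-byte directory entries, record terminator 0x1D, field terminator 0x1E), and the names control_number and copy_matching_records are made up for the illustration. Records are read and written as untouched bytes, so MARC-8 and UTF-8 records pass through equally.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 001 field of one raw MARC 21 record,
# or nothing if the record has no 001 or does not look like MARC.
sub control_number {
    my ($raw) = @_;
    return if length($raw) < 25;
    my $base = substr($raw, 12, 5);    # leader 12-16: base address of data
    return unless $base =~ /\A[0-9]{5}\z/ && $base >= 25;
    # The directory sits between the leader and the base address,
    # followed by one field terminator that we leave out here.
    my $directory = substr($raw, 24, $base - 25);
    for (my $i = 0; $i + 12 <= length($directory); $i += 12) {
        my ($tag, $len, $start) = unpack 'a3 a4 a5', substr($directory, $i, 12);
        next unless $tag eq '001';
        return unless $len =~ /\A[0-9]+\z/ && $start =~ /\A[0-9]+\z/;
        return if $base + $start + $len > length($raw);
        my $field = substr($raw, $base + $start, $len);
        $field =~ s/\x1E\z//;          # strip the field terminator
        return $field;
    }
    return;
}

# Hypothetical helper: copy every record whose 001 is a key of %$wanted
# from $in to $out, byte for byte. Returns the number of records copied.
sub copy_matching_records {
    my ($in, $out, $wanted) = @_;
    binmode($in,  ':raw');
    binmode($out, ':raw');
    local $/ = "\x1D";                 # read one MARC record per <$in>
    my $copied = 0;
    while (defined(my $raw = <$in>)) {
        my $id = control_number($raw);
        next unless defined $id && $wanted->{$id};
        print {$out} $raw;
        $copied++;
    }
    return $copied;
}
```

A caller would open the input and output files itself and pass the two handles plus a hash of wanted IDs, for example copy_matching_records($in, $out, { 'ocm12345' => 1 }), where the ID is a placeholder. Because nothing is decoded, field content cannot safely be inspected beyond ASCII values such as the control number.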
Re: printing UTF-8 encoded MARC records with as_usmarc
On Wed, Aug 1, 2012 at 12:47 AM, Shelley Doljack wrote:
> The problem was I wasn't telling perl to output UTF-8. Now that I added
> binmode(FILE, ':utf8') to my script, the problem is fixed. However, it
> sounds like once I set binmode to UTF-8 everything will be interpreted as
> such, even when the record is in MARC-8. Is that right? So this means that
> I can only use my script with a file of records where all of them are
> encoded in UTF-8. If I want to run the script against a file with all
> MARC-8 encoding, then I'd need to remove the binmode line.

Sometimes it's easier to use the yaz-marcdump utility for MARC-8 to UTF-8 conversion (it's much faster):

    yaz-marcdump -f MARC-8 -t UTF-8 -o marc marc21.in >marc21.out

http://www.indexdata.com/yaz/doc/yaz-marcdump.html

Best regards,
Saiful Amin
DRTC, Bangalore
Re: printing UTF-8 encoded MARC records with as_usmarc
The problem was I wasn't telling perl to output UTF-8. Now that I added binmode(FILE, ':utf8') to my script, the problem is fixed. However, it sounds like once I set binmode to UTF-8 everything will be interpreted as such, even when the record is in MARC-8. Is that right? So this means that I can only use my script with a file of records where all of them are encoded in UTF-8. If I want to run the script against a file with all MARC-8 encoding, then I'd need to remove the binmode line. It doesn't seem possible to say:

    if ($record->encoding() eq 'UTF-8') {
        binmode(FILE, ':utf8');
        print FILE $record->as_usmarc();
    }
    else {
        print FILE $record->as_usmarc();
    }

This will result in messing up the diacritics if a file has a mixture of records in MARC-8 and UTF-8. Is that correct?

Thanks,
Shelley
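[Editorial sketch] One way to handle a mixed file, offered only as a sketch and not as what MARC::Record itself does: leave the output handle in :raw mode and encode each record explicitly instead of switching layers. In MARC 21, leader position 9 is the character coding scheme ('a' for UCS/Unicode, blank for MARC-8). The sketch assumes the serialized record arrives as a decoded character string when leader/09 is 'a' and as plain bytes otherwise; that assumption, and the names leader_says_utf8 and to_output_bytes, are hypothetical and should be checked against the module version in use.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Leader position 9 is the MARC 21 character coding scheme:
# 'a' means UCS/Unicode, a blank means MARC-8.
sub leader_says_utf8 {
    my ($marc) = @_;
    return substr($marc, 9, 1) eq 'a' ? 1 : 0;
}

# Assumption: Unicode records are character strings that still need
# encoding, while MARC-8 records are already bytes and must pass unchanged.
sub to_output_bytes {
    my ($marc) = @_;
    return leader_says_utf8($marc) ? encode('UTF-8', $marc) : $marc;
}
```

With the output handle set to :raw, the loop body would then be print FILE to_output_bytes($record->as_usmarc()); so the decision is made per record rather than per filehandle.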
RE: printing UTF-8 encoded MARC records with as_usmarc
I recently came across a nasty issue with MARC::Record to do with output of MARC-8 encoded records. I was converting XML (which was in UTF-8) into MARC records using MARC::Record and had initially, and successfully, got good UTF-8 encoded MARC records out at the end. However, I then could not load them into our LMS, and realised it was going to be easier at the LMS end if the records were presented in MARC-8.

While the Perl modules largely worked and I got the right MARC-8 representation out at the end, the record length and the field offsets and lengths in the directory got in a real mess, because the top-bit-set characters in MARC-8 got counted as though they were code-points 0x80 to 0xFF encoded as two bytes of UTF-8.

I found a solution by hackily recalculating the lengths when needed, but I thought I'd mention it as the thread has touched on this area.

Matthew

--
Matthew Phillips
Electronic Systems Librarian, Durham University
Durham University Library, Stockton Road, Durham, DH1 3LY
+44 (0)191 334 2941
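[Editorial sketch] The miscount described in this message can be reproduced in isolation. The sketch below is not taken from MARC::Record; it only shows, with a made-up helper named char_and_byte_lengths, that a string holding characters in the range 0x80 to 0xFF is one unit per character when measured as is, but grows by one byte per such character when measured as UTF-8. A directory built from the second figure will not match MARC-8 output that is written one byte per character.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the length of $text in characters and
# the length it would have as UTF-8 encoded bytes.
sub char_and_byte_lengths {
    my ($text) = @_;
    my $chars      = length($text);
    my $utf8_bytes = length(encode('UTF-8', $text));
    return ($chars, $utf8_bytes);
}
```

For a field value such as "caf\xE9", the two figures differ by one, which is exactly the kind of offset error that then accumulates across the directory.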
RE: printing UTF-8 encoded MARC records with as_usmarc
I just recently came across this presentation which lays out pretty much all the issues with Unicode in perl, and makes some recommendations for best practices. You may find some general insight into the whole situation by going over it.

http://www.slideshare.net/nickpatch/fundamental-unicode-at-dcbaltimore-perl-workshop-2012

/dev

--
Devon Smith
Consulting Software Engineer
OCLC Research
http://www.oclc.org/research/people/smith.htm
Re: printing UTF-8 encoded MARC records with as_usmarc
First off, it's entirely possible that you have bad UTF-8 (perhaps rogue MARC-8, perhaps just lousy characters) in your MARC. I know we have plenty of that crap.

You need to tell perl that you'll be outputting UTF-8 using 'binmode':

    binmode(FILE, ':utf8');

In general, you'll want to do this to basically every file you open for reading or writing.

A great overview of Perl and UTF-8 can be found at:

http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default

On Mon, Jul 30, 2012 at 6:51 PM, Shelley Doljack wrote:

> Hi,
>
> I wrote a script that extracts marc records from a file given certain
> conditions and puts them in a new file. When my input record is correctly
> encoded in UTF-8 and I run my script from windows command prompt, this
> warning message appears: "Wide character in print at record_extraction.pl
> line 99" (the line in my script where I print to a new file using
> as_usmarc). I compared the extracted record before and after in MarcEdit
> and the diacritic was changed. I tried marcdump newfile.mrc to see what
> happens and I get this error: "utf8 \xF4 does not map to Unicode at
> C:/Perl64/lib/Encode.pm line 176." When I run my extraction script again
> with MARC-8 encoded data then I don't have the same problem.
>
> The basic outline of my script is:
>
>     my $batch = MARC::Batch->new('USMARC', $input_file);
>
>     while (my $record = $batch->next()) {
>         #do some checks
>         #if checks ok then
>         print FILE $record->as_usmarc();
>     }
>
> Do I need to add something that specifies to interpret the data as UTF-8?
> Does MARC::Record not handle UTF-8 at all?
>
> Thanks,
> Shelley
>
> Shelley Doljack
> E-Resources Metadata Librarian
> Metadata and Library Systems
> Stanford University Libraries
> sdolj...@stanford.edu
> 650-725-0167

--
Bill Dueber
Programmer -- Library Systems
University of Michigan
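[Editorial sketch] The advice above amounts to putting an encoding layer on the output handle so that Perl knows how to turn characters into bytes. The sketch below sets the layer at open time and uses ':encoding(UTF-8)', the stricter relative of the ':utf8' layer named in the message; that swap is this sketch's choice, not the author's. The helper name open_utf8_output is made up, and its target may be a file name or, as in testing, a reference to a scalar buffer.

```perl
use strict;
use warnings;

# Hypothetical helper: open $target for writing with a UTF-8 encoding
# layer already in place, so printed character strings are encoded
# on the way out.
sub open_utf8_output {
    my ($target) = @_;
    open(my $fh, '>:encoding(UTF-8)', $target)
        or die "Cannot open output: $!";
    return $fh;
}
```

Setting the layer once at open time avoids calling binmode repeatedly on the same handle. It suits output that is entirely character data; records that are already encoded bytes should not be sent through such a layer.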