Re: reading and writing of utf-8 with marc::batch
On Tue, Mar 26, 2013 at 04:22:03PM -0400, Eric Lease Morgan wrote:

> For the life of me I can't figure out how to do reading and writing of
> UTF-8 with MARC::Batch. I have a UTF-8 encoded file of MARC records.
> Dumping the records and greping for a particular string illustrates the
> validity:
>
>     $ marcdump und.marc | grep Sainte-Face

What is marcdump?

>     245 00 _aAnnales de l'Archiconfrérie de la Sainte-Face
>     610 20 _aArchiconfrérie de la Sainte-Face
>     13000 records
>     $
>
> I then run a Perl script that simply reads each record and dumps it to
> STDOUT. Notice how I define both my input and output as UTF-8:

Try *not* calling binmode and see what happens. Or just call binmode(MARC) without the ':utf8' layer.

>     245 00 _aAnnales de l'Archiconfrérie de la Sainte-Face
>     610_aArchiconfrérie de la Sainte-Face
>     13000 records
>     $

This looks like double-encoding:

    6c 27 41 72 63 68 69 63 6f 6e 66 72 c3 83 c2 a9 |l'ArchiconfrÃ.©|
    0010 72 69 65 |rie|

LATIN SMALL LETTER E WITH ACUTE is supposed to be c3 a9 (as it is in the first marcdump output), not c3 83 c2 a9.

Paul.

-- 
Paul Hoffman nkui...@nkuitse.com
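The byte sequence above can be reproduced in isolation. The following is a minimal sketch using only the core Encode module; it does not involve MARC::Batch and makes no claim about what that module does internally. It only shows one way that c3 a9 can turn into c3 83 c2 a9: encoding to UTF-8 a string whose bytes are already UTF-8. The helper name hex_bytes is made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

# The two UTF-8 bytes for U+00E9, LATIN SMALL LETTER E WITH ACUTE.
my $utf8_bytes = "\xC3\xA9";

# Encoding bytes that are already UTF-8: each byte is treated as a
# Latin-1 character (U+00C3, U+00A9) and encoded a second time.
my $double = encode('UTF-8', $utf8_bytes);

# Decoding once on input and encoding once on output round-trips cleanly.
my $chars  = decode('UTF-8', $utf8_bytes);
my $single = encode('UTF-8', $chars);

print hex_bytes($double), "\n";   # c3 83 c2 a9
print hex_bytes($single), "\n";   # c3 a9
```

Printing already-encoded bytes to a handle that has a ':utf8' layer applies the same second encoding step, which is why removing the binmode call is a reasonable first experiment.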
Re: reading and writing of utf-8 with marc::batch
Do your records have the utf8 encoding byte set in the LDR? (Byte 9 should be 'a' for utf8).

-Tim

Timothy Prettyman
University of Michigan LIbrary/LIT

On Tue, Mar 26, 2013 at 4:22 PM, Eric Lease Morgan emor...@nd.edu wrote:

> For the life of me I can't figure out how to do reading and writing of
> UTF-8 with MARC::Batch. I have a UTF-8 encoded file of MARC records.
> Dumping the records and greping for a particular string illustrates the
> validity:
>
>     $ marcdump und.marc | grep Sainte-Face
>     und.marc
>     1000 records
>     2000 records
>     3000 records
>     4000 records
>     5000 records
>     6000 records
>     7000 records
>     8000 records
>     9000 records
>     10000 records
>     11000 records
>     12000 records
>     245 00 _aAnnales de l'Archiconfrérie de la Sainte-Face
>     610 20 _aArchiconfrérie de la Sainte-Face
>     13000 records
>     $
>
> I then run a Perl script that simply reads each record and dumps it to
> STDOUT. Notice how I define both my input and output as UTF-8:
>
>     #!/shared/perl/current/bin/perl
>
>     # configure
>     use constant MARC => './und.marc';
>
>     # require
>     use strict;
>     use MARC::Batch;
>
>     # initialize
>     binmode ( MARC, ':utf8' );
>     my $batch = MARC::Batch->new( 'USMARC', MARC );
>     $batch->strict_off;
>     $batch->warnings_off;
>     binmode( STDOUT, ':utf8' );
>
>     # read write
>     while ( my $marc = $batch->next ) { print $marc->as_usmarc }
>
>     # done
>     exit;
>
> But my output is munged:
>
>     $ ./marc.pl > und.mrc
>     $ marcdump und.mrc | grep Sainte-Face
>     und.mrc
>     1000 records
>     2000 records
>     3000 records
>     4000 records
>     5000 records
>     6000 records
>     7000 records
>     8000 records
>     9000 records
>     10000 records
>     11000 records
>     12000 records
>     245 00 _aAnnales de l'Archiconfrérie de la Sainte-Face
>     610_aArchiconfrérie de la Sainte-Face
>     13000 records
>     $
>
> What am I doing wrong!?
>
> --
> Eric Lease Morgan
> University of Notre Dame
> 574/631-8604
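On the Leader question raised in this reply: in MARC 21, Leader character position 09 is the character coding scheme, where 'a' means UCS/Unicode and a blank means MARC-8. A minimal sketch of that check on a raw record string follows. The function name leader_is_unicode and the two sample leaders are invented for illustration and are not taken from und.marc.

```perl
use strict;
use warnings;

# Return true if the record's Leader declares UCS/Unicode, i.e.
# character position 09 (zero-based) is 'a'. A blank there means MARC-8.
sub leader_is_unicode {
    my ($raw_record) = @_;
    return 0 if length($raw_record) < 24;    # the Leader is 24 characters
    return substr($raw_record, 9, 1) eq 'a' ? 1 : 0;
}

# Hypothetical 24-character leaders, for illustration only.
my $unicode_leader = '00714cam a2200205 a 4500';
my $marc8_leader   = '00714cam  2200205 a 4500';

print leader_is_unicode($unicode_leader) ? "Unicode\n" : "MARC-8\n";
print leader_is_unicode($marc8_leader)   ? "Unicode\n" : "MARC-8\n";
```

With a MARC::Record object in hand, the same test could be written as substr($record->leader, 9, 1) eq 'a', since leader() returns the Leader as a string.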
Re: reading and writing of utf-8 with marc::batch
Hi Eric,

my first guess would be that your terminal is not utf8. If you comment out

    #binmode( STDOUT, ':utf8' );

and that does the trick, then you can start looking for how to change your terminal settings. (And that can sometimes be a rather frustrating task, I'm afraid.)

/Leif Andersson
Stockholm UL

From: Eric Lease Morgan [emor...@nd.edu]
Sent: 26 March 2013 21:22
To: perl4lib@perl.org
Subject: reading and writing of utf-8 with marc::batch

For the life of me I can't figure out how to do reading and writing of UTF-8 with MARC::Batch. I have a UTF-8 encoded file of MARC records. Dumping the records and greping for a particular string illustrates the validity:

    $ marcdump und.marc | grep Sainte-Face
    und.marc
    1000 records
    2000 records
    3000 records
    4000 records
    5000 records
    6000 records
    7000 records
    8000 records
    9000 records
    10000 records
    11000 records
    12000 records
    245 00 _aAnnales de l'Archiconfrérie de la Sainte-Face
    610 20 _aArchiconfrérie de la Sainte-Face
    13000 records
    $

I then run a Perl script that simply reads each record and dumps it to STDOUT. Notice how I define both my input and output as UTF-8:

    #!/shared/perl/current/bin/perl

    # configure
    use constant MARC => './und.marc';

    # require
    use strict;
    use MARC::Batch;

    # initialize
    binmode ( MARC, ':utf8' );
    my $batch = MARC::Batch->new( 'USMARC', MARC );
    $batch->strict_off;
    $batch->warnings_off;
    binmode( STDOUT, ':utf8' );

    # read write
    while ( my $marc = $batch->next ) { print $marc->as_usmarc }

    # done
    exit;

But my output is munged:

    $ ./marc.pl > und.mrc
    $ marcdump und.mrc | grep Sainte-Face
    und.mrc
    1000 records
    2000 records
    3000 records
    4000 records
    5000 records
    6000 records
    7000 records
    8000 records
    9000 records
    10000 records
    11000 records
    12000 records
    245 00 _aAnnales de l'Archiconfrérie de la Sainte-Face
    610_aArchiconfrérie de la Sainte-Face
    13000 records
    $

What am I doing wrong!?

--
Eric Lease Morgan
University of Notre Dame
574/631-8604