Re: reading and writing of utf-8 with marc::batch

Paul Hoffman Tue, 26 Mar 2013 14:11:37 -0700

On Tue, Mar 26, 2013 at 04:22:03PM -0400, Eric Lease Morgan wrote:
> For the life of me I can't figure out how to do reading and writing of 
> UTF-8 with MARC::Batch.
> 
> I have a UTF-8 encoded file of MARC records. Dumping the records and 
> grepping for a particular string illustrates the validity:
> 
>   $ marcdump und.marc | grep Sainte-Face


What is marcdump?

>   245 00 _aAnnales de l'Archiconfrérie de la Sainte-Face
>   610 20 _aArchiconfrérie de la Sainte-Face
>   13000 records
>   $ 
> 
> I then run a Perl script that simply reads each record and dumps it to 
> STDOUT. Notice how I define both my input and output as UTF-8:

Try *not* calling binmode and see what happens.  Or just call 
binmode(MARC) without the ':utf8' layer.
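A minimal sketch of why dropping the layer can matter, assuming the data
reaching print is already UTF-8 encoded bytes (an assumption, not
something verified against MARC::Batch itself).  In-memory filehandles
stand in for the real output handle, and hex_bytes is a helper made up
for this illustration:

```perl
use strict;
use warnings;

# Assumption: the string handed to print already holds UTF-8 bytes,
# here the two bytes that encode U+00E9 (e with acute).
my $utf8_bytes = "\xc3\xa9";

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf('%02x', ord($_)) } split(//, $_[0]) }

my ($raw_out, $layered_out) = ('', '');

# No encoding layer: the bytes pass through unchanged.
open my $raw_fh, '>', \$raw_out or die "open: $!";
binmode($raw_fh);
print {$raw_fh} $utf8_bytes;
close $raw_fh;

# With ':utf8', each byte is treated as a character and encoded again.
open my $utf8_fh, '>', \$layered_out or die "open: $!";
binmode($utf8_fh, ':utf8');
print {$utf8_fh} $utf8_bytes;
close $utf8_fh;

print hex_bytes($raw_out), "\n";      # c3 a9
print hex_bytes($layered_out), "\n";  # c3 83 c2 a9
```

If the strings being printed are decoded character strings instead, the
':utf8' layer is the right choice, so it is worth checking which kind of
data you actually have before changing the binmode call.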

>   245 00 _aAnnales de l'ArchiconfrÃ©rie de la Sainte-Face
>   610    _aArchiconfrÃ©rie de la Sainte-Face
>   13000 records
>   $

This looks like double-encoding:

00000000  6c 27 41 72 63 68 69 63  6f 6e 66 72 c3 83 c2 a9  |l'ArchiconfrÃ.©|
00000010  72 69 65                                          |rie|

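In UTF-8, LATIN SMALL LETTER E WITH ACUTE (U+00E9) is supposed to be 
c3 a9 (as it is in the first marcdump output), not c3 83 c2 a9.  The 
longer sequence is what you get when the bytes c3 and a9 are each 
treated as a Latin-1 character and encoded to UTF-8 a second time.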
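
The byte sequences above can be reproduced with the core Encode module.
This is only an illustrative sketch: hex_bytes is a helper made up for
the example, and the repair step assumes every character in the data was
double-encoded exactly once:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf('%02x', ord($_)) } split(//, $_[0]) }

# U+00E9 LATIN SMALL LETTER E WITH ACUTE as a character string.
my $char = "\x{e9}";

# Encoded once: the expected UTF-8 bytes.
my $once = encode('UTF-8', $char);

# Encoded twice: the bytes c3 and a9 are taken for the characters
# U+00C3 and U+00A9 and each is encoded again.
my $twice = encode('UTF-8', $once);

# Reversing one round: decode the UTF-8, then write the resulting
# characters back out as single Latin-1 bytes.
my $repaired = encode('ISO-8859-1', decode('UTF-8', $twice));

print hex_bytes($once), "\n";      # c3 a9
print hex_bytes($twice), "\n";     # c3 83 c2 a9
print hex_bytes($repaired), "\n";  # c3 a9
```

Repairing the output this way only hides the symptom; the cleaner fix is
to stop the second encoding from happening in the first place.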

Paul.

-- 
Paul Hoffman <nkui...@nkuitse.com>
