Re: reading and writing of utf-8 with marc::batch

2013-03-26 Thread Paul Hoffman
On Tue, Mar 26, 2013 at 04:22:03PM -0400, Eric Lease Morgan wrote:
 For the life of me I can't figure out how to do reading and writing of 
 UTF-8 with MARC::Batch.
 
 I have a UTF-8 encoded file of MARC records. Dumping the records and 
 grepping for a particular string illustrates the validity:
 
   $ marcdump und.marc | grep Sainte-Face

What is marcdump?

   245 00 _aAnnales de l'Archiconfrérie de la Sainte-Face
   610 20 _aArchiconfrérie de la Sainte-Face
   13000 records
   $ 
 
 I then run a Perl script that simply reads each record and dumps it to 
 STDOUT. Notice how I define both my input and output as UTF-8:

Try *not* calling binmode and see what happens.  Or just call 
binmode(MARC) without the ':utf8' layer.

   245 00 _aAnnales de l'ArchiconfrÃ©rie de la Sainte-Face
   610_aArchiconfrÃ©rie de la Sainte-Face
   13000 records
   $

This looks like double-encoding:

0000  6c 27 41 72 63 68 69 63  6f 6e 66 72 c3 83 c2 a9  |l'ArchiconfrÃ.©|
0010  72 69 65                                          |rie|

In UTF-8, LATIN SMALL LETTER E WITH ACUTE (U+00E9) is supposed to be 
c3 a9 (as it is in the first marcdump output), not c3 83 c2 a9.
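
For illustration, here is a minimal sketch of how the four-byte sequence can 
arise. It assumes the usual cause of double-encoding: bytes that are already 
UTF-8 get treated as Latin-1 characters and are encoded a second time. The 
variable names are made up for the example, and nothing here touches 
MARC::Batch itself.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# "é" (U+00E9) as a Perl character string.
my $char = "\x{e9}";

# Encoding once gives the expected two-byte UTF-8 sequence.
my $once = encode('UTF-8', $char);

# Treating those two bytes as Latin-1 characters and encoding again
# yields four bytes, matching the sequence in the hexdump.
my $twice = encode('UTF-8', decode('ISO-8859-1', $once));

# Print each result as space-separated hex bytes.
printf "once:  %s\n", join ' ', map { sprintf '%02x', ord } split //, $once;
printf "twice: %s\n", join ' ', map { sprintf '%02x', ord } split //, $twice;
```

The first line printed should read c3 a9 and the second c3 83 c2 a9.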

Paul.

-- 
Paul Hoffman nkui...@nkuitse.com


Re: reading and writing of utf-8 with marc::batch

2013-03-26 Thread Timothy Prettyman
Do your records have the UTF-8 encoding byte set in the LDR? (Byte 9 should
be 'a' for UTF-8.)
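
A rough sketch of that check, working on a plain leader string so it runs 
without any MARC modules. The helper name leader_is_utf8 and both sample 
leaders are made up for illustration; with MARC::Record you would pass in 
$record->leader instead.

```perl
use strict;
use warnings;

# Hypothetical helper: true when leader position 09 is 'a',
# the value that flags a record as UCS/Unicode.
sub leader_is_utf8 {
    my ($leader) = @_;
    return length($leader) >= 10 && substr($leader, 9, 1) eq 'a';
}

# Made-up sample leaders (24 characters each); they differ only at position 09.
my $utf8_leader  = '00714cam a2200205 a 4500';
my $marc8_leader = '00714cam  2200205 a 4500';

for my $leader ($utf8_leader, $marc8_leader) {
    printf "%s => %s\n", $leader,
        leader_is_utf8($leader) ? 'flagged as UTF-8' : 'not flagged as UTF-8';
}
```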

-Tim

Timothy Prettyman
University of Michigan Library/LIT


On Tue, Mar 26, 2013 at 4:22 PM, Eric Lease Morgan emor...@nd.edu wrote:


 For the life of me I can't figure out how to do reading and writing of
 UTF-8 with MARC::Batch.

 I have a UTF-8 encoded file of MARC records. Dumping the records and
 grepping for a particular string illustrates the validity:

   $ marcdump und.marc | grep Sainte-Face
   und.marc
   1000 records
   2000 records
   3000 records
   4000 records
   5000 records
   6000 records
   7000 records
   8000 records
   9000 records
   10000 records
   11000 records
   12000 records
   245 00 _aAnnales de l'Archiconfrérie de la Sainte-Face
   610 20 _aArchiconfrérie de la Sainte-Face
   13000 records
   $

 I then run a Perl script that simply reads each record and dumps it to
 STDOUT. Notice how I define both my input and output as UTF-8:

   #!/shared/perl/current/bin/perl

   # configure
   use constant MARC => './und.marc';

   # require
   use strict;
   use MARC::Batch;

   # initialize
   binmode ( MARC, ":utf8" );
   my $batch = MARC::Batch->new( 'USMARC', MARC );
   $batch->strict_off;
   $batch->warnings_off;
   binmode( STDOUT, ":utf8" );

   # read & write
   while ( my $marc = $batch->next ) { print $marc->as_usmarc }

   # done
   exit;

 But my output is munged:

   $ ./marc.pl > und.mrc
   $ marcdump und.mrc | grep Sainte-Face
   und.mrc
   1000 records
   2000 records
   3000 records
   4000 records
   5000 records
   6000 records
   7000 records
   8000 records
   9000 records
   10000 records
   11000 records
   12000 records
   245 00 _aAnnales de l'ArchiconfrÃ©rie de la Sainte-Face
   610_aArchiconfrÃ©rie de la Sainte-Face
   13000 records
   $

 What am I doing wrong!?

 --
 Eric Lease Morgan
 University of Notre Dame

 574/631-8604






Re: reading and writing of utf-8 with marc::batch

2013-03-26 Thread Leif Andersson
Hi Eric,

my first guess would be that your terminal is not UTF-8.
If you comment out
#binmode( STDOUT, ":utf8" );
and that does the trick, then you can start looking for how to change your 
terminal settings.
(And that can sometimes be a rather frustrating task, I'm afraid.)
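
To see what the output layer alone does, independent of any terminal, here 
is a small sketch that prints the same string through a raw handle and 
through a ":utf8" handle, using in-memory files. It assumes the string handed 
to print already holds UTF-8 encoded bytes; whether that is the case for the 
output of as_usmarc is an assumption here, not something verified against 
MARC::Batch.

```perl
use strict;
use warnings;

# Assumption: $bytes stands in for output that is already UTF-8 encoded
# (the "é" is the two bytes c3 a9).
my $bytes = "Archiconfr\xc3\xa9rie";

# Write the string through a raw handle into an in-memory buffer.
my ($raw, $layered) = ('', '');
open my $raw_fh, '>:raw', \$raw or die "cannot open raw buffer: $!";
print {$raw_fh} $bytes;
close $raw_fh;

# Write the same string through a :utf8 layer; each high byte is
# treated as a character and encoded again.
open my $utf8_fh, '>:utf8', \$layered or die "cannot open layered buffer: $!";
print {$utf8_fh} $bytes;
close $utf8_fh;

printf "raw:   %d bytes\n", length $raw;
printf ":utf8: %d bytes\n", length $layered;
```

If the two byte counts differ, the layer has re-encoded the data, which is 
one reason commenting out the binmode call is worth trying.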

/Leif Andersson
Stockholm UL

From: Eric Lease Morgan [emor...@nd.edu]
Sent: 26 March 2013 21:22
To: perl4lib@perl.org
Subject: reading and writing of utf-8 with marc::batch

For the life of me I can't figure out how to do reading and writing of UTF-8 
with MARC::Batch.

I have a UTF-8 encoded file of MARC records. Dumping the records and grepping 
for a particular string illustrates the validity:

  $ marcdump und.marc | grep Sainte-Face
  und.marc
  1000 records
  2000 records
  3000 records
  4000 records
  5000 records
  6000 records
  7000 records
  8000 records
  9000 records
  10000 records
  11000 records
  12000 records
  245 00 _aAnnales de l'Archiconfrérie de la Sainte-Face
  610 20 _aArchiconfrérie de la Sainte-Face
  13000 records
  $

I then run a Perl script that simply reads each record and dumps it to STDOUT. 
Notice how I define both my input and output as UTF-8:

  #!/shared/perl/current/bin/perl

  # configure
  use constant MARC => './und.marc';

  # require
  use strict;
  use MARC::Batch;

  # initialize
  binmode ( MARC, ":utf8" );
  my $batch = MARC::Batch->new( 'USMARC', MARC );
  $batch->strict_off;
  $batch->warnings_off;
  binmode( STDOUT, ":utf8" );

  # read & write
  while ( my $marc = $batch->next ) { print $marc->as_usmarc }

  # done
  exit;

But my output is munged:

  $ ./marc.pl > und.mrc
  $ marcdump und.mrc | grep Sainte-Face
  und.mrc
  1000 records
  2000 records
  3000 records
  4000 records
  5000 records
  6000 records
  7000 records
  8000 records
  9000 records
  10000 records
  11000 records
  12000 records
  245 00 _aAnnales de l'ArchiconfrÃ©rie de la Sainte-Face
  610_aArchiconfrÃ©rie de la Sainte-Face
  13000 records
  $

What am I doing wrong!?

--
Eric Lease Morgan
University of Notre Dame

574/631-8604