Re: reading and writing of utf-8 with marc::batch [resolved; gigo]

2013-03-28 Thread Eric Lease Morgan


Thank you for all the input, and I think I have resolved my particular issue. 
Battle won. War still raging.

Using the script suggested by Galen as a starting point, I wrote the following 
hack outputting integers denoting MARC records containing non-UTF-8 characters, 
but the script output nothing; all the data in all of my records was encoded as 
UTF-8:

  #!/usr/bin/perl

  # require
  use strict;
  use Encode;

  # initialize
  binmode STDIN, ":bytes";
  $/= "\035"; 
  my $i = 0;

  # read STDIN
  while ( <> ) {

  # increment
  $i++;

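  # check validity; decode() croaks on malformed UTF-8 (a copy is decoded
  # because decode() may modify its argument when a CHECK value is given)
  eval { my $bytes = $_; Encode::decode( 'UTF-8', $bytes, Encode::FB_CROAK ); };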

  # check for error
  if ( $@ ) { print "Record $i contains non-UTF-8 characters\n"; }

  }

  # done
  exit;


Since all of the data in all of my records was UTF-8, then all of the leaders 
of all of the records need to have a value of "a" set in position #9 of the 
leader. So I wrote the following hack (circumventing MARC::Batch):

  #!/usr/bin/perl

  # require
  use strict;

  # initialize
  binmode STDIN,  ":bytes";
  binmode STDOUT, ":bytes";
  $/ = "\035"; 

  # loop through the input
  while ( <> ) {

  # do the work and output
  substr( $_, 9, 1 ) = "a";
  print $_;

  }

  # done
  exit;
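
The fix above sets position #9 unconditionally, which is only safe because
every record had already been checked. A minimal sketch of a guarded variant
follows, assuming the Encode module; mark_as_utf8 is a hypothetical helper
name, and the sample records are made-up bytes (a 24-byte leader plus a few
data bytes), not well-formed MARC:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of MARC::Batch): return the record with
# leader position 9 set to "a", but only if its bytes decode as UTF-8.
# Returns undef when the record is too short or is not valid UTF-8.
sub mark_as_utf8 {
    my ($record) = @_;
    return if length($record) < 24;    # a MARC leader is 24 bytes long
    my $copy = $record;                # decode() may modify its source
    my $ok = eval { Encode::decode( 'UTF-8', $copy, Encode::FB_CROAK() ); 1 };
    return unless $ok;
    substr( $record, 9, 1 ) = 'a';
    return $record;
}

# demonstration on made-up data rather than a real file
my $good = '00030nam  2200025   4500' . "caf\xC3\xA9" . "\x1E\x1D";
my $bad  = '00029nam  2200025   4500' . "caf\xE9"     . "\x1E\x1D";
for my $record ( $good, $bad ) {
    my $fixed = mark_as_utf8($record);
    print defined $fixed
        ? 'marked: ' . substr( $fixed, 0, 24 ) . "\n"
        : "left alone (not valid UTF-8)\n";
}
```

Records that fail the check are left untouched, so a stray MARC-8 or Latin-1
record is not mislabeled as UTF-8.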


I then fed the output of my fix routine to my indexing routine, and all of my 
problems seemed to go away. GIGO?

I'm still not sure, but I think deep within MARC::Batch some sort of encoding 
is observed, honored, and output. And when the denoted encoding is not true and 
things like binmode( FILE, ":utf8" ) get called, output gets munged. Again, I'm 
not sure. It is almost exhausting.


-- 
Eric Morgan
University of Notre Dame

Re: reading and writing of utf-8 with marc::batch [double encoding]

2013-03-28 Thread Ashley Sanders

Eric,

> How can I figure out whether or not a MARC record contains ONLY characters 
> from the UTF-8 character set?

You can use a regex to check if a string is utf-8. There are various examples
floating around the internet. An example is the one here:

   http://www.w3.org/International/questions/qa-forms-utf-8

You'll need to add the MARC control characters ^_, ^^, and ^] to the ASCII part
of the expression in the above page. (I think the W3C example is aimed at XML 1.0
in which the MARC control characters are not allowed.)
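
A minimal sketch of that approach follows. The byte ranges are the ones from
the W3C page, with \x1D (^]), \x1E (^^) and \x1F (^_) added to the ASCII
branch; the names $UTF8_CHAR and looks_like_utf8 are made up for this
example. Note that the pattern is stricter than UTF-8 itself, because it
rejects other control characters such as \x00 or \x7F:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One well-formed UTF-8 character, following the W3C pattern, with the
# MARC delimiters added to the ASCII branch.
my $UTF8_CHAR = qr/
      [\x09\x0A\x0D\x1D-\x1F\x20-\x7E]   # ASCII plus MARC delimiters
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
    |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
/x;

# True if the byte string consists only of characters matched above.
sub looks_like_utf8 {
    my ($bytes) = @_;
    pos($bytes) = 0;
    1 while $bytes =~ /\G$UTF8_CHAR/gc;    # consume one character at a time
    return pos($bytes) == length($bytes);
}

# made-up sample data: "café" as UTF-8 and as Latin-1, each with MARC terminators
for my $sample ( "caf\xC3\xA9\x1E\x1D", "caf\xE9\x1E\x1D" ) {
    print looks_like_utf8($sample) ? "valid UTF-8\n" : "not valid UTF-8\n";
}
```

Matching one character at a time with \G avoids applying a single large
quantified group to a whole record.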

Ashley.
--
Ashley Sanders a.sand...@manchester.ac.uk
http://copac.ac.uk -- A Mimas service funded by JISC at the University of 
Manchester

Re: reading and writing of utf-8 with marc::batch [double encoding]

2013-03-27 Thread Galen Charlton

Hi,


On Wed, Mar 27, 2013 at 2:11 PM, Eric Lease Morgan  wrote:

> Put another way, how can I determine whether or not position #9 of a given
> MARC leader is accurate? If position #9 is an "a", then how can I read the
> balance of the record to determine whether or not all the characters really
> and truly are UTF-8 encoded?
>

The following program will read a file of MARC records from standard input
and classify each as either being valid UTF-8 or not.

___START
#!/usr/bin/perl

use Encode;

binmode STDIN, ':bytes';

$/ = "\035"; # MARC record terminator
my $i = 0;
while (<>) {
    $i++;
    my $bytes = $_;
    # decode() croaks on malformed input, so $@ is set only for invalid records
    eval {
        my $utf8str = Encode::decode('UTF-8', $bytes, Encode::FB_CROAK);
    };
    if ($@) {
        print "Record $i definitely not valid UTF-8\n";
    } else {
        print "Record $i is valid UTF-8\n";
    }
}
___END
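
To answer the question about Leader/09 more directly, the same check can be
compared with what the leader claims. This is only a sketch under stated
assumptions: check_leader09 and its return labels are hypothetical, the
sample records are made-up bytes rather than well-formed MARC, and a record
that fails the check may be MARC-8, Latin-1, or simply damaged.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: compare what Leader/09 claims with what the bytes are.
# Returns one of 'ascii-only', 'utf8-flagged', 'utf8-unflagged',
# 'invalid-utf8-flagged' or 'not-utf8'.
sub check_leader09 {
    my ($record) = @_;
    my $flagged = length($record) > 9 && substr( $record, 9, 1 ) eq 'a';

    # pure ASCII is valid under either value, so there is nothing to verify
    return 'ascii-only' unless $record =~ /[^\x00-\x7F]/;

    my $copy = $record;    # decode() may modify its source
    my $is_utf8 = eval { Encode::decode( 'UTF-8', $copy, Encode::FB_CROAK() ); 1 };

    return $is_utf8
        ? ( $flagged ? 'utf8-flagged'         : 'utf8-unflagged' )
        : ( $flagged ? 'invalid-utf8-flagged' : 'not-utf8' );
}

# made-up sample data: leader with and without "a" in position 9
my $with_a    = '00030nam a2200025   4500';
my $without_a = '00030nam  2200025   4500';
print check_leader09( $without_a . "caf\xC3\xA9\x1E\x1D" ), "\n";
print check_leader09( $with_a    . "caf\xE9\x1E\x1D" ),     "\n";
```

The two interesting results are 'utf8-unflagged' (UTF-8 data whose leader
does not say so) and 'invalid-utf8-flagged' (a leader that claims UTF-8 for
data that is not).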

Regards,

Galen
-- 
Galen Charlton
gmcha...@gmail.com

Re: reading and writing of utf-8 with marc::batch [double encoding]

2013-03-27 Thread Shelley Doljack

I use MarcEdit to view records and check if the mnemonic form of a diacritic 
(e.g. {eacute}) appears or not and what the LDR/09 value is. That's the best 
way I've come up with so far. MarcEdit is pretty good at guessing what the 
character encoding is without relying on the LDR/09 value. I think there are 
some perl modules you could use that "guess" what the encoding is of a 
character but I've never used them. I'm interested in finding out other methods 
(preferably automated) for detecting wrong or mixed character encodings in a 
MARC record. 

Shelley

- Original Message -
> From: "Eric Lease Morgan" 
> To: perl4lib@perl.org
> Sent: Wednesday, March 27, 2013 2:11:26 PM
> Subject: Re: reading and writing of utf-8 with marc::batch [double encoding]
> 
> 
> On Mar 27, 2013, at 4:59 PM, Eric Lease Morgan 
> wrote:
> 
> > When it calls as_usmarc, I think MARC::Batch tries to honor the
> > value set in position #9 of the leader. In other words, if the
> > leader is empty, then it tries to output records as MARC-8, and
> > when the leader is a value of "a", it tries to encode the data as
> > UTF-8.
> 
> How can I figure out whether or not a MARC record contains ONLY
> characters from the UTF-8 character set?
> 
> Put another way, how can I determine whether or not position #9 of a
> given MARC leader is accurate? If position #9 is an "a", then how
> can I read the balance of the record to determine whether or not all
> the characters really and truly are UTF-8 encoded?
> 
> --
> Eric "This Is Almost Too Much For Me" Morgan
> 
>

Re: reading and writing of utf-8 with marc::batch [double encoding]

2013-03-27 Thread Eric Lease Morgan

On Mar 27, 2013, at 4:59 PM, Eric Lease Morgan  wrote:

> When it calls as_usmarc, I think MARC::Batch tries to honor the value set in 
> position #9 of the leader. In other words, if the leader is empty, then it 
> tries to output records as MARC-8, and when the leader is a value of "a", it 
> tries to encode the data as UTF-8.

How can I figure out whether or not a MARC record contains ONLY characters from 
the UTF-8 character set?

Put another way, how can I determine whether or not position #9 of a given MARC 
leader is accurate? If position #9 is an "a", then how can I read the balance 
of the record to determine whether or not all the characters really and truly 
are UTF-8 encoded?

--
Eric "This Is Almost Too Much For Me" Morgan

Re: reading and writing of utf-8 with marc::batch [double encoding]

2013-03-27 Thread Eric Lease Morgan

On Mar 27, 2013, at 2:20 PM, Eric Lease Morgan  wrote:

> A number of people have alluded to the problem of double encoding, and I'm 
> beginning to think this is true. 

When it calls as_usmarc, I think MARC::Batch tries to honor the value set in 
position #9 of the leader. In other words, if the leader is empty, then it 
tries to output records as MARC-8, and when the leader is a value of "a", it 
tries to encode the data as UTF-8.

If I employ binmode( OUTFILE, ":utf8"), and the output is already UTF-8, then 
double encoding happens. 
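
The effect of the ":utf8" layer on data that is already UTF-8 can be
reproduced without MARC::Batch. A minimal sketch, using in-memory
filehandles; hex_written_with is a made-up helper name, and the sketch says
nothing about what MARC::Batch does internally:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Write $bytes to an in-memory file through the given PerlIO layer and
# return a space-separated hex dump of what was actually written.
sub hex_written_with {
    my ( $layer, $bytes ) = @_;
    my $buffer = '';
    open my $out, ">$layer", \$buffer or die "cannot open in-memory file: $!";
    print {$out} $bytes;
    close $out;
    return join ' ', map { sprintf '%02x', ord } split //, $buffer;
}

my $e_acute = "\xC3\xA9";    # "é", already encoded as UTF-8 bytes

print ':raw  -> ', hex_written_with( ':raw',  $e_acute ), "\n";
print ':utf8 -> ', hex_written_with( ':utf8', $e_acute ), "\n";
```

With ":raw" the two bytes pass through unchanged (c3 a9). With ":utf8" Perl
treats each byte as a Latin-1 character and encodes it again, which yields
c3 83 c2 a9.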

To test this theory, I fixed a number of records in my batch. Specifically, I 
inserted the letter "a" in position #9 of the leader. I then ran my processing 
file WITHOUT the employment of binmode, and my output was correct. For example, 
look at all the glorious characters in the following URL:

  http://www.catholicresearch.net/vufind/Record/undmarc_001906501

--
Eric Lease Morgan
Hesburgh Libraries
University of Notre Dame

574/631-8604

Re: reading and writing of utf-8 with marc::batch [double encoding]

2013-03-27 Thread Galen Charlton

Hi,

On Wed, Mar 27, 2013 at 11:20 AM, Eric Lease Morgan  wrote:

> I have isolated a number of problem records. They all contain diacritics,
> but they do not have an "a" in position #9 of the leader --
> http://dh.crc.nd.edu/tmp/original.marc  Can someone verify that the file
> contains UTF-8 characters for me?
>

I've eyeballed it and confirm that the encoding of that file is UTF-8.

> For these same records I have also added an "a" in position #9 and created
> a similar file -- http://dh.crc.nd.edu/tmp/fixed.marc

I've looked this over as well.

> Is it true that original.marc is not denoted correctly, but fixed.marc is
> denoted correctly?
>

Yes.  The Leader/09 must be set to 'a' if the character encoding in use is
UTF-8.

Regards,

Galen
-- 
Galen Charlton
gmcha...@gmail.com

Re: reading and writing of utf-8 with marc::batch [double encoding]

2013-03-27 Thread Eric Lease Morgan


A number of people have alluded to the problem of double encoding, and I'm 
beginning to think this is true. 

I have isolated a number of problem records. They all contain diacritics, but 
they do not have an "a" in position #9 of the leader -- 
http://dh.crc.nd.edu/tmp/original.marc  Can someone verify that the file 
contains UTF-8 characters for me?

For these same records I have also added an "a" in position #9 and created a 
similar file -- http://dh.crc.nd.edu/tmp/fixed.marc  

Is it true that original.marc is not denoted correctly, but fixed.marc is 
denoted correctly?

-- 
Eric Morgan

Re: reading and writing of utf-8 with marc::batch

2013-03-27 Thread Shelley Doljack

Whenever I see characters like Ã©, I consult this website 
http://www.i18nqa.com/debug/bug-utf-8-latin1.html to help me figure out what's 
going on. You might find it helpful too.

Shelley

- Original Message -
> From: "Eric Lease Morgan" 
> To: perl4lib@perl.org
> Sent: Tuesday, March 26, 2013 1:22:03 PM
> Subject: reading and writing of utf-8 with marc::batch
> 
> 
> For the life of me I can't figure out how to do reading and writing
> of UTF-8 with MARC::Batch.
> 
> I have a UTF-8 encoded file of MARC records. Dumping the records and
> grepping for a particular string illustrates the validity:
> 
>   $ marcdump und.marc | grep Sainte-Face
>   und.marc
>   1000 records
>   2000 records
>   3000 records
>   4000 records
>   5000 records
>   6000 records
>   7000 records
>   8000 records
>   9000 records
>   10000 records
>   11000 records
>   12000 records
>   245 00 _aAnnales de l'Archiconfrérie de la Sainte-Face
>   610 20 _aArchiconfrérie de la Sainte-Face
>   13000 records
>   $
> 
> I then run a Perl script that simply reads each record and dumps it
> to STDOUT. Notice how I define both my input and output as UTF-8:
> 
>   #!/shared/perl/current/bin/perl
> 
>   # configure
>   use constant MARC => './und.marc';
> 
>   # require
>   use strict;
>   use MARC::Batch;
> 
>   # initialize
>   binmode ( MARC, ":utf8" );
>   my $batch = MARC::Batch->new( 'USMARC', MARC );
>   $batch->strict_off;
>   $batch->warnings_off;
>   binmode( STDOUT, ":utf8" );
> 
>   # read & write
>   while ( my $marc = $batch->next ) { print $marc->as_usmarc }
> 
>   # done
>   exit;
> 
> But my output is munged:
> 
>   $ ./marc.pl > und.mrc
>   $ marcdump und.mrc | grep Sainte-Face
>   und.mrc
>   1000 records
>   2000 records
>   3000 records
>   4000 records
>   5000 records
>   6000 records
>   7000 records
>   8000 records
>   9000 records
>   10000 records
>   11000 records
>   12000 records
>   245 00 _aAnnales de l'ArchiconfrÃ©rie de la Sainte-Face
>   610_aArchiconfrÃ©rie de la Sainte-Face
>   13000 records
>   $
> 
> What am I doing wrong!?
> 
> --
> Eric Lease Morgan
> University of Notre Dame
> 
> 574/631-8604
> 
> 
> 
> 

-- 
Shelley Doljack  
E-Resources Metadata Librarian 
Metadata Department
Stanford University Libraries
sdolj...@stanford.edu
650-725-0167

Re: reading and writing of utf-8 with marc::batch [terminal]

2013-03-27 Thread Galen Charlton

Hi Eric,

On Wed, Mar 27, 2013 at 10:26 AM, Eric Lease Morgan  wrote:

> While I'm not positive my terminal is doing UTF-8, I think it is. When I
> dump in the beginning the output to the terminal is correct. After I run my
> script the output to the same terminal is incorrect.
>

Would you be willing to put up a link to your MARC file?  I'm willing to
take a quick look to see if I can reproduce the problem you're seeing.

Regards,

Galen
-- 
Galen Charlton
gmcha...@gmail.com

Re: reading and writing of utf-8 with marc::batch [terminal]

2013-03-27 Thread Eric Lease Morgan


On Mar 26, 2013, at 5:57 PM, Leif Andersson  wrote:

> my first guess would be your terminal is not utf8.

While I'm not positive my terminal is doing UTF-8, I think it is. When I dump 
in the beginning the output to the terminal is correct. After I run my script 
the output to the same terminal is incorrect. 

--
Eric Lease Morgan

Re: reading and writing of utf-8 with marc::batch

2013-03-27 Thread Galen Charlton

Hi,

On Wed, Mar 27, 2013 at 7:01 AM, Jon Gorman wrote:

> One piece of advice is not to trust the terminal directly but pipe
> into xxd. (And if possible, just try transforming the offending
> record).  Or use yaz-marcdump -v, which will also give the hex if I
> remember correctly.  (If it's c3 a9 in both cases, you know the
> terminal is at fault)
>

Another trick is to pipe the output through less with the LESSCHARSET
environment variable set to 'ascii'.  Bytes whose value is less than 32 or
greater than 126 will be displayed as reverse-video hexadecimal numbers,
e.g.,

Garci<81>a Ma<81>rquez, Gabriel,
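
The same kind of display can be produced from within Perl, which is handy
when the bytes never reach a terminal. A small sketch; show_bytes is a
made-up helper name, and it marks every byte outside printable ASCII:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Render every byte outside printable ASCII as <XX>, similar to what less
# shows for such bytes.
sub show_bytes {
    my ($bytes) = @_;
    ( my $shown = $bytes ) =~ s/([^\x20-\x7E])/sprintf '<%02X>', ord $1/ge;
    return $shown;
}

print show_bytes("Archiconfr\xC3\xA9rie"),         "\n";    # encoded once
print show_bytes("Archiconfr\xC3\x83\xC2\xA9rie"), "\n";    # encoded twice
```

A single "é" shows up as <C3><A9>; four marked bytes in its place point to
double encoding rather than to a terminal problem.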

Regards,

Galen
-- 
Galen Charlton
gmcha...@gmail.com

Re: reading and writing of utf-8 with marc::batch

2013-03-27 Thread Jon Gorman

Ok, I can't claim to be an expert, but from my own experience, I'd say
Paul is very likely right about double-encoding occurring.  However,
the question ends up being where that happens, and in this case I
suspect how MARC::Batch behaves could depend heavily on which version
of Perl and which version of MARC::Batch you're running.  Knowing those
versions might help too (I'd try to be on a later version of Perl and the
latest MARC::Batch).  It also depends on how you're generating the
MARC records, which isn't really clear to me.

It could also be the leaders or the terminal, as others have suggested.

One piece of advice is not to trust the terminal directly but pipe
into xxd. (And if possible, just try transforming the offending
record).  Or use yaz-marcdump -v, which will also give the hex if I
remember correctly.  (If it's c3 a9 in both cases, you know the
terminal is at fault)

Then try doing that without the binmode, w/ binmode :raw, etc.

Jon Gorman

RE: reading and writing of utf-8 with marc::batch

2013-03-27 Thread KREYCHE, MICHAEL

Eric--

I'm with Leif. The output you got looks like utf-8 displayed on a terminal that 
doesn't support it. Whether you need to fix the terminal display is another 
matter--I've never felt compelled to do so. 

Anyway, I think you can now sign yourself "Eric Did-it-right-the-first-time 
Morgan!"

Mike

> -Original Message-
> From: Leif Andersson [mailto:leif.anders...@sub.su.se]
> Sent: Tuesday, March 26, 2013 5:57 PM
> To: Eric Lease Morgan; perl4lib@perl.org
> Subject: Re: reading and writing of utf-8 with marc::batch
> 
> Hi Eric,
> 
> my first guess would be your terminal is not utf8.
> If you comment out
> #binmode( STDOUT, ":utf8" );
> and that does the trick, then you can start looking for how to change
> your terminal settings.
> (And that can sometimes be a rather frustrating task, I'm afraid)
> 
> /Leif Andersson
> Stockholm UL
> 
> From: Eric Lease Morgan [emor...@nd.edu]
> Sent: 26 March 2013 21:22
> To: perl4lib@perl.org
> Subject: reading and writing of utf-8 with marc::batch
> 
> For the life of me I can't figure out how to do reading and writing of
> UTF-8 with MARC::Batch.
> 
> I have a UTF-8 encoded file of MARC records. Dumping the records and
> grepping for a particular string illustrates the validity:
> 
>   $ marcdump und.marc | grep Sainte-Face
>   und.marc
>   1000 records
>   2000 records
>   3000 records
>   4000 records
>   5000 records
>   6000 records
>   7000 records
>   8000 records
>   9000 records
>   10000 records
>   11000 records
>   12000 records
>   245 00 _aAnnales de l'Archiconfrérie de la Sainte-Face
>   610 20 _aArchiconfrérie de la Sainte-Face
>   13000 records
>   $
> 
> I then run a Perl script that simply reads each record and dumps it to
> STDOUT. Notice how I define both my input and output as UTF-8:
> 
>   #!/shared/perl/current/bin/perl
> 
>   # configure
>   use constant MARC => './und.marc';
> 
>   # require
>   use strict;
>   use MARC::Batch;
> 
>   # initialize
>   binmode ( MARC, ":utf8" );
>   my $batch = MARC::Batch->new( 'USMARC', MARC );
>   $batch->strict_off;
>   $batch->warnings_off;
>   binmode( STDOUT, ":utf8" );
> 
>   # read & write
>   while ( my $marc = $batch->next ) { print $marc->as_usmarc }
> 
>   # done
>   exit;
> 
> But my output is munged:
> 
>   $ ./marc.pl > und.mrc
>   $ marcdump und.mrc | grep Sainte-Face
>   und.mrc
>   1000 records
>   2000 records
>   3000 records
>   4000 records
>   5000 records
>   6000 records
>   7000 records
>   8000 records
>   9000 records
>   10000 records
>   11000 records
>   12000 records
>   245 00 _aAnnales de l'ArchiconfrÃ©rie de la Sainte-Face
>   610_aArchiconfrÃ©rie de la Sainte-Face
>   13000 records
>   $
> 
> What am I doing wrong!?
> 
> --
> Eric Lease Morgan
> University of Notre Dame
> 
> 574/631-8604

Re: reading and writing of utf-8 with marc::batch

2013-03-26 Thread Leif Andersson

Hi Eric,

my first guess would be your terminal is not utf8.
If you comment out
#binmode( STDOUT, ":utf8" );
and that does the trick, then you can start looking for how to change your 
terminal settings.
(And that can sometimes be a rather frustrating task, I'm afraid)

/Leif Andersson
Stockholm UL

From: Eric Lease Morgan [emor...@nd.edu]
Sent: 26 March 2013 21:22
To: perl4lib@perl.org
Subject: reading and writing of utf-8 with marc::batch

For the life of me I can't figure out how to do reading and writing of UTF-8 
with MARC::Batch.

I have a UTF-8 encoded file of MARC records. Dumping the records and grepping 
for a particular string illustrates the validity:

  $ marcdump und.marc | grep Sainte-Face
  und.marc
  1000 records
  2000 records
  3000 records
  4000 records
  5000 records
  6000 records
  7000 records
  8000 records
  9000 records
  10000 records
  11000 records
  12000 records
  245 00 _aAnnales de l'Archiconfrérie de la Sainte-Face
  610 20 _aArchiconfrérie de la Sainte-Face
  13000 records
  $

I then run a Perl script that simply reads each record and dumps it to STDOUT. 
Notice how I define both my input and output as UTF-8:

  #!/shared/perl/current/bin/perl

  # configure
  use constant MARC => './und.marc';

  # require
  use strict;
  use MARC::Batch;

  # initialize
  binmode ( MARC, ":utf8" );
  my $batch = MARC::Batch->new( 'USMARC', MARC );
  $batch->strict_off;
  $batch->warnings_off;
  binmode( STDOUT, ":utf8" );

  # read & write
  while ( my $marc = $batch->next ) { print $marc->as_usmarc }

  # done
  exit;

But my output is munged:

  $ ./marc.pl > und.mrc
  $ marcdump und.mrc | grep Sainte-Face
  und.mrc
  1000 records
  2000 records
  3000 records
  4000 records
  5000 records
  6000 records
  7000 records
  8000 records
  9000 records
  10000 records
  11000 records
  12000 records
  245 00 _aAnnales de l'ArchiconfrÃ©rie de la Sainte-Face
  610_aArchiconfrÃ©rie de la Sainte-Face
  13000 records
  $

What am I doing wrong!?

--
Eric Lease Morgan
University of Notre Dame

574/631-8604

Re: reading and writing of utf-8 with marc::batch

2013-03-26 Thread Timothy Prettyman

Do your records have the utf8 encoding byte set  in the LDR? (Byte 9 should
be 'a' for utf8).

-Tim

Timothy Prettyman
University of Michigan Library/LIT


On Tue, Mar 26, 2013 at 4:22 PM, Eric Lease Morgan  wrote:

>
> For the life of me I can't figure out how to do reading and writing of
> UTF-8 with MARC::Batch.
>
> I have a UTF-8 encoded file of MARC records. Dumping the records and
> grepping for a particular string illustrates the validity:
>
>   $ marcdump und.marc | grep Sainte-Face
>   und.marc
>   1000 records
>   2000 records
>   3000 records
>   4000 records
>   5000 records
>   6000 records
>   7000 records
>   8000 records
>   9000 records
>   10000 records
>   11000 records
>   12000 records
>   245 00 _aAnnales de l'Archiconfrérie de la Sainte-Face
>   610 20 _aArchiconfrérie de la Sainte-Face
>   13000 records
>   $
>
> I then run a Perl script that simply reads each record and dumps it to
> STDOUT. Notice how I define both my input and output as UTF-8:
>
>   #!/shared/perl/current/bin/perl
>
>   # configure
>   use constant MARC => './und.marc';
>
>   # require
>   use strict;
>   use MARC::Batch;
>
>   # initialize
>   binmode ( MARC, ":utf8" );
>   my $batch = MARC::Batch->new( 'USMARC', MARC );
>   $batch->strict_off;
>   $batch->warnings_off;
>   binmode( STDOUT, ":utf8" );
>
>   # read & write
>   while ( my $marc = $batch->next ) { print $marc->as_usmarc }
>
>   # done
>   exit;
>
> But my output is munged:
>
>   $ ./marc.pl > und.mrc
>   $ marcdump und.mrc | grep Sainte-Face
>   und.mrc
>   1000 records
>   2000 records
>   3000 records
>   4000 records
>   5000 records
>   6000 records
>   7000 records
>   8000 records
>   9000 records
>   10000 records
>   11000 records
>   12000 records
>   245 00 _aAnnales de l'ArchiconfrÃ©rie de la Sainte-Face
>   610_aArchiconfrÃ©rie de la Sainte-Face
>   13000 records
>   $
>
> What am I doing wrong!?
>
> --
> Eric Lease Morgan
> University of Notre Dame
>
> 574/631-8604
>
>
>
>

Re: reading and writing of utf-8 with marc::batch

2013-03-26 Thread Paul Hoffman

On Tue, Mar 26, 2013 at 04:22:03PM -0400, Eric Lease Morgan wrote:
> For the life of me I can't figure out how to do reading and writing of 
> UTF-8 with MARC::Batch.
> 
> I have a UTF-8 encoded file of MARC records. Dumping the records and 
> grepping for a particular string illustrates the validity:
> 
>   $ marcdump und.marc | grep Sainte-Face

What is marcdump?

>   245 00 _aAnnales de l'Archiconfrérie de la Sainte-Face
>   610 20 _aArchiconfrérie de la Sainte-Face
>   13000 records
>   $ 
> 
> I then run a Perl script that simply reads each record and dumps it to 
> STDOUT. Notice how I define both my input and output as UTF-8:

Try *not* calling binmode and see what happens.  Or just call 
binmode(MARC) without the ':utf8' layer.

>   245 00 _aAnnales de l'ArchiconfrÃ©rie de la Sainte-Face
>   610_aArchiconfrÃ©rie de la Sainte-Face
>   13000 records
>   $

This looks like double-encoding:

0000  6c 27 41 72 63 68 69 63  6f 6e 66 72 c3 83 c2 a9  |l'ArchiconfrÃ.©|
0010  72 69 65                                          |rie|

LATIN SMALL LETTER E WITH ACUTE is supposed to be c3 a9 (as it is in the 
first marcdump output) not c3 83 c2 a9.
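
When output has already been written in this state, the extra layer of
encoding can often be peeled off again. The following is a heuristic sketch
only, assuming the second encoding treated the bytes as ISO-8859-1 (which is
what Perl's ":utf8" layer does); undo_double_encoding is a hypothetical
helper name, and text that legitimately contains a sequence such as "Ã©"
would be altered by it:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: if $bytes looks like UTF-8 that was encoded twice,
# return the singly-encoded bytes; otherwise return the input unchanged.
sub undo_double_encoding {
    my ($bytes) = @_;
    my $copy  = $bytes;    # decode() may modify its source
    my $chars = eval { Encode::decode( 'UTF-8', $copy, Encode::FB_CROAK() ) };
    return $bytes unless defined $chars;             # not valid UTF-8 at all
    return $bytes if $chars =~ /[^\x00-\xFF]/;       # cannot be mis-read Latin-1
    return $bytes unless $chars =~ /[\x80-\xFF]/;    # plain ASCII, nothing to undo

    # map each character back to the byte it came from, then test those bytes
    my $inner = Encode::encode( 'ISO-8859-1', $chars );
    my $probe = $inner;
    my $ok = eval { Encode::decode( 'UTF-8', $probe, Encode::FB_CROAK() ); 1 };
    return $ok ? $inner : $bytes;
}

# made-up sample: the double-encoded word from the dump above
my $munged = "Archiconfr\xC3\x83\xC2\xA9rie";
print unpack( 'H*', $munged ), "\n";
print unpack( 'H*', undo_double_encoding($munged) ), "\n";
```

Correctly encoded input is returned unchanged, because its inner bytes (for
example a lone e9) do not form valid UTF-8.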

Paul.

-- 
Paul Hoffman
