Re: reading and writing of utf-8 with marc::batch [resolved; gigo]
Thank you for all the input. I think I have resolved my particular issue. Battle won. War still raging.

Using the script suggested by Galen as a starting point, I wrote the following hack, which prints the number of each MARC record containing non-UTF-8 characters. The script output nothing; all the data in all of my records was encoded as UTF-8:

    #!/usr/bin/perl

    # require
    use strict;
    use Encode;

    # initialize
    binmode STDIN, ":bytes";
    $/ = "\035";    # MARC record terminator
    my $i = 0;

    # read STDIN
    while ( <> ) {

        # increment
        $i++;

        # check validity; decode() croaks on malformed UTF-8
        my $bytes = $_;
        eval { my $utf8str = Encode::decode( 'UTF-8', $bytes, Encode::FB_CROAK ); };

        # check for error
        if ( $@ ) { print "Record $i contains non-UTF-8 characters\n"; }

    }

    # done
    exit;

Since all of the data in all of my records was UTF-8, the leader of every record needs to have a value of "a" in position #9. So I wrote the following hack (circumventing MARC::Batch):

    #!/usr/bin/perl

    # require
    use strict;

    # initialize
    binmode STDIN, ":bytes";
    binmode STDOUT, ":bytes";
    $/ = "\035";    # MARC record terminator

    # loop through the input
    while ( <> ) {

        # do the work and output
        substr( $_, 9, 1 ) = "a";
        print $_;

    }

    # done
    exit;

I then fed the output of my fix routine to my indexing routine, and all of my problems seemed to go away.

GIGO? I'm still not sure, but I think deep within MARC::Batch some sort of encoding is observed, honored, and output. And when the denoted encoding is not true and things like binmode( FILE, ":utf8" ) get called, output gets munged. Again, I'm not sure. It is almost exhausting.

--
Eric Morgan
University of Notre Dame
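The two steps described in this message (validate, then set Leader/09) can be combined so that the flag is only set on records that really do decode as UTF-8. The following is a minimal sketch, not the script Eric ran: the helper name flag_as_utf8 and the sample records are hypothetical, and it works on raw record bytes without MARC::Batch.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return a copy of a raw MARC record with Leader/09
# set to "a", but only when the whole record is valid UTF-8.
# Returns undef when the record does not decode cleanly.
sub flag_as_utf8 {
    my ($record) = @_;
    my $copy = $record;    # decode() may modify its argument when CHECK is set
    my $ok = eval { Encode::decode( 'UTF-8', $copy, Encode::FB_CROAK() ); 1 };
    return undef unless $ok;
    substr( $record, 9, 1 ) = 'a';    # same length, so the directory stays valid
    return $record;
}

# Made-up sample data: a 24-byte leader followed by a few bytes of "field" data.
my $leader = '00000nam  2200000   4500';
my $utf8   = $leader . "caf\xC3\xA9\x1E\x1D";    # e-acute encoded as UTF-8
my $latin1 = $leader . "caf\xE9\x1E\x1D";        # e-acute as a single Latin-1 byte

for my $record ( $utf8, $latin1 ) {
    my $fixed = flag_as_utf8($record);
    print defined $fixed ? "flagged: " . substr( $fixed, 0, 24 ) . "\n"
                         : "left alone: not valid UTF-8\n";
}
```

Records for which the helper returns undef are the ones worth inspecting by hand, since setting the flag on them would make the leader claim something that is not true.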
Re: reading and writing of utf-8 with marc::batch [double encoding]
Eric, > How can I figure out whether or not a MARC record contains ONLY characters > from the UTF-8 character set? You can use a regex to check if a string is utf-8. There are various examples floating around the internet. An example is the one here: http://www.w3.org/International/questions/qa-forms-utf-8 You'll need to add the MARC control characters ^_, ^^, and ^] to the ASCII part of the expression in the above page. (I think the w3c example is aimed at XML1.0 in which the MARC control characters are not allowed.) Ashley. -- Ashley Sanders a.sand...@manchester.ac.uk http://copac.ac.uk -- A Mimas service funded by JISC at the University of Manchester
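As a rough illustration of the regex approach, here is a sketch based on the pattern published on the W3C page linked above, with the three MARC delimiters (0x1D, 0x1E, 0x1F) added to the ASCII alternative as Ashley suggests. The names $UTF8_MARC and looks_like_utf8 are made up for this example, and it assumes the input is a raw byte string that has not been decoded.

```perl
use strict;
use warnings;

# W3C-style UTF-8 validity pattern, extended with the MARC delimiters.
my $UTF8_MARC = qr/\A(?:
      [\x09\x0A\x0D\x1D\x1E\x1F\x20-\x7E]   # ASCII plus MARC control characters
    | [\xC2-\xDF][\x80-\xBF]                # non-overlong 2-byte sequences
    | \xE0[\xA0-\xBF][\x80-\xBF]            # 3-byte, excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}     # straight 3-byte sequences
    | \xED[\x80-\x9F][\x80-\xBF]            # 3-byte, excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}         # planes 1 to 3
    | [\xF1-\xF3][\x80-\xBF]{3}             # planes 4 to 15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}         # plane 16
)*\z/x;

# Hypothetical helper: true when every byte of the string fits the pattern.
sub looks_like_utf8 {
    my ($bytes) = @_;
    return $bytes =~ $UTF8_MARC ? 1 : 0;
}

print looks_like_utf8("\x1Facaf\xC3\xA9\x1E\x1D") ? "valid\n" : "invalid\n";
print looks_like_utf8("\x1Facaf\xE9\x1E\x1D")     ? "valid\n" : "invalid\n";
```

A record made only of ASCII bytes also passes this test, so a match says the bytes are consistent with UTF-8, not that the record was necessarily produced as UTF-8.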
Re: reading and writing of utf-8 with marc::batch [double encoding]
Hi,

On Wed, Mar 27, 2013 at 2:11 PM, Eric Lease Morgan wrote:

> Put another way, how can I determine whether or not position #9 of a given
> MARC leader is accurate? If position #9 is an "a", then how can I read the
> balance of the record to determine whether or not all the characters really
> and truly are UTF-8 encoded?

The following program will read a file of MARC records from standard input and classify each as either being valid UTF-8 or not.

___START
#!/usr/bin/perl

use Encode;

binmode STDIN, ':bytes';
$/ = "\035";    # MARC record terminator
my $i = 0;
while (<>) {
    $i++;
    my $bytes = $_;
    # decode() croaks on the first malformed byte sequence
    eval { my $utf8str = Encode::decode('UTF-8', $bytes, Encode::FB_CROAK); };
    if ($@) {
        print "Record $i definitely not valid UTF-8\n";
    } else {
        print "Record $i is valid UTF-8\n";
    }
}
___END

Regards,

Galen
--
Galen Charlton
gmcha...@gmail.com
Re: reading and writing of utf-8 with marc::batch [double encoding]
I use MarcEdit to view records and check if the mnemonic form of a diacritic (e.g. {eacute}) appears or not and what the LDR/09 value is. That's the best way I've come up with so far. MarcEdit is pretty good at guessing what the character encoding is without relying on the LDR/09 value. I think there are some perl modules you could use that "guess" what the encoding is of a character but I've never used them. I'm interested in finding out other methods (preferably automated) for detecting wrong or mixed character encodings in a MARC record. Shelley - Original Message - > From: "Eric Lease Morgan" > To: perl4lib@perl.org > Sent: Wednesday, March 27, 2013 2:11:26 PM > Subject: Re: reading and writing of utf-8 with marc::batch [double encoding] > > > On Mar 27, 2013, at 4:59 PM, Eric Lease Morgan > wrote: > > > When it calls as_usmarc, I think MARC::Batch tries to honor the > > value set in position #9 of the leader. In other words, if the > > leader is empty, then it tries to output records as MARC-8, and > > when the leader is a value of "a", it tries to encode the data as > > UTF-8. > > How can I figure out whether or not a MARC record contains ONLY > characters from the UTF-8 character set? > > Put another way, how can I determine whether or not position #9 of a > given MARC leader is accurate? If position #9 is an "a", then how > can I read the balance of the record to determine whether or not all > the characters really and truly are UTF-8 encoded? > > -- > Eric "This Is Almost Too Much For Me" Morgan > >
Re: reading and writing of utf-8 with marc::batch [double encoding]
On Mar 27, 2013, at 4:59 PM, Eric Lease Morgan wrote: > When it calls as_usmarc, I think MARC::Batch tries to honor the value set in > position #9 of the leader. In other words, if the leader is empty, then it > tries to output records as MARC-8, and when the leader is a value of "a", it > tries to encode the data as UTF-8. How can I figure out whether or not a MARC record contains ONLY characters from the UTF-8 character set? Put another way, how can I determine whether or not position #9 of a given MARC leader is accurate? If position #9 is an "a", then how can I read the balance of the record to determine whether or not all the characters really and truly are UTF-8 encoded? -- Eric "This Is Almost Too Much For Me" Morgan
Re: reading and writing of utf-8 with marc::batch [double encoding]
On Mar 27, 2013, at 2:20 PM, Eric Lease Morgan wrote:

> A number of people have alluded to the problem of double encoding, and I'm
> beginning to think this is true.

When it calls as_usmarc, I think MARC::Batch tries to honor the value set in position #9 of the leader. In other words, if position #9 is blank, then it tries to output records as MARC-8, and when position #9 has a value of "a", it tries to encode the data as UTF-8. If I employ binmode( OUTFILE, ":utf8" ), and the output is already UTF-8, then double encoding happens.

To test this theory, I fixed a number of records in my batch. Specifically, I inserted the letter "a" in position #9 of the leader. I then ran my processing file WITHOUT the employment of binmode, and my output was correct. For example, look at all the glorious characters in the following URL:

http://www.catholicresearch.net/vufind/Record/undmarc_001906501

--
Eric Lease Morgan
Hesburgh Libraries
University of Notre Dame
574/631-8604
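The binmode part of this theory can be checked in isolation. The sketch below does not use MARC::Batch at all, so it says nothing about what as_usmarc returns in any particular version; it only shows what a ":utf8" output layer does to a string that already holds UTF-8 encoded bytes. The helper write_with_layer is hypothetical and writes to an in-memory filehandle.

```perl
use strict;
use warnings;

# Bytes that are already UTF-8: e-acute is the two bytes c3 a9.
my $bytes = "Archiconfr\xC3\xA9rie";

# Hypothetical helper: print $data through the given PerlIO layer
# into an in-memory file and return what was actually written.
sub write_with_layer {
    my ( $data, $layer ) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer or die "cannot open in-memory handle: $!";
    print {$fh} $data;
    close $fh;
    return $buffer;
}

my $raw  = write_with_layer( $bytes, ':raw' );
my $utf8 = write_with_layer( $bytes, ':utf8' );

printf "%-6s %s\n", ':raw',  unpack( 'H*', $raw );
printf "%-6s %s\n", ':utf8', unpack( 'H*', $utf8 );
```

With ":raw" the bytes pass through untouched. With ":utf8" each of the two bytes is treated as a separate Latin-1 character and encoded again, so c3 a9 comes out as c3 83 c2 a9, which is the pattern seen in the munged output.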
Re: reading and writing of utf-8 with marc::batch [double encoding]
Hi,

On Wed, Mar 27, 2013 at 11:20 AM, Eric Lease Morgan wrote:

> I have isolated a number of problem records. They all contain diacritics,
> but they do not have an "a" in position #9 of the leader --
> http://dh.crc.nd.edu/tmp/original.marc Can someone verify that the file
> contains UTF-8 characters for me?

I've eyeballed it and confirm that the encoding of that file is UTF-8.

> For these same records I have also added an "a" in position #9 and created
> a similar file -- http://dh.crc.nd.edu/tmp/fixed.marc

I've looked this over as well.

> Is it true that original.marc is not denoted correctly, but fixed.marc is
> denoted correctly?

Yes. The Leader/09 must be set to 'a' if the character encoding in use is UTF-8.

Regards,

Galen
--
Galen Charlton
gmcha...@gmail.com
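The check Galen did by eye can be approximated in code by comparing what Leader/09 claims with what the bytes actually are. This is a sketch under stated assumptions: audit_leader and its return labels are invented for this example, the record is passed in as raw bytes, and a record that fails UTF-8 validation is only presumed to be MARC-8 (nothing here verifies that).

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: classify a raw MARC record by comparing Leader/09
# with the result of a strict UTF-8 decode of the whole record.
sub audit_leader {
    my ($record) = @_;
    my $flag  = substr( $record, 9, 1 );
    my $copy  = $record;    # decode() may modify its argument when CHECK is set
    my $valid = eval { Encode::decode( 'UTF-8', $copy, Encode::FB_CROAK() ); 1 } ? 1 : 0;
    my $ascii = $record !~ /[\x80-\xFF]/ ? 1 : 0;

    return 'ok'                  if $flag eq 'a' && $valid;
    return 'flagged-but-invalid' if $flag eq 'a';
    return 'ascii-only'          if $ascii;    # no high bytes, nothing to disagree about
    return 'unflagged-utf8'      if $valid;    # the situation in original.marc
    return 'unflagged-non-utf8';               # possibly MARC-8; not verified here
}

# Made-up sample: blank Leader/09 but UTF-8 data, like the problem records.
my $sample = '00000nam  2200000   4500' . "Archiconfr\xC3\xA9rie\x1E\x1D";
print audit_leader($sample), "\n";
```

Records reported as unflagged-utf8 are candidates for having an "a" written into position #9; records reported as flagged-but-invalid need a closer look before anything is changed.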
Re: reading and writing of utf-8 with marc::batch [double encoding]
A number of people have alluded to the problem of double encoding, and I'm beginning to think this is true. I have isolated a number of problem records. They all contain diacritics, but they do not have an "a" in position #9 of the leader -- http://dh.crc.nd.edu/tmp/original.marc Can someone verify that the file contains UTF-8 characters for me? For these same records I have also added an "a" in position #9 and created a similar file -- http://dh.crc.nd.edu/tmp/fixed.marc Is it true that original.marc is not denoted correctly, but fixed.marc is denoted correctly? -- Eric Morgan
Re: reading and writing of utf-8 with marc::batch
Whenever I see characters like Ã©, I consult this website http://www.i18nqa.com/debug/bug-utf-8-latin1.html to help me figure out what's going on. You might find it helpful too.

Shelley

--
Shelley Doljack
E-Resources Metadata Librarian
Metadata Department
Stanford University Libraries
sdolj...@stanford.edu
650-725-0167
Re: reading and writing of utf-8 with marc::batch [terminal]
Hi Eric, On Wed, Mar 27, 2013 at 10:26 AM, Eric Lease Morgan wrote: > While I'm not positive my terminal is doing UTF-8, I think it is. When I > dump in the beginning the output to the terminal is correct. After I run my > script the output to the same terminal is incorrect. > Would you be willing to put up a link to your MARC file? I'm willing to take a quick look to see if I can reproduce the problem you're seeing. Regards, Galen -- Galen Charlton gmcha...@gmail.com
Re: reading and writing of utf-8 with marc::batch [terminal]
On Mar 26, 2013, at 5:57 PM, Leif Andersson wrote: > my first guess would be your terminal is not utf8. While I'm not positive my terminal is doing UTF-8, I think it is. When I dump in the beginning the output to the terminal is correct. After I run my script the output to the same terminal is incorrect. -- Eric Lease Morgan
Re: reading and writing of utf-8 with marc::batch
Hi,

On Wed, Mar 27, 2013 at 7:01 AM, Jon Gorman wrote:

> One piece of advice is not to trust the terminal directly but pipe
> into xxd. (And if possible, just try transforming the offending
> record). Or use yaz-marcdump -v, which will also give the hex if I
> remember correctly. (If it's c3 a9 in both cases, you know the
> terminal is at fault)

Another trick is to pipe the output through less with the LESSCHARSET environment variable set to 'ascii'. Bytes whose value is less than 32 or greater than 127 will be displayed as reverse-video hexadecimal numbers, e.g.,

Garci<81>a Ma<81>rquez, Gabriel,

Regards,

Galen
--
Galen Charlton
gmcha...@gmail.com
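A similar byte-level view can be produced from Perl itself, which avoids any dependence on the terminal. This is a small sketch rather than a replacement for xxd or less; the helper name show_bytes is made up, and it simply prints printable ASCII as-is and every other byte as a two-digit hex value in angle brackets.

```perl
use strict;
use warnings;

# Hypothetical helper: leave printable ASCII alone and show every
# other byte as <hh>, so the terminal's own encoding never matters.
sub show_bytes {
    my ($bytes) = @_;
    ( my $out = $bytes ) =~ s/([^\x20-\x7E])/sprintf( '<%02x>', ord $1 )/ge;
    return $out;
}

print show_bytes("Archiconfr\xC3\xA9rie"), "\n";            # Archiconfr<c3><a9>rie
print show_bytes("Archiconfr\xC3\x83\xC2\xA9rie"), "\n";    # Archiconfr<c3><83><c2><a9>rie
```

Running the same field from the input file and from the output file through a helper like this shows at a glance whether the bytes changed or only their display did.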
Re: reading and writing of utf-8 with marc::batch
Ok, I can't claim to be an expert, but from my own experience, I'd say Paul is very likely right about double-encoding occurring. However, the question ends up being where that happens, and in this case I suspect how MARC::Batch behaves could depend heavily on what version of Perl you're running and what version of MARC::Batch you're running. Knowing those versions might help too (I'd try to be on a later version of Perl and the latest MARC::Batch).

It also depends on how you're generating the MARC records, which isn't really clear to me. It could also be the leaders or the terminal, as others have suggested.

One piece of advice is not to trust the terminal directly but pipe into xxd. (And if possible, just try transforming the offending record.) Or use yaz-marcdump -v, which will also give the hex if I remember correctly. (If it's c3 a9 in both cases, you know the terminal is at fault.) Then try doing that without the binmode, with binmode :raw, etc.

Jon Gorman
RE: reading and writing of utf-8 with marc::batch
Eric--

I'm with Leif. The output you got looks like utf-8 displayed on a terminal that doesn't support it. Whether you need to fix the terminal display is another matter--I've never felt compelled to do so.

Anyway, I think you can now sign yourself "Eric Did-it-right-the-first-time Morgan!"

Mike
Re: reading and writing of utf-8 with marc::batch
Hi Eric,

my first guess would be your terminal is not utf8. If you comment out

    #binmode( STDOUT, ":utf8" );

and that does the trick, then you can start looking for how to change your terminal settings. (And that can sometimes be a rather frustrating task, I'm afraid)

/Leif Andersson
Stockholm UL

From: Eric Lease Morgan [emor...@nd.edu]
Sent: 26 March 2013 21:22
To: perl4lib@perl.org
Subject: reading and writing of utf-8 with marc::batch

For the life of me I can't figure out how to do reading and writing of UTF-8 with MARC::Batch.

I have a UTF-8 encoded file of MARC records. Dumping the records and greping for a particular string illustrates the validity:

    $ marcdump und.marc | grep Sainte-Face
    und.marc
    1000 records
    2000 records
    3000 records
    4000 records
    5000 records
    6000 records
    7000 records
    8000 records
    9000 records
    10000 records
    11000 records
    12000 records
    245 00 _aAnnales de l'Archiconfrérie de la Sainte-Face
    610 20 _aArchiconfrérie de la Sainte-Face
    13000 records
    $

I then run a Perl script that simply reads each record and dumps it to STDOUT. Notice how I define both my input and output as UTF-8:

    #!/shared/perl/current/bin/perl

    # configure
    use constant MARC => './und.marc';

    # require
    use strict;
    use MARC::Batch;

    # initialize
    binmode ( MARC, ":utf8" );
    my $batch = MARC::Batch->new( 'USMARC', MARC );
    $batch->strict_off;
    $batch->warnings_off;
    binmode( STDOUT, ":utf8" );

    # read & write
    while ( my $marc = $batch->next ) { print $marc->as_usmarc }

    # done
    exit;

But my output is munged:

    $ ./marc.pl > und.mrc
    $ marcdump und.mrc | grep Sainte-Face
    und.mrc
    1000 records
    2000 records
    3000 records
    4000 records
    5000 records
    6000 records
    7000 records
    8000 records
    9000 records
    10000 records
    11000 records
    12000 records
    245 00 _aAnnales de l'ArchiconfrÃ©rie de la Sainte-Face
    610_aArchiconfrÃ©rie de la Sainte-Face
    13000 records
    $

What am I doing wrong!?

--
Eric Lease Morgan
University of Notre Dame
574/631-8604
Re: reading and writing of utf-8 with marc::batch
Do your records have the utf8 encoding byte set in the LDR? (Byte 9 should be 'a' for utf8).

-Tim

Timothy Prettyman
University of Michigan Library/LIT
Re: reading and writing of utf-8 with marc::batch
On Tue, Mar 26, 2013 at 04:22:03PM -0400, Eric Lease Morgan wrote:

> For the life of me I can't figure out how to do reading and writing of
> UTF-8 with MARC::Batch.
>
> I have a UTF-8 encoded file of MARC records. Dumping the records and
> greping for a particular string illustrates the validity:
>
> $ marcdump und.marc | grep Sainte-Face

What is marcdump?

> 245 00 _aAnnales de l'Archiconfrérie de la Sainte-Face
> 610 20 _aArchiconfrérie de la Sainte-Face
> 13000 records
> $
>
> I then run a Perl script that simply reads each record and dumps it to
> STDOUT. Notice how I define both my input and output as UTF-8:

Try *not* calling binmode and see what happens. Or just call binmode(MARC) without the ':utf8' layer.

> 245 00 _aAnnales de l'ArchiconfrÃ©rie de la Sainte-Face
> 610_aArchiconfrÃ©rie de la Sainte-Face
> 13000 records
> $

This looks like double-encoding:

    0000  6c 27 41 72 63 68 69 63 6f 6e 66 72 c3 83 c2 a9  |l'ArchiconfrÃ.©|
    0010  72 69 65                                         |rie|

LATIN SMALL LETTER E WITH ACUTE is supposed to be c3 a9 (as it is in the first marcdump output) not c3 83 c2 a9.

Paul.
--
Paul Hoffman
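The byte sequences Paul quotes can be reproduced, and one round of double encoding reversed, with the core Encode module. What follows is a heuristic sketch only: the helper undo_double_encoding is hypothetical, it assumes the inner layer was treated as Latin-1, and it would also "repair" text that legitimately contains a sequence such as Ã©. It is meant for individual field values; applying it to a whole raw record would change byte lengths and invalidate the directory.

```perl
use strict;
use warnings;
use Encode ();

# One correct encoding step, then the accidental second one.
my $once  = Encode::encode( 'UTF-8', "\x{E9}" );    # c3 a9
my $twice = Encode::encode( 'UTF-8', $once );       # bytes read as Latin-1: c3 83 c2 a9

# Hypothetical helper: undo one round of double encoding when the result
# is itself valid UTF-8; otherwise return the input unchanged.
sub undo_double_encoding {
    my ($bytes) = @_;
    my $copy  = $bytes;    # decode() may modify its argument when CHECK is set
    my $chars = eval { Encode::decode( 'UTF-8', $copy, Encode::FB_CROAK() ) };
    return $bytes unless defined $chars;          # not valid UTF-8 at all
    return $bytes if $chars =~ /[^\x00-\xFF]/;    # cannot be bytes read as Latin-1
    my $inner = Encode::encode( 'ISO-8859-1', $chars );
    return $bytes unless $inner =~ /[\x80-\xFF]/;    # pure ASCII, nothing to undo
    my $check = $inner;
    my $ok = eval { Encode::decode( 'UTF-8', $check, Encode::FB_CROAK() ); 1 };
    return $ok ? $inner : $bytes;
}

printf "once:     %s\n", unpack( 'H*', $once );
printf "twice:    %s\n", unpack( 'H*', $twice );
printf "repaired: %s\n", unpack( 'H*', undo_double_encoding($twice) );
```

A correctly encoded value is returned unchanged, because stripping one layer from c3 a9 leaves the single byte e9, which is not valid UTF-8 on its own.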