Thanks, that does indeed do the trick.

>MARC::Record 2.0.0, the so-called Unicode version, introduced the problem you describe.

Good to know. I hadn't gleaned that fact from all the messages I'd read.

I have a second, related question: MARC::Record 2.0.0 and Encode 2.40 are now more sensitive to leader byte 9. That is, if leader byte 9 does not match the record's actual encoding, the program dies with a Unicode error. I deal with tens of thousands of records from a variety of sources, and we simply must live with these bad records. I know how to keep the program from dying on these records by redefining Encode::decode(), but that is a blanket solution that ignores all Encode errors. Is there a way to make the program ignore just the leader 9 mismatch errors (again bearing in mind that a batch will contain a mixture of MARC-8 and UTF-8 records)?
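
A leader 9 mismatch of this kind can be detected from the raw bytes before MARC::Record ever decodes them. The following is only a minimal sketch under my own assumptions, not a tested fix: the helper names is_valid_utf8() and fix_leader9() are hypothetical, and it covers only the case where leader 9 says 'a' (UTF-8) but the bytes are not well-formed UTF-8. A MARC-8 record whose bytes happen to form valid UTF-8 would slip through.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if the raw bytes are well-formed UTF-8.
sub is_valid_utf8 {
    my ($raw) = @_;
    my $copy = $raw;    # decode() with a CHECK value may modify its argument
    return eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Hypothetical helper: if leader byte 9 claims UTF-8 ('a') but the bytes are
# not valid UTF-8, blank byte 9 so the record is treated as MARC-8 instead.
sub fix_leader9 {
    my ($raw) = @_;
    if (   length($raw) >= 24
        && substr($raw, 9, 1) eq 'a'
        && !is_valid_utf8($raw) )
    {
        substr($raw, 9, 1, ' ');
    }
    return $raw;
}
```

Each raw record would be passed through fix_leader9() before it is parsed, so only records with this specific mismatch are altered and any other Encode error still surfaces.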

Sample of 5 records with incorrect leader 9:
http://www.mediafire.com/file/4wf5mpa9zba5195/badrecs_sample.zip
The records are fairly large, sorry. The error occurs in the first record, in the 505 field, at "sheet 17. Hsèuan-Ch'eng". The third character of the name is a MARC-8 umlaut, \xE8.

Sample program:

use MARC::Batch;
use bytes;

my $batch = MARC::Batch->new('USMARC', $ARGV[0]);
$batch->strict_off();
$batch->warnings_off();

# Copy every record in the batch to STDOUT unchanged.
while (my $record = $batch->next) {
    print $record->as_usmarc;
}

The later version of MARC::Record will die on the first record. The earlier version will process them all.
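
To keep one bad record from ending the whole run, the batch can also be read as raw records and each one processed inside its own eval. This is a rough sketch using only core Perl, and the helper name each_raw_record() is hypothetical. It relies on the MARC end-of-record byte, \x1D, as the input record separator.

```perl
use strict;
use warnings;

# Hypothetical helper: read raw MARC records one at a time (each ends with
# the end-of-record byte \x1D) and run a callback on each inside eval.
# Returns the 1-based positions of the records whose callback died.
sub each_raw_record {
    my ($path, $callback) = @_;
    my @failed;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/ = "\x1D";    # MARC end-of-record terminator
    my $n = 0;
    while (defined(my $raw = <$fh>)) {
        $n++;
        # A die inside the callback is trapped here, so the loop continues.
        eval { $callback->($raw); 1 } or push @failed, $n;
    }
    close $fh;
    return @failed;
}
```

The callback could, for instance, call MARC::Record->new_from_usmarc() on the raw bytes and print the result of as_usmarc. The returned list then shows which records were skipped, and no Encode error is silenced for the records that do parse.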

Al


At 10/12/2010, Leif Andersson wrote:
>This has nothing to do with Perl versions.
>
>MARC::Record 1.38 and earlier does not display this problem.
>MARC::Record 2.0.0, the so-called Unicode version, introduced the problem
>you describe.
>That is, when writing records, it produces an incorrect leader length and
>corrupted UTF-8.
>
>There are different ways to deal with this.
>I myself have changed one of the modules:
>
>MARC::File::USMARC
>It has a function called encode() around line 315.
>I have added a "use bytes;" just before the final return, like this:
>
>use bytes;
>return join("",$marc->leader, @$directory, END_OF_FIELD, @$fields,
>END_OF_RECORD);
>
>Editing a module's code directly like this is a definite no-no for many
>programmers. If you feel uncomfortable with it, there are other ways of
>doing the same thing.
>You could write a package:
>
>package MARC_Record_hack;
>use MARC::File::USMARC;
>no warnings 'redefine';
>sub MARC::File::USMARC::encode {
>    my $marc = shift;
>    $marc = shift if (ref($marc)||$marc) =~ /^MARC::File/;
>    my ($fields,$directory,$reclen,$baseaddress) =
>MARC::File::USMARC::_build_tag_directory($marc);
>    $marc->set_leader_lengths( $reclen, $baseaddress );
>    # Glomp it all together
>    use bytes;
>    return join("",$marc->leader, @$directory, "\x1E", @$fields, "\x1D");
>}
>use warnings;
>1;
>__END__
>
>With the inclusion of this package your original code should work fine, I'd
>guess.
>
>
>use MARC::Batch;
>use MARC_Record_hack;
>my $batch = new MARC::Batch('USMARC', $ARGV[0]);
>$batch->strict_off ();
>$batch->warnings_off ();
>#binmode( STDOUT, ':raw' );
>#binmode STDOUT;
>my $record = $batch->next;
>print $record->as_usmarc;
>
>
>As a habit I use
>binmode FH;
>when I write records to a file.
>It is not needed, but it keeps me from the temptation of making any other
>assumptions about character encodings.
>
>/Leif Andersson
>Stockholm University Library
>
>________________________________________
>From: Al [ra...@berkeley.edu]
>Sent: 12 October 2010 00:03
>To: perl4lib@perl.org
>Subject: MARC-perl: different versions yield different results
>
>Example marc record is here:
>http://www.mediafire.com/file/u5cxkrfwh9ew09z/example.zip
>
>When I process the record above in perl 5.8, MARC::Record version 1.38, and
>Encode.pm version 2.12, the record comes out fine.
>
>When I use perl 5.10, MARC::Record version 2.0.0, and Encode.pm 2.40 the
>record comes out corrupted and MARC::Record will no longer read the result.
>
>The problem is with a Unicode character (big surprise). The earlier version
>leaves the \xC3\xA1 character intact; the later version changes it to \xE1,
>which is invalid. I've read as many of the perl4lib messages on the subject
>of UTF-8 as I could but my eyes are spinning. I'm hoping by including a
>complete but simple perl program and making a MARC record available that
>somebody can explain to me in detail what is going on. My inclination is to
>simply revert to the earlier version of perl but perhaps if I really
>understood the issue that may not be necessary.
>
>Here is the test program I use:
>
>use MARC::Batch;
>my $batch = new MARC::Batch('USMARC', $ARGV[0]);
>$batch->strict_off ();
>$batch->warnings_off ();
>#binmode( STDOUT, ':utf8' );
>my $record = $batch->next;
>print $record->as_usmarc;
>
>Run the program on the record, then run it again on the output and the
>second time perl quits with an error:
>
>utf8 "\xE1" does not map to Unicode at Encode.pm line 174.
>
>That should not happen.
>
>Why the different behavior with the different versions? I can't see
>anything wrong with the original record - it's valid UTF8 as far as I can
>tell. Leader byte 9 is correctly set to 'a'. Uncommenting the binmode line
>seems to work - the character is output unchanged as is supposed to happen.
>The problem is my record batches are a mixture of UTF8 and MARC8 and
>explicitly setting binmode screws things up. I need a solution that
>transparently handles a mix of record encodings.
>
>I rather suspect the problem is with Encode.pm and not MARC perl but I
>can't be sure. It also may be due to the way perl handles IO between
>version 5.8 and 5.10. BTW the problem happens on Windows and Unix.
>
>Thanks for any advice you can give me,
>
>Al
