> I have to admit my Perl skill is very limited, so this may be a dumb
> question, but I can't seem to find an answer.  When I use MARC::Batch
> to read records from our catalog (III) export file, I can't seem to
> find a way to skip an error record.  When I ran the following against
> an III export MARC file, it stopped at a record with an error.
> 
> utf8 "\xBC" does not map to Unicode at /usr/lib/perl/5.10/Encode.pm line 174.

I'm surprised that the error is being reported from inside the Encode
module.  Modules are usually written so that an error report points at
the line in the calling code that used the module's feature; when the
error is reported from the module's own internals instead, as here, it
is harder to tell exactly which line of your own script triggered it.
I guess the author of the Encode module really did not expect this to
happen.

> Ideally I would like to be able to log the error and move to the next
> record.

In general you can trap errors by using the "eval" construct:

eval {
  # ... code that might trigger the error in here ...
};              # the semicolon is needed: eval is an expression
warn $@ if $@;  # report any error from the eval block as a warning

See http://perldoc.perl.org/functions/eval.html

Put something like that around the part of your code that triggers the
error and you should get a bit further.
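For example, assuming a typical MARC::Batch loop (the file name here is
just a placeholder for your III export file), something like this
should log the error and carry on:

```perl
use strict;
use warnings;
use MARC::Batch;

# 'export.mrc' is a placeholder -- substitute your own export file
my $batch = MARC::Batch->new( 'USMARC', 'export.mrc' );
$batch->strict_off;    # don't die on structural warnings

my $count = 0;
while (1) {
    # trap any error thrown while decoding the next record
    my $record = eval { $batch->next };
    if ($@) {
        warn "Skipping bad record after record $count: $@";
        next;          # move on and try the next record
    }
    last unless defined $record;   # no more records in the file
    $count++;
    # ... process $record here ...
}
print "Read $count records\n";
```

Bear in mind that if the decode dies partway through a record, the
batch may not always resynchronise cleanly at the start of the next
one, so check that the records after a skip look sane.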

One thing to ask, of course, is why there is an error in the first
place!  It looks like the MARC record is not being converted for the
right character set.  I see you have set strict to be off for the batch.
We have a Millennium system here, and the internal coding is MARC8
rather than UTF8.  I've found that Innovative has a sort of hack to
allow arbitrary Unicode characters to be carried in the MARC record.  We
notice this particularly with records containing directional quotation
marks.  One of the effects is that byte values such as 0x1D can occur
mid-record.  The MARC::File::USMARC module assumes that 0x1D is the end
of record marker.  To get the module to split the records accurately I
had to modify the module as follows:

Change the lines in USMARC.pm that say
     local $/ = END_OF_RECORD;
     my $usmarc = <$fh>;

to instead say

######################################################################
# Altered by Matthew Phillips to cope with 0x1D within field values
######################################################################
#    local $/ = END_OF_RECORD;
#    my $usmarc = <$fh>;
    my $length;
    read($fh, $length, 5) || return;
    return unless $length >= 5;

    my $record;
    read($fh, $record, $length-5) || return;
    my $usmarc = $length.$record;
######################################################################
# End of alteration
######################################################################

You should then get all records being split at the right places.  The
alteration relies on the byte count at the start of the record (the
five-digit record length at the beginning of the MARC leader) being
accurate.  It works nicely for Innovative record output, but if you're
going to be reading records from other sources it may not help, as the
byte count can be unreliable.  I submitted the patch to the module
maintainer a few weeks ago and he was considering how to incorporate it
as an option, as it's not appropriate in all circumstances.

There may well still be character conversion issues, however, because
the MARC::Charset module does not know about Innovative's encoding, and
is slightly broken in other respects.  I have not written a patch for
this aspect yet.

Hope that helps a bit!

-- 
Matthew Phillips
Electronic Systems Librarian, Durham University
Durham University Library, Stockton Road, Durham, DH1 3LY
+44 (0)191 334 2941
