MARC::Record / MARC::File::XML bug when fields contain newlines?

arvinport...@lycos.com Wed, 11 Jan 2012 13:36:02 -0800

I've been converting MARC XML records into USMARC and recently had a slew of 
bad records which MarcEdit reported as having invalid leaders. After a few days 
of puzzling over this and blaming it all on Unicode, I noticed they were all 
records which contained newlines (0D 0A) in their datafields. As far as I know, 
newlines aren't illegal in USMARC, but when I replaced them with spaces, sure 
enough, the problem disappeared.


Test record:

<?xml version="1.0" encoding="utf-8"?>
<collection xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <leader>06965nam a2202005 u 4500</leader>
    <datafield tag="245" ind1="0" ind2="0">
      <subfield code="a">Theoretical and Technological 
Aspects of Crystal Growth</subfield>
    </datafield>
  </record>
</collection>

(If your mail viewer mangles lines: there is a hard return (0D 0A) after the 
word "Technological" in the 245.)

Here is my test program which illustrates the problem:

use strict;
use warnings;
use MARC::Batch;
use MARC::File::XML (BinaryEncoding => 'utf8', RecordFormat => 'UNIMARC');

# Output file for the converted record
open(my $marc_out, '>', 'test_out.marc')
    or die "Couldn't open test_out.marc for writing: $!\n";
binmode($marc_out, ':utf8');

# Read the first record from the XML file and write it out as USMARC
my $batch  = MARC::Batch->new('XML', 'test.xml');
my $record = $batch->next;
print {$marc_out} $record->as_usmarc;
close($marc_out);
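
A quick way to see what the "invalid leader" complaint refers to is to compare the record length declared in leader positions 00-04 with the number of bytes actually written. The sketch below is plain Perl with no MARC modules; `check_leader_length` is a name made up for this illustration, and the demo record is a toy string, not a valid MARC record. A mismatch would suggest that bytes were added after the length was computed (for example LF being written out as CR LF), but that is an assumption, not something confirmed in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: compare the length declared in the leader
# (first five characters) with the real byte length of the record.
sub check_leader_length {
    my ($raw) = @_;                      # raw bytes of a single record
    my $declared = substr($raw, 0, 5);
    my $actual   = length $raw;

    # The declared length must be exactly five digits
    return { ok => 0, declared => undef, actual => $actual }
        unless $declared =~ /^\d{5}$/;

    return {
        ok       => ($declared == $actual) ? 1 : 0,
        declared => $declared + 0,
        actual   => $actual,
    };
}

# Toy data: 19 filler bytes, one LF, field and record terminators
my $rest   = ('x' x 19) . "\n" . "\x1E\x1D";
my $record = sprintf('%05d', 5 + length($rest)) . $rest;

# Simulate one extra byte sneaking in: LF becomes CR LF
(my $damaged = $record) =~ s/\n/\r\n/;

for my $candidate ($record, $damaged) {
    my $result = check_leader_length($candidate);
    printf "declared=%d actual=%d ok=%d\n",
        $result->{declared}, $result->{actual}, $result->{ok};
}
```

To run the check against test_out.marc, the file would need to be opened with binmode and read with `$/` set to the record terminator (0x1D), so that each read returns one raw record.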

As I said, I don't think newlines are illegal in USMARC, so I rather suspect the 
problem is somewhere in MARC::Record. I took the easier route, though, and 
replaced them with spaces in the characters handler of MARC::File::SAX, which 
makes the problem go away:

sub characters {
    my ( $self, $chars ) = @_;
    if (
        ( exists $self->{ subcode } && $self->{ subcode } ne '')
        || ( $self->{ tag } && ( $self->{ tag } eq 'LDR' || $self->{ tag } < 10 ) )
    ) {
        $self->{ chars } .= $chars->{ Data };
        
        ## Added by me, 1/11/2011
        $self->{ chars } =~ s/\n/ /g;
        $self->{ chars } =~ s/ {2,}/ /g;
    } 
}
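
The same cleanup could also be done in the calling script instead of patching the module. The following is a minimal sketch of just the string handling, assuming the goal is the same as in the patch above: line breaks become spaces and doubled spaces are collapsed. `collapse_newlines` is a hypothetical helper, not part of MARC::Record or MARC::File::XML.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any MARC module)
sub collapse_newlines {
    my ($text) = @_;
    $text =~ s/[\r\n]+/ /g;   # any run of CR/LF becomes a single space
    $text =~ s/ {2,}/ /g;     # collapse the doubled spaces this can leave
    return $text;
}

# The 245$a from the test record, with its embedded line break
my $title = "Theoretical and Technological \nAspects of Crystal Growth";
print collapse_newlines($title), "\n";
```

Applying this to a parsed record would mean walking `$record->fields` and rewriting each subfield value; that part is left out so the sketch runs without the MARC modules installed.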

So is this a bug that can be officially fixed or am I overlooking something?

ActiveState perl 5.10, MARC::Record v.2.0.3, MARC::File::XML v. 0.93

Arvin
