Re: Opening & writing to UTF-8 files; copyright symbol again

Jon Gorman Fri, 13 Nov 2015 12:53:03 -0800

I'll ask about the easiest solution first ;).

Are you sure the file 4788022.bib is in Unicode and not MARC-8? If it is in
Unicode, is leader position 09 set to 'a'?
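
One quick way to check is to look at the raw bytes without going through MARC::Record at all. This is only a minimal sketch, assuming the file holds ISO 2709 records that each end with the record terminator 0x1D. The helper name leader_encodings is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: report what leader position 09 claims for each
# record in a raw MARC (ISO 2709) file. It only reads the flag; it does
# not verify that the data actually matches it.
sub leader_encodings {
    my ($path) = @_;
    open(my $in, '<:raw', $path) or die "Cannot open $path: $!";
    local $/ = "\x1D";                     # MARC record terminator
    my @claims;
    while (my $raw = <$in>) {
        next if length($raw) < 24;         # too short to hold a leader
        my $flag = substr($raw, 9, 1);     # leader position 09
        push @claims, $flag eq 'a' ? 'UTF-8' : 'MARC-8';
    }
    close($in);
    return @claims;
}

# Usage: perl check_leader.pl 4788022.bib
if (@ARGV) {
    print "$_\n" for leader_encodings($ARGV[0]);
}
```

If this prints MARC-8 for a record you believe is UTF-8, the flag and the data disagree, and that would be the first thing to sort out.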


I'm a bit rusty on the as_usmarc() call as well, so you might want to check
the docs to make sure it doesn't do something like convert the record to MARC-8.

Finally, another thing I'd do is avoid the older conventions for opening
file handles. (If you have time, Effective Perl, second edition, is an
excellent book if you want to do more Perl. Nice and short with a good mix
of techniques.)
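
For what it's worth, here is a minimal sketch of the newer style: a lexical filehandle, three-argument open, and an explicit encoding layer. It uses only core Perl, and the helper names write_string and hex_bytes are made up for illustration. It also shows one possible explanation for C2 A9 turning into A9, on the assumption (not confirmed in this thread) that as_usmarc() hands back a decoded character string for a UTF-8 record: Perl writes the character U+00A9 as the single byte A9 when the handle has no encoding layer.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $string to $path through the given I/O layer.
# An empty $layer means a plain '>' with no encoding layer at all.
sub write_string {
    my ($path, $layer, $string) = @_;
    open(my $out, ">$layer", $path) or die "Cannot open $path: $!";
    print {$out} $string;
    close($out) or die "Cannot close $path: $!";
}

# Hypothetical helper: read the file back as raw bytes, returned as hex.
sub hex_bytes {
    my ($path) = @_;
    open(my $in, '<:raw', $path) or die "Cannot open $path: $!";
    local $/;                              # slurp the whole file
    my $bytes = <$in>;
    close($in);
    return uc(unpack('H*', $bytes));
}

my ($tmp_fh, $tmp) = tempfile(UNLINK => 1);
close($tmp_fh);

my $copyright = "\x{A9}";                  # one character, U+00A9

write_string($tmp, '', $copyright);        # no layer
print "no layer:         ", hex_bytes($tmp), "\n";   # A9

write_string($tmp, ':encoding(UTF-8)', $copyright);  # explicit UTF-8
print ":encoding(UTF-8): ", hex_bytes($tmp), "\n";   # C2A9
```

If that assumption holds, it would also fit the Hebrew record coming through unchanged: when a string contains characters above 0xFF, Perl falls back to writing UTF-8 (with a "Wide character" warning) even without a layer.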

So instead of



encoding()

A method for getting/setting the encoding for a record. The encoding for a
record is determined by position 09 in the leader, which is blank for
MARC-8 encoding, and 'a' for UCS/Unicode. encoding() will return a string,
either 'MARC-8' or 'UTF-8' appropriately.

If you want to set the encoding for a MARC::Record object you can use the
string values:

    $record->encoding( 'UTF-8' );

NOTE: MARC::Record objects created from scratch have a default encoding
of MARC-8, which has been the standard for years...but many online catalogs
and record vendors are migrating to UTF-8.

WARNING: you should be sure your record really does contain valid UTF-8
data when you manually set the encoding.
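
One way to act on that warning before calling encoding('UTF-8') is to test whether the bytes decode cleanly. This is a minimal sketch using the core Encode module. The helper name is_valid_utf8 is made up for illustration and is not part of MARC::Record.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if $bytes is a well-formed UTF-8 byte sequence.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # decode() from modifying $bytes in place.
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

print is_valid_utf8("\xC2\xA9") ? "valid\n" : "invalid\n";   # C2 A9: valid
print is_valid_utf8("\xA9")     ? "valid\n" : "invalid\n";   # lone A9: invalid
```

Note that a clean decode only shows the bytes are well-formed UTF-8. Plain ASCII passes too, so it can't prove the record was not meant as MARC-8.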






On Fri, Nov 13, 2015 at 2:01 PM, Highsmith, Anne L <hism...@library.tamu.edu> wrote:

> This is related to my previous post (9/17/2015) about deleting 035 fields
> after RDA-ification. Jon Gorman solved that one for me by pointing out that
> I probably had a problem with my perl libraries.
>
>
>
> But now, instead of creating the record from the database and writing it
> back to the database, I am reading from a file exported from my database,
> which is UTF-8. Specifically, the blasted copyright symbol again. As stored
> in the database, the copyright symbol is encoded as C2 A9, which if I read
> the tables correctly, is the correct UTF-8 encoding for copyright. But when
> I read the record from a file and write it back to the file after deleting
> the problematic 035, the encoding for the copyright symbol has been turned
> into A9.
>
>
>
> This “transformation” happens both when running the perl program on my pc
> and on the unix server. Interestingly, complicated Unicode seems to be
> okay. I took a record with Hebrew vernacular characters and edited it using
> my program, then ran the source record and target record through xxd. I
> then diffed the files; it showed no difference. But the before and after of
> the record that has the copyright symbol munges the copyright by stripping
> the C2.
>
>
>
> Here’s the program. If anybody can tell me what I’m doing wrong I’d really
> appreciate it.
>
>
> ----------------------------------------------------------------------------------------------------------
>
> use strict;
>
> use warnings;
>
> use MARC::Record;
>
> use MARC::Batch;
>
> my $infile='4788022.bib';
>
> my $batch = MARC::Batch->new('USMARC',"$infile");
>
> my $outfile='4788022.edited.bib';
>
> open(OUTPUT, ">$outfile");
>
>
>
> while (my $record = $batch->next) {
>
>      my $f001 = $record->field('001');
>
>      my $bib_id = $f001->as_string();
>
>
>
>      my @a035 = $record->field('035');
>
>      foreach my $f035 (@a035) {
>
>            if (my $f035a = $f035->subfield('a')) {
>
>                 if ($f035a eq $bib_id) {
>
>                      $record->delete_field($f035);
>
>                 }
>
>            }
>
>      }
>
>      print OUTPUT $record->as_usmarc();
>
> }
>
>
>
>
>
>
>
> Anne L. Highsmith
>
> Director, Consortia Systems
>
> TAMU Libraries
>
> 5000 TAMU
>
> College Station, TX   77843-5000
>
> 979 862 4234
>
> hism...@tamu.edu
>
