> However, combining Jon Gorman's recommendation with some Googling, I get:
>
> my $outfile='4788022.edited.bib';
> open (my $output_marc, '>', $outfile) or die "Couldn't open file $!" ;
> binmode($output_marc, ':utf8');
>
> The open statement may not be quite correct, as I am not familiar with the
> more current techniques for opening file handles that John mentioned.
> However, when I use those instructions to open the output file rather than what
> I had before, the copyright symbol does indeed come across as C2 A9 as it was
> in the original record. I didn't want to use the utf8, because I've tried that
> before and ended up with double-encoding (and a real mess). But I'll continue
> testing.
I think I understand how your original problem came about, but I may not be
able to explain it! It is important to understand that inside Perl a string
can be encoded in one of two ways:
1) stored in UTF-8, in which case all ASCII-range characters (roughly space,
A-Z, a-z, 0-9 and most of the punctuation you see on a keyboard) will be stored
in a single byte per character, and other characters will be stored in 2, 3, or
4 bytes
2) stored in an eight-bit character set such as ISO Latin 1. In this case
every character is stored as a single byte, but characters outside the Western
European repertoire are unavailable.
Perl prefers to store strings in the second form, as it saves memory and
processing time, and it does this in a way which is transparent to the user.
The string "abc" will be in the second form. If you append a copyright symbol
it will still be in the second form, as that symbol is present in ISO Latin 1.
But if you append a w-circumflex (as used in Welsh, and not available in ISO
Latin 1) or any Chinese, Greek or Cyrillic character, then the string will be
re-encoded in UTF-8 and Perl will flag it to remember that is how it has been
stored. You as a user do not (generally) need to worry.
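
If you want to watch this happen, here is a minimal sketch. It relies on the built-in utf8::is_utf8() function, which reports the internal flag; the helper name stored_as_utf8 is just something I made up for the illustration.

```perl
use strict;
use warnings;

# Report whether Perl is holding a string internally as UTF-8 (1) or as
# single bytes (0). The helper name is made up for this illustration.
sub stored_as_utf8 { return utf8::is_utf8($_[0]) ? 1 : 0 }

my $s = "abc";                         # ASCII only
my $plain = stored_as_utf8($s);

$s .= "\x{A9}";                        # copyright symbol, present in ISO Latin 1
my $with_copyright = stored_as_utf8($s);

$s .= "\x{175}";                       # w-circumflex, not in ISO Latin 1
my $with_wcirc = stored_as_utf8($s);

# The length is counted in characters whichever form is in use.
printf "%d %d %d, length %d\n",
    $plain, $with_copyright, $with_wcirc, length $s;
```

The flag only flips on the last append, and length() gives 5 throughout, which is why you normally do not need to care which form is in use.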
The complication is what to do when reading stuff from files or writing them
out again, because then Perl has to decide how to represent stuff for the
outside world. To be successful, you have to tell Perl what encoding is used
for anything you are reading in, so that it can be stored appropriately. If
you read in a copyright symbol from a UTF-8 encoded file but fail to tell Perl
it was in UTF-8, Perl will think it is two characters, C2 followed by A9. Now A9
happens to be the copyright symbol in ISO Latin 1, but C2 is A-circumflex. If
you write it out again, Perl will operate in ISO Latin 1 unless instructed
otherwise, and you will get C2 A9 in the file, which is probably fine, but Perl
did not know that it was meant to be a single character so processing you might
have done, like regular expression matches and finding the length of the
string, would not have worked as expected.
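
A small sketch of the difference, using the decode() function from the core Encode module. The two bytes are typed in directly here to stand in for what would be read from a file opened without any encoding specified.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two bytes of a UTF-8 copyright symbol, as they would arrive from a
# file that was opened without telling Perl the encoding.
my $raw = "\xC2\xA9";
my $raw_length       = length $raw;
my $raw_is_copyright = ($raw =~ /\A\x{A9}\z/) ? 1 : 0;

# The same bytes after telling Perl they are UTF-8.
my $text = decode('UTF-8', $raw);
my $text_length       = length $text;
my $text_is_copyright = ($text =~ /\A\x{A9}\z/) ? 1 : 0;

print "undecoded: length $raw_length, matches copyright: $raw_is_copyright\n";
print "decoded:   length $text_length, matches copyright: $text_is_copyright\n";
```

Undecoded, Perl sees two characters and the match fails; decoded, it sees one character and the match succeeds.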
In your case, if the input was MARC records encoded in UTF-8, the Perl MARC
modules will have picked this up and will correctly flag all the data as UTF-8.
But Perl is then at liberty to store it in memory as ISO Latin 1 to save space.
When you call as_usmarc(), the MARC::File::USMARC module will
build a single string containing the whole record, but as far as I can tell
from the source code, it does not do anything special about the character set.
If the record had UTF-8 encoding when read in, the as_usmarc() value will be
flagged as being in UTF-8. If you have not specified UTF-8 during the open
command or via binmode, then when writing the string to the file it would be
converted to your local 8-bit encoding (e.g. ISO-Latin-1). This would result
in a record which is a bit of a mess, to say the least, because the LDR will
indicate Unicode and the content may not be. You might also get the warning
"Wide character in print" if any characters outside ISO Latin 1 were included,
but a copyright symbol would silently be converted to the wrong representation.
Any record in MARC8, however, will be read in as such and will not be mucked
about with by Perl, which will assume it is all in the local 8-bit encoding.
To output it successfully you should avoid opening the output file with UTF-8
encoding.
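
To illustrate what happens to the copyright symbol on output, here is a rough sketch that does not need the MARC modules at all. It writes a decoded copyright symbol to an in-memory file (standing in for your output file) with and without an encoding layer. The helper bytes_written is my own invention for the example, and it assumes no default layers have been set via the open pragma or PERL_UNICODE.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Write $text through a handle opened with $mode and return the bytes that
# were produced, as upper-case hex. An in-memory file stands in for the
# real output file. The helper name is made up for this illustration.
sub bytes_written {
    my ($mode, $text) = @_;
    my $buffer = '';
    open(my $fh, $mode, \$buffer) or die "Couldn't open in-memory file: $!";
    print {$fh} $text;
    close($fh) or die "Couldn't close in-memory file: $!";
    return uc unpack('H*', $buffer);
}

# One character, U+00A9, decoded from UTF-8 input.
my $copyright = decode('UTF-8', "\xC2\xA9");

my $no_layer   = bytes_written('>', $copyright);
my $utf8_layer = bytes_written('>:encoding(UTF-8)', $copyright);

print "no layer:   $no_layer\n";
print "UTF-8 layer: $utf8_layer\n";
```

Without a layer the symbol goes out as the single byte A9 with no warning; with the UTF-8 layer it goes out as C2 A9, matching the original record.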
In summary:
1. If reading UTF-8 encoded records via the MARC modules, make sure any file
you write is opened with '>:encoding(UTF-8)'
2. If handling records encoded in MARC8, use '>:raw' when outputting.
3. Do not use '>:raw' with UTF-8 encoded records, as any characters in the range
U+0080 to U+00FF are at risk of being mangled. Perl's internal encoding of the
string may not be what you expect, because it depends on whether characters
from U+0100 upwards are included.
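
One way to apply the three points above is to choose the open mode from the record leader. This is only a sketch: output_mode_for is a hypothetical helper, the two leaders are made-up samples, and it assumes leader position 09 is 'a' for Unicode records and blank for MARC8, as in MARC 21.

```perl
use strict;
use warnings;

# Hypothetical helper: pick an open() mode from a MARC 21 leader.
# Leader position 09 is 'a' for Unicode (UTF-8) records and blank for MARC8.
sub output_mode_for {
    my ($leader) = @_;
    return substr($leader, 9, 1) eq 'a' ? '>:encoding(UTF-8)' : '>:raw';
}

# Made-up sample leaders, identical except for position 09.
my $utf8_mode  = output_mode_for('00714cam a2200205 a 4500');
my $marc8_mode = output_mode_for('00714cam  2200205 a 4500');

print "Unicode record: $utf8_mode\n";
print "MARC8 record:   $marc8_mode\n";
```

In a real script the leader would come from the record itself, for example via $record->leader on a MARC::Record object, and the returned mode would go in as the second argument of open. Note that this gives one mode per file, so it only helps if all the records going into that file share an encoding.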
It *is* possible to read and write records in a mixture of encodings, but you
will need to keep your head!! If you are modifying records you need to ensure
any additional text you introduce is supplied in the appropriate encoding as
the MARC modules are not clever enough to handle