Re: Opening & writing to UTF-8 files; copyright symbol again -- solution

2015-11-16 Thread Colin Campbell
On Fri, Nov 13, 2015 at 10:05:01PM +, Highsmith, Anne L wrote:
> I should probably say, "apparent solution" 'cause character set issues never 
> seem to end.
> 
> However, combining Jon Gorman's recommendation with some Googling, I get:
> 
> my $outfile='4788022.edited.bib';
> open (my $output_marc, '>', $outfile) or die "Couldn't open file $!" ;
> binmode($output_marc, ':utf8');
> 
You can set the correct encoding succinctly on opening files 
 e.g. open my $fh, '>:encoding(UTF-8)', $outfile
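
A minimal sketch of that form in use, assuming a hypothetical file name (example.out) and writing only a copyright symbol so that the resulting bytes are easy to inspect:

```perl
use strict;
use warnings;

my $outfile = 'example.out';    # hypothetical name, for illustration only

# Apply the encoding layer at open time instead of a later binmode().
open(my $fh, '>:encoding(UTF-8)', $outfile)
    or die "Couldn't open $outfile: $!";
print {$fh} "\x{A9}";           # U+00A9 COPYRIGHT SIGN
close($fh) or die "Couldn't close $outfile: $!";

# Read the file back as raw bytes to see what was actually written.
open(my $in, '<:raw', $outfile) or die "Couldn't open $outfile: $!";
my $bytes = do { local $/; <$in> };
close($in);

my $hex = join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
print "$hex\n";                 # C2 A9
unlink $outfile;
```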

Hope that helps
C.

-- 
Colin Campbell
Chief Software Engineer,
PTFS Europe Limited
Content Management and Library Solutions
+44 (0) 800 756 6803 (phone)
+44 (0) 7759 633626  (mobile)
colin.campb...@ptfs-europe.com
skype: colin_campbell2

http://www.ptfs-europe.com


RE: Opening & writing to UTF-8 files; copyright symbol again -- solution

2015-11-16 Thread PHILLIPS M.E.
> You can set the correct encoding succinctly on opening files
>  e.g. open my $fh, '>:encoding(UTF-8)', $outfile

You might also see this even more succinct variant:

open my $fh, '>:utf8', $outfile

though technically speaking, that will not give you guaranteed conformant UTF-8 
because the :utf8 layer does no validity checking, so the output could contain 
code points that are excluded from the Unicode standard (for example, values 
above U+10FFFF).  So Colin's suggestion is safer.
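
The same strict/lax distinction can be seen in the Encode module, which is a convenient place to observe it without touching any files. This is only a sketch: it uses Encode's "UTF-8" and "utf8" encodings as a stand-in for the two I/O layers, and the helper can_encode() is a hypothetical name introduced here. The strict encoding should refuse a code point above U+10FFFF, while the lax one should accept it.

```perl
use strict;
use warnings;
no warnings 'utf8';    # silence warnings about the deliberately invalid code point
use Encode qw(encode find_encoding);

# Encode resolves the two spellings to different encodings.
print find_encoding('UTF-8')->name, "\n";
print find_encoding('utf8')->name, "\n";

# A value above U+10FFFF: representable inside Perl, but not valid Unicode.
my $beyond = chr(0x110000);

# Hypothetical helper: returns 1 if the string encodes cleanly, 0 if Encode croaks.
sub can_encode {
    my ($encoding, $string) = @_;    # $string is a private copy
    return eval { encode($encoding, $string, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

my $lax_ok    = can_encode('utf8',  $beyond);
my $strict_ok = can_encode('UTF-8', $beyond);

print "lax utf8 accepts it:     $lax_ok\n";
print "strict UTF-8 accepts it: $strict_ok\n";
```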

Matthew



RE: Opening & writing to UTF-8 files; copyright symbol again -- solution

2015-11-16 Thread PHILLIPS M.E.
The copyright symbol is not one of the characters for which there are two 
representations.

One thing that can confuse people about Unicode is the distinction between the 
“code point”[1] and the representation of the code point in the various Unicode 
transformation formats such as UTF-8, UTF-16, UTF-32 and so on.

The copyright symbol has code point A9 (represented in hexadecimal) in both 
ISO-Latin-1 and Unicode, more commonly written with some leading zeros, e.g. 
U+00A9. But when A9 is represented in UTF-8 the actual sequence of bytes in 
memory or in a file is C2 followed by A9.  In UTF-16 and UTF-32 you will see an 
A9 and enough zero bytes to pad to 2 or 4 bytes respectively, but there you 
will have the complication that the bytes may be in big-endian or little-endian 
order, e.g. in UTF-32, A9 00 00 00 for little-endian or 00 00 00 A9 for big-endian.
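
As a quick check of the byte sequences described above, the following sketch encodes the single code point U+00A9 in several encodings with the core Encode module and prints each result in hex. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $copyright = "\x{A9}";    # one code point, U+00A9

my %hex;
for my $enc (qw(ISO-8859-1 UTF-8 UTF-16BE UTF-16LE UTF-32BE UTF-32LE)) {
    my $bytes = encode($enc, $copyright);
    $hex{$enc} = join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
    printf "%-10s %s\n", $enc, $hex{$enc};
}
# ISO-8859-1 A9
# UTF-8      C2 A9
# UTF-16BE   00 A9
# UTF-16LE   A9 00
# UTF-32BE   00 00 00 A9
# UTF-32LE   A9 00 00 00
```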

I always find the www.fileformat.info pages useful for reference [2].

Matthew


[1] https://en.wikipedia.org/wiki/Code_point
[2] http://www.fileformat.info/info/unicode/char/a9/index.htm

From: Shelley Doljack [mailto:sdolj...@stanford.edu]
Sent: 13 November 2015 22:30
To: Highsmith, Anne L; perl4lib@perl.org
Subject: RE: Opening & writing to UTF-8 files; copyright symbol again -- 
solution

Hey, that’s my post! Anyways, I haven’t really looked into what your problem 
is, but when you said that the copyright character is getting transformed to A9 
even though it is supposedly stored as C2 A9 in the database, it made me think 
of how there can be two UTF-8 representations for the same character in some 
sections of the Unicode set. I wonder if that is somehow happening for you.

Shelley


RE: Opening & writing to UTF-8 files; copyright symbol again -- solution

2015-11-16 Thread PHILLIPS M.E.
> However, combining Jon Gorman's recommendation with some Googling, I get:
> 
> my $outfile='4788022.edited.bib';
> open (my $output_marc, '>', $outfile) or die "Couldn't open file $!" ;
> binmode($output_marc, ':utf8');
> 
> The open statement may not be quite correct, as I am not familiar with the
> more current techniques for opening file handles that Jon mentioned.
> However, when I use those instructions to open the output file rather than
> what I had before, the copyright symbol does indeed come across as C2 A9 as
> it was in the original record. I didn't want to use the utf8, because I've
> tried that before and ended up with double-encoding (and a real mess). But
> I'll continue testing.

I think I understand how your original problem came about, but I may not be 
able to explain it!  It is important to understand that inside Perl a string 
can be encoded in one of two ways:

1) stored in UTF-8, in which case all ASCII-range characters (roughly space, 
A-Z, a-z, 0-9 and most of the punctuation you see on a keyboard) will be stored 
in a single byte per character, and other characters will be stored in 2, 3, or 
4 bytes

2) stored in an eight-bit character set such as ISO Latin 1. In this situation 
all characters are stored as a single byte, but non-western European characters 
will be unavailable.

Perl tries to store strings in the second form by preference, as it saves 
memory and processing time, but it does this in a way which is transparent to 
the user, so if you have the string "abc" it will be in the second form.  If 
you append a copyright symbol it will still be in the second form as that 
symbol is present in ISO Latin 1, but if you append a w-circumflex (as used in 
Welsh, and not available in ISO Latin 1) or any Chinese, Greek or Cyrillic 
character, then the string will be re-encoded in UTF-8 and Perl will flag it to 
remember that is how it has been stored.  You as a user do not (generally) need 
to worry.
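
The switch between the two internal forms can be observed with utf8::is_utf8(), which reports whether a string carries the internal UTF-8 flag. This is a sketch for illustration only: the flag is an implementation detail that ordinary code should not depend on, and the variable names are made up here.

```perl
use strict;
use warnings;

my $s = "abc";
my @flag;
push @flag, utf8::is_utf8($s) ? 1 : 0;    # plain ASCII: eight-bit form

$s .= "\x{A9}";                           # copyright symbol, present in ISO Latin 1
push @flag, utf8::is_utf8($s) ? 1 : 0;    # still eight-bit form

$s .= "\x{175}";                          # w-circumflex, not in ISO Latin 1
push @flag, utf8::is_utf8($s) ? 1 : 0;    # now re-encoded and flagged

print "@flag\n";                          # 0 0 1
print length($s), "\n";                   # 5 characters either way
```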

The complication is what to do when reading stuff from files or writing them 
out again, because then Perl has to decide how to represent stuff for the 
outside world.  To be successful, you have to tell Perl what encoding is used 
for anything you are reading in, so that it can be stored appropriately.  If 
you read in a copyright symbol from a UTF-8 encoded file but fail to tell Perl 
it was in UTF-8, Perl will think it is two characters, C2 followed by A9.  Now A9 
happens to be the copyright symbol in ISO Latin 1, but C2 is A-circumflex.  If 
you write it out again, Perl will operate in ISO Latin 1 unless instructed 
otherwise, and you will get C2 A9 in the file, which is probably fine, but Perl 
did not know that it was meant to be a single character so processing you might 
have done, like regular expression matches and finding the length of the 
string, would not have worked as expected.
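
A small sketch of that effect, assuming a temporary file that holds just the two bytes C2 A9. Reading it with no decoding layer yields two characters, while reading it through ':encoding(UTF-8)' yields the single copyright symbol. Here ':raw' stands in for an open that names no encoding.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a file containing the UTF-8 bytes for U+00A9.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp);
print {$tmp} "\xC2\xA9";
close($tmp) or die "Couldn't close $path: $!";

# Read without telling Perl the encoding.
open(my $plain, '<:raw', $path) or die "Couldn't open $path: $!";
my $undecoded = do { local $/; <$plain> };
close($plain);

# Read again, this time declaring the encoding.
open(my $layered, '<:encoding(UTF-8)', $path) or die "Couldn't open $path: $!";
my $decoded = do { local $/; <$layered> };
close($layered);

printf "without layer: length %d\n", length($undecoded);    # 2
printf "with layer:    length %d, U+%04X\n",
    length($decoded), ord($decoded);                        # 1, U+00A9
```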

In your case, if the input was MARC records encoded in UTF-8, the Perl MARC 
modules will have picked this up and will correctly flag all the data as UTF-8. 
But Perl is then at liberty to store it in memory as ISO Latin 1 to save space. 
When you use the as_usmarc() function the MARC::File::USMARC module will 
build a single string containing the whole record, but as far as I can tell 
from the source code, it does not do anything special about the character set. 
If the record had UTF-8 encoding when read in, the as_usmarc() value will be 
flagged as being in UTF-8.  If you have not specified UTF-8 during the open 
command or via binmode, then when writing the string to the file it would be 
written one byte per character (in effect ISO Latin 1) where possible.  This would result 
in a record which is a bit of a mess, to say the least, because the LDR will 
indicate Unicode and the content may not be.  You might also get the warning 
"wide character in print" if any characters outside ISO Latin 1 were included, 
but a copyright symbol would silently be converted to the wrong representation.
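
The silent conversion can be reproduced without the MARC modules. In this sketch a string holding only a copyright symbol is forced into the flagged UTF-8 form with utf8::upgrade(), standing in for decoded record data, and is then written through two different layers. The helper bytes_written() is a hypothetical name introduced for the illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text through the given layer, return the file's bytes in hex.
sub bytes_written {
    my ($layer, $text) = @_;
    my ($tmp, $path) = tempfile(UNLINK => 1);
    close($tmp);
    open(my $out, ">$layer", $path) or die "Couldn't open $path: $!";
    print {$out} $text;
    close($out) or die "Couldn't close $path: $!";
    open(my $in, '<:raw', $path) or die "Couldn't open $path: $!";
    my $bytes = do { local $/; <$in> };
    close($in);
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

my $text = "\x{A9}";
utf8::upgrade($text);    # flagged as UTF-8 internally, like decoded record data

my $raw_hex  = bytes_written(':raw', $text);
my $utf8_hex = bytes_written(':encoding(UTF-8)', $text);

print "no encoding layer: $raw_hex\n";     # A9    (no warning is given)
print "encoding(UTF-8):   $utf8_hex\n";    # C2 A9
```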

Any record in MARC8, however, will be read in as such and will not be mucked 
about with by Perl: it will assume it is all in the local 8-bit encoding, and 
to output it successfully you should avoid opening the output file with UTF-8 
encoding.

In summary:

1. If reading UTF-8 encoded records via the MARC modules, make sure any file 
you write is opened with '>:encoding(UTF-8)'

2. If handling records encoded in MARC8, use '>:raw' when outputting.

3. Do not use '>:raw' with UTF-8 encoded records as any characters in the range 
U+0080 to U+00FF are at risk of being mangled because Perl's internal encoding 
of the string may not be what you expect, being dependent on whether characters 
from U+0100 upwards are included.

It *is* possible to read and write records in a mixture of encodings, but you 
will need to keep your head!!  If you are modifying records you need to ensure 
any additional text you introduce is supplied in the appropriate encoding as 
the MARC modules are not clever enough to handle