RE: Opening & writing to UTF-8 files; copyright symbol again -- solution

2015-11-16 Thread PHILLIPS M.E.
> You can set the correct encoding succinctly on opening files
>  e.g. open my $fh, '>:encoding(UTF-8)', $outfile

You might also see this even more succinct variant:

open my $fh, '>:utf8', $outfile

though, technically speaking, that will not give you guaranteed conformant UTF-8: 
the ':utf8' layer performs no validation, so the output can contain code points 
that are excluded from the Unicode standard, such as surrogates.  So Colin's 
suggestion is safer.
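
To see the difference in action, here is a minimal sketch (the file names are 
placeholders, and the exact warning text varies between Perl versions):

  use warnings;
  no warnings 'surrogate';        # let us create the bad value quietly
  my $bad = chr(0xD800);          # a lone surrogate: excluded from Unicode

  open my $lax, '>:utf8', 'lax.txt' or die $!;
  print $lax $bad;                # ':utf8' writes it without validation
  close $lax;

  open my $checked, '>:encoding(UTF-8)', 'checked.txt' or die $!;
  print $checked $bad;            # ':encoding(UTF-8)' warns that it does not map
  close $checked;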

Matthew



RE: Opening & writing to UTF-8 files; copyright symbol again -- solution

2015-11-16 Thread PHILLIPS M.E.
The copyright symbol is not one of the characters for which there are two 
representations.

One thing that can confuse people about Unicode is the distinction between the 
“code point”[1] and the representation of the code point in the various Unicode 
transformation formats such as UTF-8, UTF-16, UTF-32 and so on.

The copyright symbol has code point A9 (in hexadecimal) in both ISO Latin 1 and 
Unicode, more commonly written with some leading zeros as U+00A9.  But when 
U+00A9 is represented in UTF-8 the actual sequence of bytes in memory or in a 
file is C2 followed by A9.  In UTF-16 and UTF-32 you will see an A9 and enough 
zero bytes to pad to 2 or 4 bytes respectively, but there you have the 
complication that the bytes may be in big-endian or little-endian order: in 
UTF-16, A9 00 for little-endian or 00 A9 for big-endian, and in UTF-32, 
A9 00 00 00 or 00 00 00 A9 respectively.
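
You can see those byte sequences directly from Perl with the core Encode 
module; a quick sketch:

  use Encode qw(encode);
  for my $fmt ('UTF-8', 'UTF-16LE', 'UTF-32BE') {
      my $bytes = encode($fmt, "\x{A9}");
      printf "%-9s %s\n", $fmt,
          join ' ', map { sprintf '%02X', ord } split //, $bytes;
  }
  # Prints:
  # UTF-8     C2 A9
  # UTF-16LE  A9 00
  # UTF-32BE  00 00 00 A9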

I always find the www.fileformat.info pages useful 
for reference [2].

Matthew


[1] https://en.wikipedia.org/wiki/Code_point
[2] http://www.fileformat.info/info/unicode/char/a9/index.htm

From: Shelley Doljack [mailto:sdolj...@stanford.edu]
Sent: 13 November 2015 22:30
To: Highsmith, Anne L; perl4lib@perl.org
Subject: RE: Opening & writing to UTF-8 files; copyright symbol again -- 
solution

Hey, that’s my post! Anyways, I haven’t really looked into what your problem 
is, but when you said that the copyright character is getting transformed to A9 
even though it is supposedly stored as C2 A9 in the database, it made me think 
of how there can be two UTF-8 representations for the same character in some 
sections of the Unicode set. I wonder if that is somehow happening for you.

Shelley


RE: Opening & writing to UTF-8 files; copyright symbol again -- solution

2015-11-16 Thread PHILLIPS M.E.
> However, combining Jon Gorman's recommendation with some Googling, I get:
> 
> my $outfile='4788022.edited.bib';
> open (my $output_marc, '>', $outfile) or die "Couldn't open file $!" ;
> binmode($output_marc, ':utf8');
> 
> The open statement may not be quite correct, as I am not familiar with the
> more current techniques for opening file handles that Jon mentioned.
> However, when I use those instructions to open the output file rather than 
> what
> I had before, the copyright symbol does indeed come across as C2 A9 as it was
> in the original record. I didn't want to use the utf8, because I've tried that
> before and ended up with double-encoding (and a real mess). But I'll continue
> testing.

I think I understand how your original problem came about, but I may not be 
able to explain it!  It is important to understand that inside Perl a string 
can be encoded in one of two ways:

1) stored in UTF-8, in which case all ASCII-range characters (roughly space, 
A-Z, a-z, 0-9 and most of the punctuation you see on a keyboard) will be stored 
in a single byte per character, and other characters will be stored in 2, 3, or 
4 bytes

2) stored in an eight-bit character set such as ISO Latin 1. In this situation 
all characters are stored as a single byte, but non-western European characters 
will be unavailable.

Perl tries to store strings in the second form by preference, as it saves 
memory and processing time, but it does this in a way which is transparent to 
the user, so if you have the string "abc" it will be in the second form.  If 
you append a copyright symbol it will still be in the second form as that 
symbol is present in ISO Latin 1, but if you append a w-circumflex (as used in 
Welsh, and not available in ISO Latin 1) or any Chinese, Greek, or Cyrillic 
character, then the string will be re-encoded in UTF-8 and Perl will flag it to 
remember that is how it has been stored.  You as a user do not (generally) need 
to worry.
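
If you are curious, you can watch this happen with the core Devel::Peek 
module; a minimal sketch:

  use Devel::Peek;
  my $s = "abc\x{A9}";   # copyright symbol: still one byte per character
  Dump($s);              # the FLAGS line shows no UTF8 flag
  $s .= "\x{175}";       # w-circumflex forces internal re-encoding
  Dump($s);              # the FLAGS line now includes UTF8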

The complication is what to do when reading from files or writing out again, 
because then Perl has to decide how to represent the data for the outside 
world.  To be successful, you have to tell Perl what encoding is used for 
anything you are reading in, so that it can be stored appropriately.  If you 
read in a copyright symbol from a UTF-8 encoded file but fail to tell Perl it 
was in UTF-8, Perl will think it is character C2 followed by A9.  Now A9 
happens to be the copyright symbol in ISO Latin 1, but C2 is A-circumflex.  If 
you write it out again, Perl will operate in ISO Latin 1 unless instructed 
otherwise, and you will get C2 A9 in the file.  That is probably fine, but 
Perl did not know it was meant to be a single character, so any processing you 
might have done, such as regular expression matches or finding the length of 
the string, would not have worked as expected.
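
A minimal sketch of that failure mode, assuming a file in.txt containing 
nothing but a copyright sign encoded as C2 A9:

  open my $bytes_in, '<', 'in.txt' or die $!;        # no encoding declared
  my $s = <$bytes_in>;
  print length($s), "\n";     # 2: Perl saw C2 and A9 as two characters

  open my $utf8_in, '<:encoding(UTF-8)', 'in.txt' or die $!;
  my $t = <$utf8_in>;
  print length($t), "\n";     # 1: decoded to the single character U+00A9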

In your case, if the input was MARC records encoded in UTF-8, the Perl MARC 
modules will have picked this up and will correctly flag all the data as UTF-8. 
But Perl is then at liberty to store it in memory as ISO Latin 1 to save space. 
 When you use the as_usmarc() function the MARC::File::USMARC.pm module will 
build a single string containing the whole record, but as far as I can tell 
from the source code, it does not do anything special about the character set. 
If the record had UTF-8 encoding when read in, the as_usmarc() value will be 
flagged as being in UTF-8.  If you have not specified UTF-8 during the open 
command or via binmode, then when writing the string to the file it would be 
converted to your local 8-bit encoding (e.g. ISO-Latin-1).  This would result 
in a record which is a bit of a mess, to say the least, because the LDR will 
indicate Unicode and the content may not be.  You might also get the warning 
"wide character in print" if any characters outside ISO Latin 1 were included, 
but a copyright symbol would silently be converted to the wrong representation.
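
A minimal sketch of the safe round trip for UTF-8 records (MARC::Batch from 
CPAN; the file names are placeholders):

  use MARC::Batch;
  my $batch = MARC::Batch->new('USMARC', 'records_utf8.mrc');
  open my $out, '>:encoding(UTF-8)', 'out.mrc'
      or die "Couldn't open out.mrc: $!";
  while (my $record = $batch->next) {
      print $out $record->as_usmarc;   # string is encoded correctly on output
  }
  close $out;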

Any record in MARC8, however, will be read in as such and will not be mucked 
about with by Perl: it will assume it is all in the local 8-bit encoding, and 
to output it successfully you should avoid opening the output file with UTF-8 
encoding.

In summary:

1. If reading UTF-8 encoded records via the MARC modules, make sure any file 
you write is opened with '>:encoding(UTF-8)'

2. If handling records encoded in MARC8, use '>:raw' when outputting.

3. Do not use '>:raw' with UTF-8 encoded records: any characters in the range 
U+0080 to U+00FF are at risk of being mangled, because Perl's internal 
encoding of the string may not be what you expect, depending on whether 
characters from U+0100 upwards are present.

It *is* possible to read and write records in a mixture of encodings, but you 
will need to keep your head!  If you are modifying records, you need to ensure 
any additional text you introduce is supplied in the appropriate encoding, as 
the MARC modules are not clever enough to handle the conversion for you.

RE: send emails via perl

2014-11-19 Thread PHILLIPS M.E.
> On Wed, Nov 19, 2014 at 02:19:26PM +, PHILLIPS M.E. wrote:
>> open (MAIL, '|-', '/bin/mailx', '-s', $subject, @addresses)
>>   || die "Failed to e-mail report: $!\n";
>
> what's the point of using perl then?

There's more than one way to do it.

If mailx is already installed and configured then this allows you to send your 
e-mail from your Perl script without installing other modules.  I have 
sometimes had to use servers where installing extra modules from CPAN was not 
under my control.
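
For completeness, a sketch of the whole pattern ($subject, @addresses and 
$report_text are placeholders):

  open(my $mail, '|-', '/bin/mailx', '-s', $subject, @addresses)
      or die "Failed to e-mail report: $!\n";
  print $mail $report_text;
  close($mail) or warn "mailx exited with status $?\n";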

I was assuming that the original poster's script was doing various other 
things and needed to e-mail a report at the end.  Obviously, if the script 
were only sending an e-mail, there would not be much point doing it this way.

Matthew



RE: [librecat-dev] A common MARC record path language

2014-02-25 Thread PHILLIPS M.E.
> You also could consider to grok Jason Thomale's "Interpreting MARC:
> Where's the Bibliographic Data?"  http://journal.code4lib.org/articles/3832

That's a very good article, as it highlights the two-sided problem of the 
prescribed punctuation: it gets in the way of extracting parts of the data, 
yet it also provides extra context for the subfields.

> It is not a MOM (MARC Object Model) or rather an object model for
> any format derived from ISO 2709 and its concepts of files, records,
> (flavors of) fields and subfields and therefore no abstract API
> can be specified (prescribing that some operation X is defined on
> record objects and yields field objects).

If we are just talking about ISO 2709, the whole family of MARC formats in 
general, then you have to remember that UNIMARC and obsolete formats like 
UKMARC have very different requirements.  UKMARC and UNIMARC are actually much 
easier to work with than MARC21, because the ISBD punctuation is not carried 
in the record but is generated from the subfield tags.  So you don't have to 
say "give me the 245 $a and $b, but strip ' /' off the end if present", 
because the slash is not there.  And there is a different subfield tag to 
introduce a parallel title, so you don't need to distinguish ':$b' from '=$b'.
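
In Perl terms, the MARC21 chore being described looks something like this (a 
sketch using MARC::Record; $record is assumed to hold a parsed record):

  my $f245  = $record->field('245');
  my $title = join ' ', grep { defined }
              $f245->subfield('a'), $f245->subfield('b');
  $title =~ s{\s*[/:;=]\s*\z}{};   # strip the trailing prescribed punctuation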

In the UK most libraries have been using MARC21 for a decade or more now.  I don't 
know how much use is still made of UNIMARC, or the other national formats, nor 
how good they were.  It seems as though in the last twenty years many countries 
have made moves towards MARC21 because of the sheer numbers of records 
available in that format.  It's just a pity that it's possibly the worst of the 
ISO 2709 formats to work with if you want to repurpose the data!

I hope that BIBFRAME is not going to make the same mistakes.  I have not been 
following that initiative in detail, but I've seen a few examples of data with 
punctuation hanging about at the end.  Hard to tell whether it's prescribed 
punctuation or copying from the book.

The title field, in particular, is much more akin to HTML markup than data 
fields in a database.  In antiquarian cataloguing rules like DCRM, the emphasis 
is on exact transcription from the title page, where the presence or absence of 
punctuation can make a difference in identifying variant editions.  In MARC21 
we get the crazy situation where the cataloguers transcribe the exact 
punctuation from the title page and *add* the ISBD punctuation to the MARC21 
record.  This makes it very hard to present the lay-person with anything 
meaningful.

Matthew

-- 
Matthew Phillips
Head of Digital and Bibliographic Services,
Durham University Library, Stockton Road, Durham, DH1 3LY
+44 (0)191 334 2941



RE: printing UTF-8 encoded MARC records with as_usmarc

2012-08-01 Thread PHILLIPS M.E.
> -----Original Message-----
> From: Shelley Doljack [mailto:sdolj...@stanford.edu]
> Sent: 31 July 2012 20:18

> The problem was I wasn't telling perl to output UTF-8. Now that I added
> binmode(FILE, ':utf8') to my script, the problem is fixed. However, it sounds
> like once I set binmode to UTF-8 everything will be interpreted as such, even
> when the record is in MARC-8. Is that right? So this means that I can only use
> my script with a file of records where all of them are encoded in UTF-8. If I
> want to run the script against a file with all MARC-8 encoding, then I'd need
> to remove the binmode line.

It depends how much manipulation of the records you are doing in the script.  
One approach is to use

binmode(FILE, ':raw');

for both input and output.  Perl will then keep the bytes of the records 
exactly as they are.  You won't be able to test for exotic characters so 
easily, and amending field content would be inadvisable, but if all you are 
doing is something like reading in the records and filtering out any that have 
no 245 field, or something fairly basic like that, this could be the best 
approach.
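
A sketch of that kind of filter (MARC::Batch from CPAN; the file names are 
placeholders):

  use MARC::Batch;
  my $batch = MARC::Batch->new('USMARC', 'in.mrc');
  open my $out, '>:raw', 'out.mrc' or die $!;
  while (my $record = $batch->next) {
      # keep only records that have a 245; bytes pass through unre-encoded
      print $out $record->as_usmarc if $record->field('245');
  }
  close $out;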

The MARC::Record module does not seem to care how the records are encoded.  
It's only once you start altering field content, testing field content, or 
adding fields that the character set being used becomes an issue.  Removing 
fields would be fine too.

MARC-8 can be very complex, particularly if other code tables like CJK are 
invoked, or even just Greek or Cyrillic.  If you were manipulating field 
content in that kind of way, then converting everything to UTF-8 would make 
things very much easier.

Matthew

-- 
Matthew Phillips
Electronic Systems Librarian, Durham University
Durham University Library, Stockton Road, Durham, DH1 3LY
+44 (0)191 334 2941




RE: MARC::Batch question

2011-10-14 Thread PHILLIPS M.E.
> I have to admit my Perl skill is very limited, so this may be a dumb
> question, but I can't seem to find the answer.  When I use MARC::Batch to
> read records from our catalog (III) export file, I can't seem to find a way
> to skip an error record.  When I ran the following against an III export
> MARC file, it stopped at a record with an error:
>
>   utf8 "\xBC" does not map to Unicode at /usr/lib/perl/5.10/Encode.pm line 174.

I'm surprised that the error line is being reported from the Encode
module.  Usually modules are written so that an error report tells you
whereabouts the error occurred in the code that was using a feature
provided by the module.  This makes it harder to tell exactly which line
of your own script is triggering the error.  I guess that the author of
the Encode module really did not expect this to happen.

> Ideally I would like to be able to log the error and move to the next
> record.

In general you can trap errors by using the eval construct:

eval {
  # ... code that might trigger the error in here ...
};
warn $@ if $@;  # report any error from the eval block as a warning

See http://perldoc.perl.org/functions/eval.html

Put something like that around the part of your code that triggers the
error and you should get a bit further.
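
Applied to the batch loop, that might look like this (a sketch; exactly what 
can be caught depends on where MARC::Batch raises the error):

  while (1) {
      my $record = eval { $batch->next };
      if ($@) {
          warn "Skipping bad record: $@";
          next;
      }
      last unless defined $record;
      # ... process $record here ...
  }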

One thing to ask, of course, is why there is an error in the first
place!  It looks like the MARC record is not being converted for the
right character set.  I see you have set strict to be off for the batch.
We have a Millennium system here, and the internal coding is MARC8
rather than UTF8.  I've found that Innovative has a sort of hack to
allow arbitrary Unicode characters to be carried in the MARC record.  We
notice this particularly with records containing directional quotation
marks.  One of the effects is that byte values such as 0x1D can occur
mid-record.  The MARC::File::USMARC module assumes that 0x1D is the end
of record marker.  To get the module to split the records accurately I
had to modify the module as follows:

Change the lines in USMARC.pm that say

  local $/ = END_OF_RECORD;
  my $usmarc = <$fh>;

to instead say

  ###################################################################
  # Altered by Matthew Phillips to cope with 0x1D within field values
  ###################################################################
  #local $/ = END_OF_RECORD;
  #my $usmarc = <$fh>;
  my $length;
  read($fh, $length, 5) || return;
  return unless $length >= 5;

  my $record;
  read($fh, $record, $length - 5) || return;
  my $usmarc = $length . $record;

  ###################################################################
  # End of alteration
  ###################################################################

You should then get all records being split at the right places.  The
alteration relies on the byte count at the start of the record being
accurate, and works nicely for Innovative record output, but if you're
going to be reading records from other sources it may not help as the
byte count can be unreliable.  I submitted the patch to the module
maintainer a few weeks ago and he was considering how to incorporate it
as an option, as it's not appropriate in all circumstances.

There may well still be character conversion issues, however, because
the MARC::Charset module does not know about Innovative's encoding, and
is slightly broken in other respects.  I have not written a patch for
this aspect yet.

Hope that helps a bit!

-- 
Matthew Phillips
Electronic Systems Librarian, Durham University
Durham University Library, Stockton Road, Durham, DH1 3LY
+44 (0)191 334 2941