RE: printing UTF-8 encoded MARC records with as_usmarc

2012-08-15 Thread Doran, Michael D
Hi Devon,

> I just recently came across this presentation which lays out pretty much
> all the issues with Unicode in perl, and makes some recommendations for
> best practices.

While Nick Patch's presentation is excellent, I'm not sure that it "lays out 
pretty much all the issues with Unicode in perl".  ;-)

To fit that bill, I highly recommend this series of talks given by Tom 
Christiansen at OSCON 2011:

 1. Perl Unicode Essentials
 2. Unicode in Perl Regexes
 3. Unicode Support Shootout: The Good, The Bad, & the (mostly) Ugly

http://training.perl.com/OSCON2011/index.html
(resolves to http://98.245.80.27/tcpc/OSCON2011/index.html)

If you read through those presentations and disagree, I promise to buy you a 
beer at the next conference (code4lib?) we both attend.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/

> -Original Message-
> From: Smith,Devon [mailto:smit...@oclc.org]
> Sent: Tuesday, July 31, 2012 8:26 AM
> To: William Dueber; Shelley Doljack
> Cc: perl4lib@perl.org
> Subject: RE: printing UTF-8 encoded MARC records with as_usmarc
> 
> I just recently came across this presentation which lays out pretty much
> all the issues with Unicode in perl, and makes some recommendations for
> best practices. You may find some general insight into the whole
> situation by going over it.
> 
> http://www.slideshare.net/nickpatch/fundamental-unicode-at-dcbaltimore-perl-workshop-2012
> 
> /dev
> --
> Devon Smith
> Consulting Software Engineer
> OCLC Research
> http://www.oclc.org/research/people/smith.htm
> 
> 
> -Original Message-
> From: William Dueber [mailto:dueb...@umich.edu]
> Sent: Monday, July 30, 2012 8:14 PM
> To: Shelley Doljack
> Cc: perl4lib@perl.org
> Subject: Re: printing UTF-8 encoded MARC records with as_usmarc
> 
> First off, it's entirely possible that you have bad UTF-8 (perhaps rogue
> MARC-8, perhaps just lousy characters) in your MARC. I know we have plenty
> of that crap.
> 
> You need to tell perl that you'll be outputting UTF-8 using 'binmode':
> 
>   binmode(FILE, ':utf8');
> 
> In general, you'll want to do this to basically every file you open for
> reading or writing.
> 
> A great overview of Perl and UTF-8 can be found at:
> 
> http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default
> 
> 
> 
> 
> 
> On Mon, Jul 30, 2012 at 6:51 PM, Shelley Doljack
> wrote:
> 
> > Hi,
> >
> > I wrote a script that extracts marc records from a file given certain
> > conditions and puts them in a new file. When my input record is correctly
> > encoded in UTF-8 and I run my script from windows command prompt, this
> > warning message appears: "Wide character in print at
> > record_extraction.pl line 99" (the line in my script where I print to a
> > new file using as_usmarc). I compared the extracted record before and
> > after in MarcEdit and the diacritic was changed. I tried marcdump
> > newfile.mrc to see what happens and I get this error: "utf8 \xF4 does not
> > map to Unicode at C:/Perl64/lib/Encode.pm line 176." When I run my
> > extraction script again with MARC-8 encoded data then I don't have the
> > same problem.
> >
> > The basic outline of my script is:
> >
> > my $batch = MARC::Batch->new('USMARC', $input_file);
> >
> > while (my $record = $batch->next()) {
> >  #do some checks
> >  #if checks ok then
> >  print FILE $record->as_usmarc();
> > }
> >
> > Do I need to add something that specifies to interpret the data as UTF-8?
> > Does MARC::Record not handle UTF-8 at all?
> >
> > Thanks,
> > Shelley
> >
> > 
> > Shelley Doljack
> > E-Resources Metadata Librarian
> > Metadata and Library Systems
> > Stanford University Libraries
> > sdolj...@stanford.edu
> > 650-725-0167
> >
> 
> 
> 
> --
> 
> Bill Dueber
> Programmer -- Library Systems
> University of Michigan


Re: printing UTF-8 encoded MARC records with as_usmarc

2012-08-01 Thread Shelley Doljack
Hi Matthew,

Thanks for the advice. For this particular script, I'm not doing any data 
manipulation, so using :raw is probably the approach I want to take. I'm just 
feeding my script a list of record IDs and a MARC file in order to pull out 
records that have the record ID I'm looking for.

Thanks,
Shelley

- Original Message -
> From: "PHILLIPS M.E." 
> To: "Shelley Doljack" , perl4lib@perl.org
> Sent: Wednesday, August 1, 2012 1:56:17 AM
> Subject: RE: printing UTF-8 encoded MARC records with as_usmarc
> 
> > -Original Message-
> > From: Shelley Doljack [mailto:sdolj...@stanford.edu]
> > Sent: 31 July 2012 20:18
> >
> > The problem was I wasn't telling perl to output UTF-8. Now that I added
> > binmode(FILE, ':utf8') to my script, the problem is fixed. However, it
> > sounds like once I set binmode to UTF-8 everything will be interpreted as
> > such, even when the record is in MARC-8. Is that right? So this means that
> > I can only use my script with a file of records where all of them are
> > encoded in UTF-8. If I want to run the script against a file with all
> > MARC-8 encoding, then I'd need to remove the binmode line.
> 
> It depends how much manipulation of the records you are doing in the
> script.  One approach is to use
> 
> binmode(FILE, ':raw');
> 
> for both input and output.  Perl will then keep the bytes of the
> records exactly as they are.  You won't be able to test  for exotic
> characters so easily, and amending field content would be
> inadvisable, but if all you are doing is something like reading in
> the records and filtering out any that have no 245 field, or
> something fairly basic like that, this could be the best approach.
> 
> The MARC::Record module does not seem to care how the records are
> encoded.  It's only once you start altering field content, testing
> field content, or adding fields that the character set being used
> becomes an issue.  Removing fields would be fine too.
> 
> MARC-8 can be very complex, particularly if other code tables like
> CJK are invoked, or even just Greek or Cyrillic.  If you were
> manipulating field content in that kind of way then converting
> everything to UTF-8 would make things very much easier.
> 
> Matthew
> 
> --
> Matthew Phillips
> Electronic Systems Librarian, Durham University
> Durham University Library, Stockton Road, Durham, DH1 3LY
> +44 (0)191 334 2941
> 
> 
> 


Re: printing UTF-8 encoded MARC records with as_usmarc

2012-08-01 Thread Colin Campbell
On Tue, Jul 31, 2012 at 09:25:55AM -0400, Smith,Devon wrote:
> I just recently came across this presentation which lays out pretty much all 
> the issues with Unicode in perl, and makes some recommendations for best 
> practices. You may find some general insight into the whole situation by 
> going over it.
In the course of preparing the latest edition of the Camel book, Tom
Christiansen created a Perl Unicode Cookbook; see
http://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html

It's available in a few different places on the web.
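
The linked article opens with a "standard preamble" for Unicode-aware
scripts. A preamble along those lines looks roughly like the sketch below.
Treat the exact list of pragmas as an approximation and check it against the
article; the last two lines are only a made-up demonstration. Note that the
"use open" line puts a UTF-8 layer on every handle by default, which is the
opposite of what you want for a file of MARC-8 records.

```perl
use utf8;                              # string literals in this file are UTF-8
use v5.12;                             # turns on say and unicode_strings
use strict;
use warnings;
use warnings qw(FATAL utf8);           # die on malformed UTF-8 instead of warning
use open qw(:std :encoding(UTF-8));    # UTF-8 layers on STDIN/STDOUT/STDERR and new handles
use charnames qw(:full :short);        # allow \N{...} character names

# Demonstration only: one named character is one character long.
my $char = "\N{LATIN SMALL LETTER O WITH CIRCUMFLEX}";
say length $char;
```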

C.

-- 
Colin Campbell
Chief Software Engineer,
PTFS Europe Limited
Content Management and Library Solutions
+44 (0) 800 756 6803 (phone)
+44 (0) 7759 633626  (mobile)
colin.campb...@ptfs-europe.com
skype: colin_campbell2

http://www.ptfs-europe.com


RE: printing UTF-8 encoded MARC records with as_usmarc

2012-08-01 Thread PHILLIPS M.E.
> -Original Message-
> From: Shelley Doljack [mailto:sdolj...@stanford.edu]
> Sent: 31 July 2012 20:18
>
> The problem was I wasn't telling perl to output UTF-8. Now that I added
> binmode(FILE, ':utf8') to my script, the problem is fixed. However, it sounds
> like once I set binmode to UTF-8 everything will be interpreted as such, even
> when the record is in MARC-8. Is that right? So this means that I can only use
> my script with a file of records where all of them are encoded in UTF-8. If I
> want to run the script against a file with all MARC-8 encoding, then I'd need
> to remove the binmode line.

It depends how much manipulation of the records you are doing in the script.  
One approach is to use

binmode(FILE, ':raw');

for both input and output.  Perl will then keep the bytes of the records 
exactly as they are.  You won't be able to test  for exotic characters so 
easily, and amending field content would be inadvisable, but if all you are 
doing is something like reading in the records and filtering out any that have 
no 245 field, or something fairly basic like that, this could be the best 
approach.
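
A minimal sketch of the :raw approach, using only core Perl. The sample
bytes are made up and are not real MARC records; the only MARC-specific
detail is the record terminator byte 0x1D, used here to read one record at a
time. Temporary files stand in for the real input and output.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in data, not real MARC: two byte strings, each ending in 0x1D.
# One holds UTF-8 bytes, the other a lone high byte.
my $input_bytes = "first \xC3\xB4\x1D" . "second \xF4\x1D";

# Write the sample input to a temporary file in raw mode.
my ($in_fh, $in_name) = tempfile(UNLINK => 1);
binmode($in_fh, ':raw');
print {$in_fh} $input_bytes;
close($in_fh) or die "Cannot close $in_name: $!";

# Copy record by record with :raw on both handles.
open(my $src, '<:raw', $in_name) or die "Cannot read $in_name: $!";
my ($dst, $out_name) = tempfile(UNLINK => 1);
binmode($dst, ':raw');
my $count = 0;
{
    local $/ = "\x1D";    # read up to and including each terminator
    while (my $record = <$src>) {
        print {$dst} $record;
        $count++;
    }
}
close($src);
close($dst) or die "Cannot close $out_name: $!";

# Read the copy back and compare it with the original bytes.
open(my $check, '<:raw', $out_name) or die "Cannot read $out_name: $!";
my $output_bytes = do { local $/; <$check> };
close($check);

print "records copied: $count\n";
print "identical: ", ($output_bytes eq $input_bytes ? "yes" : "no"), "\n";
```

Because neither handle has an encoding layer, the bytes pass through
untouched whichever character set the records use.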

The MARC::Record module does not seem to care how the records are encoded.  
It's only once you start altering field content, testing field content, or 
adding fields that the character set being used becomes an issue.  Removing 
fields would be fine too.

MARC-8 can be very complex, particularly if other code tables like CJK are 
invoked, or even just Greek or Cyrillic.  If you were manipulating field 
content in that kind of way then converting everything to UTF-8 would make 
things very much easier.

Matthew

-- 
Matthew Phillips
Electronic Systems Librarian, Durham University
Durham University Library, Stockton Road, Durham, DH1 3LY
+44 (0)191 334 2941




Re: printing UTF-8 encoded MARC records with as_usmarc

2012-07-31 Thread Dr. Saiful Amin
On Wed, Aug 1, 2012 at 12:47 AM, Shelley Doljack wrote:

> The problem was I wasn't telling perl to output UTF-8. Now that I added
> binmode(FILE, ':utf8') to my script, the problem is fixed. However, it
> sounds like once I set binmode to UTF-8 everything will be interpreted as
> such, even when the record is in MARC-8. Is that right? So this means that
> I can only use my script with a file of records where all of them are
> encoded in UTF-8. If I want to run the script against a file with all
> MARC-8 encoding, then I'd need to remove the binmode line.
>

Sometimes it's easier to use the yaz-marcdump utility for MARC-8 to UTF-8
conversion (it's much faster):
yaz-marcdump -f MARC-8 -t UTF-8 -o marc marc21.in >marc21.out

http://www.indexdata.com/yaz/doc/yaz-marcdump.html

Best regards,
Saiful Amin
DRTC, Bangalore


Re: printing UTF-8 encoded MARC records with as_usmarc

2012-07-31 Thread Shelley Doljack
The problem was I wasn't telling perl to output UTF-8. Now that I added 
binmode(FILE, ':utf8') to my script, the problem is fixed. However, it sounds 
like once I set binmode to UTF-8 everything will be interpreted as such, even 
when the record is in MARC-8. Is that right? So this means that I can only use 
my script with a file of records where all of them are encoded in UTF-8. If I 
want to run the script against a file with all MARC-8 encoding, then I'd need 
to remove the binmode line. 

It doesn't seem possible to say: 

if ($record->encoding() eq 'UTF-8') {
    binmode(FILE, ':utf8');
    print FILE $record->as_usmarc();
}
else {
    print FILE $record->as_usmarc();
}

This will result in messing up the diacritics if a file has a mixture of 
records in MARC-8 and UTF-8. Is that correct? 
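
One possible way around switching layers in mid-file is to keep the handle
raw and encode each UTF-8 record to bytes yourself before printing. The
sketch below uses only core Perl and made-up stand-in data. It assumes that
as_usmarc() hands back a decoded character string for UTF-8 records and
plain bytes for MARC-8 records, which the "Wide character" warning suggests
but which is not confirmed here. The hash keys are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical stand-ins for records. 'encoding' plays the role of
# $record->encoding(); 'data' plays the role of $record->as_usmarc().
my @records = (
    { encoding => 'UTF-8',  data => "caf\x{e9} \x{151}" },    # character string
    { encoding => 'MARC-8', data => "cafe\xE2" },             # raw bytes
);

# One raw in-memory handle for the whole output, so there is no layer to switch.
open(my $out, '>', \my $buffer) or die "Cannot open in-memory file: $!";
binmode($out, ':raw');

for my $record (@records) {
    my $bytes = $record->{encoding} eq 'UTF-8'
        ? encode('UTF-8', $record->{data})    # characters to UTF-8 bytes
        : $record->{data};                    # already bytes, left alone
    print {$out} $bytes;
}
close($out);

print "bytes written: ", length($buffer), "\n";
```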

Thanks, 
Shelley 

- Original Message -

> From: "William Dueber" 
> To: "Shelley Doljack" 
> Cc: perl4lib@perl.org
> Sent: Monday, July 30, 2012 5:13:41 PM
> Subject: Re: printing UTF-8 encoded MARC records with as_usmarc

> First off, it's entirely possible that you have bad UTF-8 (perhaps
> rogue MARC-8, perhaps just lousy characters) in your MARC. I know we
> have plenty of that crap.

> You need to tell perl that you'll be outputting UTF-8 using 'binmode':

> binmode(FILE, ':utf8');

> In general, you'll want to do this to basically every file you open
> for reading or writing.

> A great overview of Perl and UTF-8 can be found at:

> http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default

> On Mon, Jul 30, 2012 at 6:51 PM, Shelley Doljack <
> sdolj...@stanford.edu > wrote:

> > Hi,
> 

> > I wrote a script that extracts marc records from a file given
> > certain
> > conditions and puts them in a new file. When my input record is
> > correctly encoded in UTF-8 and I run my script from windows command
> > prompt, this warning message appears: "Wide character in print at
> > record_extraction.pl line 99" (the line in my script where I print
> > to a new file using as_usmarc). I compared the extracted record
> > before and after in MarcEdit and the diacritic was changed. I tried
> > marcdump newfile.mrc to see what happens and I get this error:
> > "utf8
> > \xF4 does not map to Unicode at C:/Perl64/lib/Encode.pm line 176."
> > When I run my extraction script again with MARC-8 encoded data then
> > I don't have the same problem.
> 

> > The basic outline of my script is:
> >
> > my $batch = MARC::Batch->new('USMARC', $input_file);
> >
> > while (my $record = $batch->next()) {
> >  #do some checks
> >  #if checks ok then
> >  print FILE $record->as_usmarc();
> > }
> 

> > Do I need to add something that specifies to interpret the data as
> > UTF-8? Does MARC::Record not handle UTF-8 at all?
> 

> > Thanks,
> 
> > Shelley
> 

> > 
> 
> > Shelley Doljack
> 
> > E-Resources Metadata Librarian
> 
> > Metadata and Library Systems
> 
> > Stanford University Libraries
> 
> > sdolj...@stanford.edu
> 
> > 650-725-0167
> 

> --

> Bill Dueber
> Programmer -- Library Systems
> University of Michigan


RE: printing UTF-8 encoded MARC records with as_usmarc

2012-07-31 Thread PHILLIPS M.E.
I recently came across a nasty issue with MARC::Record to do with output of 
MARC-8 encoded records.  I was converting XML (which was in UTF-8) into MARC 
records using MARC::Record and had initially, and successfully, got good UTF-8 
encoded MARC records out at the end.

However, I then could not load them into our LMS, and realised it was going to 
be easier at the LMS end if the records were presented in MARC-8.  While the 
Perl modules largely worked and I got the right MARC-8 representation out at 
the end, the record length and the field offsets and lengths in the directory 
got in a real mess, because the top-bit-set characters in MARC-8 got counted as 
though they were code-points 0x80 to 0xFF encoded as two bytes of UTF-8.  I 
found a solution by hackily recalculating the lengths when needed, but I 
thought I'd mention it as the thread has touched on this area.
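
The miscount described above comes down to simple arithmetic, sketched here
with core Perl only. The field content is made up, and this is not a claim
about what MARC::Record does internally. It only shows how one byte with the
top bit set turns into two bytes if it is treated as a character in the
range 0x80 to 0xFF and encoded as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical field content: 'a', one byte with the top bit set, 'b'.
my $field = "a\xE1b";

# Counted as raw bytes, which is what a MARC-8 directory entry needs.
my $byte_length = length $field;

# Counted after treating each byte as a character and encoding as UTF-8:
# the high byte becomes two bytes, so the total is one too many.
my $utf8_length = length encode('UTF-8', $field);

print "length as bytes: $byte_length\n";
print "length as UTF-8: $utf8_length\n";
```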

Matthew

-- 
Matthew Phillips
Electronic Systems Librarian, Durham University
Durham University Library, Stockton Road, Durham, DH1 3LY
+44 (0)191 334 2941


> On Mon, Jul 30, 2012 at 6:51 PM, Shelley Doljack
> wrote:
> 
> > Hi,
> >
> > I wrote a script that extracts marc records from a file given certain
> > conditions and puts them in a new file. When my input record is correctly
> > encoded in UTF-8 and I run my script from windows command prompt, this
> > warning message appears: "Wide character in print at
> > record_extraction.pl line 99" (the line in my script where I print to a new
> > file using
> > as_usmarc). I compared the extracted record before and after in MarcEdit
> > and the diacritic was changed. I tried marcdump newfile.mrc to see what
> > happens and I get this error: "utf8 \xF4 does not map to Unicode at
> > C:/Perl64/lib/Encode.pm line 176." When I run my extraction script again
> > with MARC-8 encoded data then I don't have the same problem.
> >
> > The basic outline of my script is:
> >
> > my $batch = MARC::Batch->new('USMARC', $input_file);
> >
> > while (my $record = $batch->next()) {
> >  #do some checks
> >  #if checks ok then
> >  print FILE $record->as_usmarc();
> > }
> >
> > Do I need to add something that specifies to interpret the data as UTF-8?
> > Does MARC::Record not handle UTF-8 at all?
> >
> > Thanks,
> > Shelley
> >
> > 
> > Shelley Doljack
> > E-Resources Metadata Librarian
> > Metadata and Library Systems
> > Stanford University Libraries
> > sdolj...@stanford.edu
> > 650-725-0167
> >



RE: printing UTF-8 encoded MARC records with as_usmarc

2012-07-31 Thread Smith,Devon
I just recently came across this presentation which lays out pretty much all 
the issues with Unicode in perl, and makes some recommendations for best 
practices. You may find some general insight into the whole situation by going 
over it.

http://www.slideshare.net/nickpatch/fundamental-unicode-at-dcbaltimore-perl-workshop-2012

/dev
-- 
Devon Smith
Consulting Software Engineer
OCLC Research
http://www.oclc.org/research/people/smith.htm


-Original Message-
From: William Dueber [mailto:dueb...@umich.edu] 
Sent: Monday, July 30, 2012 8:14 PM
To: Shelley Doljack
Cc: perl4lib@perl.org
Subject: Re: printing UTF-8 encoded MARC records with as_usmarc

First off, it's entirely possible that you have bad UTF-8 (perhaps rogue
MARC-8, perhaps just lousy characters) in your MARC. I know we have plenty
of that crap.

You need to tell perl that you'll be outputting UTF-8 using 'binmode':

  binmode(FILE, ':utf8');

In general, you'll want to do this to basically every file you open for
reading or writing.

A great overview of Perl and UTF-8 can be found at:

http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default





On Mon, Jul 30, 2012 at 6:51 PM, Shelley Doljack wrote:

> Hi,
>
> I wrote a script that extracts marc records from a file given certain
> conditions and puts them in a new file. When my input record is correctly
> encoded in UTF-8 and I run my script from windows command prompt, this
> warning message appears: "Wide character in print at record_extraction.pl
> line 99" (the line in my script where I print to a new file using
> as_usmarc). I compared the extracted record before and after in MarcEdit
> and the diacritic was changed. I tried marcdump newfile.mrc to see what
> happens and I get this error: "utf8 \xF4 does not map to Unicode at
> C:/Perl64/lib/Encode.pm line 176." When I run my extraction script again
> with MARC-8 encoded data then I don't have the same problem.
>
> The basic outline of my script is:
>
> my $batch = MARC::Batch->new('USMARC', $input_file);
>
> while (my $record = $batch->next()) {
>  #do some checks
>  #if checks ok then
>  print FILE $record->as_usmarc();
> }
>
> Do I need to add something that specifies to interpret the data as UTF-8?
> Does MARC::Record not handle UTF-8 at all?
>
> Thanks,
> Shelley
>
> 
> Shelley Doljack
> E-Resources Metadata Librarian
> Metadata and Library Systems
> Stanford University Libraries
> sdolj...@stanford.edu
> 650-725-0167
>



-- 

Bill Dueber
Programmer -- Library Systems
University of Michigan


Re: printing UTF-8 encoded MARC records with as_usmarc

2012-07-30 Thread William Dueber
First off, it's entirely possible that you have bad UTF-8 (perhaps rogue
MARC-8, perhaps just lousy characters) in your MARC. I know we have plenty
of that crap.

You need to tell perl that you'll be outputting UTF-8 using 'binmode':

  binmode(FILE, ':utf8');

In general, you'll want to do this to basically every file you open for
reading or writing.
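
A minimal sketch of what the layer changes, using only core Perl and
in-memory filehandles in place of real files. The sample string is made up
and merely stands in for decoded record data. Perl's documentation also
describes ':encoding(UTF-8)' as a stricter alternative to ':utf8'; the sketch
sticks with ':utf8' as above.

```perl
use strict;
use warnings;

# Hypothetical sample: a character string containing U+0151, standing in
# for decoded record data. It is not taken from a real MARC record.
my $text = "Erd\x{151}s";

# Collect warnings instead of printing them, so they can be counted.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# Without a layer: Perl warns "Wide character in print".
open(my $plain, '>', \my $plain_buf) or die "Cannot open in-memory file: $!";
print {$plain} $text;
close($plain);
my $warned_without_layer = scalar @warnings;

# With the :utf8 layer: the same print is silent.
open(my $layered, '>', \my $layered_buf) or die "Cannot open in-memory file: $!";
binmode($layered, ':utf8');
print {$layered} $text;
close($layered);
my $warned_with_layer = @warnings - $warned_without_layer;

print "warnings without layer: $warned_without_layer\n";
print "warnings with :utf8:    $warned_with_layer\n";
print "bytes written:          ", length($layered_buf), "\n";
```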

A great overview of Perl and UTF-8 can be found at:

http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default





On Mon, Jul 30, 2012 at 6:51 PM, Shelley Doljack wrote:

> Hi,
>
> I wrote a script that extracts marc records from a file given certain
> conditions and puts them in a new file. When my input record is correctly
> encoded in UTF-8 and I run my script from windows command prompt, this
> warning message appears: "Wide character in print at record_extraction.pl
> line 99" (the line in my script where I print to a new file using
> as_usmarc). I compared the extracted record before and after in MarcEdit
> and the diacritic was changed. I tried marcdump newfile.mrc to see what
> happens and I get this error: "utf8 \xF4 does not map to Unicode at
> C:/Perl64/lib/Encode.pm line 176." When I run my extraction script again
> with MARC-8 encoded data then I don't have the same problem.
>
> The basic outline of my script is:
>
> my $batch = MARC::Batch->new('USMARC', $input_file);
>
> while (my $record = $batch->next()) {
>  #do some checks
>  #if checks ok then
>  print FILE $record->as_usmarc();
> }
>
> Do I need to add something that specifies to interpret the data as UTF-8?
> Does MARC::Record not handle UTF-8 at all?
>
> Thanks,
> Shelley
>
> 
> Shelley Doljack
> E-Resources Metadata Librarian
> Metadata and Library Systems
> Stanford University Libraries
> sdolj...@stanford.edu
> 650-725-0167
>



-- 

Bill Dueber
Programmer -- Library Systems
University of Michigan