Re: [OPEN-ILS-GENERAL] Mangled UTF8 characters with imported MARC records in Z39.50

Brent Mills Sat, 03 Dec 2016 11:11:42 -0800

Jason and Mike,

Thanks so much for the help! Glad to know that it’s a remote issue and not 
something set up incorrectly on our side.


-Brent
-----------------------------

Brent Mills
Systems Librarian | Sage Library System

email: [email protected]
tickets: https://sagelib.org/support
phone: 541.610.8384

> On Dec 2, 2016, at 2:30 PM, Mike Rylander <[email protected]> wrote:
> 
> Jason hit on (almost certainly) the answer: bad records from sources that 
> don't restrict cataloging to valid character sets.  I'll add a couple 
> comments below for general clarification, as well...
> 
> On Fri, Dec 2, 2016 at 4:52 PM, Brent Mills <[email protected]> wrote:
> Hello,
> 
> I’ve recently noticed some issues with imported MARC records from a specific 
> set of Z39.50 servers.
> 
> A noticeable number of records imported through Prospector/MaineCat 
> targets have mangled characters when diacritics, symbols, etc. are present in 
> the record.
> 
> Does anyone have some ideas on what could be causing the character encoding 
> problems from these particular targets? Or run into this at their own site?
> 
> - dgo.conf has <charset>marc-8</charset>. Changing that to usmarc or utf8 has 
> had no effect.
> - xml2marc-yaz.cfg is set up as described in 
> https://wiki.evergreen-ils.org/doku.php?id=evergreen-admin:sru_and_z39.50 
> and changing the charset options hasn’t had any effect either.
> 
> The reason these changes have no effect is that those files only describe 
> how Evergreen will serve records to /others/ as a Z39.50 server.  Those are 
> not client settings.
>  
> - the encoding/translation problems do not happen with OCLC and Library of 
> Congress targets, it seems to mainly affect servers with the INNOPAC db type. 
> I’m not sure if that’s related.
> 
> 
> This and the log message below are the smoking guns.  OCLC and LoC are 
> generally very good about making sure records really are in the character set 
> they advertise, and that the character set is either MARC-8 or UTF-8.
> 
> So, Jason nailed it -- there are non-UTF8, non-MARC-8 characters in those 
> records, as served by the INNOPAC sources.  That's a (remote) cataloging 
> issue.
> 
> HTH,
> 
> --Mike
> 
> Going through the logs I can see things like:
> 
> open-ils.search.z3950.search_class: no mapping found for [0x80] at position 
> 56 in Kurt and Joe tangle with the most determined enemy theyâ€™ve ever 
> encountered when a ruthless powerbroker schemes to build a new Egyptian 
> empire as glorious as those of the Pharaohs. Part of his plan rests on the 
> manipulation of a newly discovered aquifer beneath the Sahara, but an even 
> more devastating weapon at his disposal may threaten the entire world: a 
> plant extract known as the black mist, discovered in the City of the Dead and 
> rumored to have the power to take life from the living and restore it to the 
> dead. With the balance of power in Africa and Europe on the verge of tipping, 
> Kurt, Joe, and the rest of the NUMA team will have to fight to discover the 
> truth behind the legendsâ€”but to do that, they have to confront in person 
> the greatest legend of them all: Osiris, the ruler of the Egyptian 
> underworld. g0=ASCII_DEFAULT g1=EXTENDED_LATIN at 
> /usr/share/perl5/MARC/Charset.pm line 308.
> 
> So I’m thinking something is happening in the MARC8 to UTF8 conversion?
> 
> Attaching a screenshot of what it looks like in the Z39.50 Import screen. The 
> 264s have been the most obvious place to see the issue, but it happens in any 
> field with special characters.
> 
> Been banging my head trying to figure out what’s causing this. Any help would 
> be appreciated!
> 
> Thank you,
> 
> -Brent
> 
> <bad264.jpg>
> -----------------------------
> 
> Brent Mills
> Systems Librarian | Sage Library System
> 
> email: [email protected]
> tickets: https://sagelib.org/support
> 
>

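A note on the log line quoted above, offered as one possible reading rather than a confirmed diagnosis: the "no mapping found for [0x80] at position 56" message fits what happens when UTF-8 bytes are handed to a MARC-8 decoder. The sketch below assumes the summary originally contained a typographic apostrophe (U+2019) in "they've", and that the logged position is a zero-based byte offset; neither assumption is confirmed in the thread.

```python
# Opening of the summary as it was presumably typed, with a typographic
# apostrophe (U+2019) in "they've". This reconstruction is an assumption.
intended = "Kurt and Joe tangle with the most determined enemy they\u2019ve ever"

# UTF-8 encodes U+2019 as the three bytes E2 80 99.
raw = intended.encode("utf-8")

# Offset of the first 0x80 byte, counting from zero.
offset_of_0x80 = raw.index(0x80)
print(offset_of_0x80)

# Reading those same UTF-8 bytes as Windows-1252 yields the garbled form
# that appears in the log text ("they" + a-circumflex, euro sign, TM + "ve").
garbled = raw.decode("cp1252")
print(garbled[51:60])

# Reversing that misreading restores the intended text.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == intended)
```

If the offset printed here matches the position in the log, the field most likely holds UTF-8 data in a record that is being decoded as MARC-8. That would agree with the conclusion in the thread that the records are not in the character set they advertise, so the fix belongs with the remote source.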