>> I also used to think it would be cool if we could get MARC8 
>> encoding/decoding into the Python standard library, but then I realized I'd 
>> rather work on other stuff while MARC8 withers and dies.

Wouldn't that be nice.  In MarcEdit, all data wants to be treated as UTF8; 
MARC8 support is there only as a legacy feature.  That is why processing 
MARC8 data in MarcEdit is slightly slower than UTF8: a kind of emulation 
occurs to translate character sets on the fly when needed.
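
In pymarc terms, the same on-the-fly translation might look roughly like 
this -- a sketch only (made-up file name, 2012-era pymarc reader options), 
not what MarcEdit actually does internally:

    from pymarc import MARCReader
    from pymarc.marc8 import marc8_to_unicode

    # Read raw (undecoded) records; when the leader says MARC8, translate
    # field data to Unicode on the fly so everything downstream can be
    # treated as UTF8.
    with open('records.mrc', 'rb') as fh:
        for record in MARCReader(fh, to_unicode=False):
            if record.leader[9] != 'a':                  # not flagged UTF8
                for field in record.get_fields():
                    if field.is_control_field():
                        field.data = marc8_to_unicode(field.data)
                    else:
                        field.subfields = [marc8_to_unicode(s)
                                           for s in field.subfields]
                record.leader = record.leader[:9] + 'a' + record.leader[10:]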

--TR

-----Original Message-----
From: Code for Libraries [mailto:[email protected]] On Behalf Of Gabriel 
Farrell
Sent: Thursday, March 08, 2012 12:19 PM
To: [email protected]
Subject: Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded 
III records

Sounds like what you do, Terry, and what we need in PyMARC, is something like 
UnicodeDammit [0]. Actually handling all of these esoteric encodings would be 
quite the chore, though.
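
For flavor, UnicodeDammit usage looks roughly like this (from memory of 
the BeautifulSoup 3 API, so treat it as a sketch):

    from BeautifulSoup import UnicodeDammit

    guess = UnicodeDammit(raw_bytes)     # raw_bytes: undecoded field data
    print guess.originalEncoding         # its best guess, e.g. 'utf-8'
    text = guess.unicode                 # the data decoded with that guess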

I also used to think it would be cool if we could get MARC8 encoding/decoding 
into the Python standard library, but then I realized I'd rather work on other 
stuff while MARC8 withers and dies.


[0] https://github.com/bdoms/beautifulsoup/blob/master/BeautifulSoup.py#L1753

On Thu, Mar 8, 2012 at 2:36 PM, Reese, Terry <[email protected]> 
wrote:
> This is one of the reasons you really can't trust the information found in 
> position 9, and why, when I wrote MarcEdit, I utilized a mixed process when 
> working with data and determining character set -- a process that reads this 
> byte and takes the information under advisement, but in the end treats it 
> more as a suggestion: one part of a larger heuristic analysis of the record 
> data to determine whether the information is in UTF8 or not.  Fortunately, 
> determining whether a set of data is in UTF8 or something else is a fairly 
> easy process.  Determining the something else is much more difficult, but 
> generally not necessary.
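>
> In Python terms that UTF8 test is essentially "does a strict decode 
> succeed?" -- a minimal sketch (hypothetical helper, not the actual 
> MarcEdit code, which isn't Python):
>
>     def looks_like_utf8(raw):
>         """True if the byte string decodes cleanly as UTF-8 (plain
>         ASCII passes too, which is harmless here)."""
>         try:
>             raw.decode('utf-8')
>             return True
>         except UnicodeDecodeError:
>             return False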
>
> For that reason, if I were advising other people working on MARC processing 
> libraries, I'd advocate having a process for recognizing that certain 
> informational data may not be set correctly, and essentially utilizing a 
> compatibility process to read and correct it.  Unfortunately, while the 
> number of vendors and systems that set this encoding byte correctly has 
> increased dramatically (it used to be pretty much no one), it's still so 
> uneven that I generally consider this information unreliable.
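>
> As a sketch of that compatibility step (hypothetical code building on the 
> looks_like_utf8 helper above, with Python 2 byte-string semantics like the 
> rest of this thread): read leader/09 as a hint, but let the byte-level 
> heuristic win:
>
>     def fix_leader_09(raw):
>         """Treat leader/09 as a suggestion; correct it if the bytes disagree."""
>         flagged_utf8 = raw[9] == 'a'
>         decodes_utf8 = looks_like_utf8(raw)
>         if flagged_utf8 and not decodes_utf8:
>             raw = raw[:9] + ' ' + raw[10:]    # claimed UTF8, but isn't
>         elif not flagged_utf8 and decodes_utf8 and any(c > '\x7f' for c in raw):
>             raw = raw[:9] + 'a' + raw[10:]    # valid non-ASCII UTF8, unflagged
>         return raw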
>
> --TR
>
> -----Original Message-----
> From: Code for Libraries [mailto:[email protected]] On Behalf 
> Of Godmar Back
> Sent: Thursday, March 08, 2012 11:01 AM
> To: [email protected]
> Subject: Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and 
> misencoded III records
>
> On Thu, Mar 8, 2012 at 1:46 PM, Terray, James <[email protected]> wrote:
>
>> Hi Godmar,
>>
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 9:
>> ordinal not in range(128)
>>
>> Having seen my fair share of these kinds of encoding errors in 
>> Python, I can speculate (without seeing the pymarc source code, so 
>> please don't hold me to this) that it's the Python code that's not 
>> set up to handle the UTF-8 strings from your data source. In fact, 
>> the error indicates it's using the default 'ascii' codec rather than 
>> 'utf-8'. If it said "'utf-8' codec can't decode...", then I'd suspect a 
>> problem with the data.
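>>
>> For example, in a Python 2 shell (illustrative; tracebacks trimmed):
>>
>>     >>> '\xe8'.decode('ascii')    # Python's default codec
>>     UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 0: ordinal not in range(128)
>>     >>> '\xe8'.decode('utf-8')    # a genuine data problem looks like this
>>     UnicodeDecodeError: 'utf8' codec can't decode byte 0xe8 in position 0: unexpected end of data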
>>
>> If you were to send the full traceback (all the gobbledy-gook that 
>> Python spews when it encounters an error) and the version of pymarc 
>> you're using to the program's author(s), they may be able to help you out 
>> further.
>>
>>
> My question is less about the Python error, which I understand, than about 
> the MARC record causing the error, and about how others deal with this issue 
> (if it's a common issue, which I don't know).
>
> But, here's the long story from pymarc's perspective.
>
> The record has leader[9] == 'a', but really, truly contains ANSEL-encoded 
> data.  When reading the record with a MARCReader(to_unicode=False) instance, 
> the record reads OK since no decoding is attempted, but attempts at writing 
> the record fail with the above error, since pymarc attempts to UTF8-encode 
> the ANSEL-encoded string, which contains non-ASCII chars such as 0xe8 (the 
> ANSEL umlaut prefix).  It does so because leader[9] == 'a' (see [1]).
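>
> (A minimal reproduction, assuming 2012-era pymarc on Python 2 and made-up 
> field data:)
>
>     from pymarc import Record, Field
>
>     rec = Record()
>     rec.leader = rec.leader[:9] + 'a' + rec.leader[10:]  # claims UTF8
>     # ...but the subfield really holds ANSEL bytes (0xe8 = umlaut prefix)
>     rec.add_field(Field('100', ['1', ' '], ['a', 'M\xe8uller, Heinrich']))
>     rec.as_marc()  # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 ...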
>
> When reading the record with a MARCReader(to_unicode=True) instance, it'll 
> throw an exception during decode_marc when trying to UTF8-decode the 
> ANSEL-encoded string.  Rightly so.
>
> I don't blame pymarc for this behavior; to me, the record looks wrong.
>
>  - Godmar
>
> (PS: That said, what pymarc does also fails in other circumstances -- from 
> what I can see, pymarc shouldn't assume that it's OK to UTF8-encode the 
> field data if leader[9] is 'a'.  For instance, this would double-encode 
> correctly encoded MARC/Unicode records that were read with a 
> MARCReader(to_unicode=False) instance.  But that's a separate issue that is 
> not my immediate concern.  pymarc should probably remember whether a record 
> needs encoding when writing it, rather than consulting the leader[9] field.)
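>
> (One way to sketch that "remember it" idea, as a hypothetical wrapper 
> rather than an actual pymarc patch:)
>
>     class DecodedRecord(object):
>         """A pymarc Record plus a flag recording whether its fields were
>         decoded to unicode at read time; a writer could consult the flag
>         instead of leader[9] when deciding whether to encode."""
>         def __init__(self, record, was_decoded):
>             self.record = record            # the underlying pymarc Record
>             self.was_decoded = was_decoded  # True iff read with to_unicode=True
>
>         def needs_encoding(self):
>             # Encode on write only what was actually decoded on read.
>             return self.was_decoded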
>
>
> [1] 
> https://github.com/mbklein/pymarc/commit/ff312861096ecaa527d210836dbef904c24baee6
