Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records

Jonathan Rochkind Thu, 08 Mar 2012 12:51:33 -0800

Oh, and why do I favor this solution?

Compared to passing input through as is: You're just prolonging the pain; something downstream is still going to have a problem with it, and outputting known illegal data is not a good idea.

Compared to heuristically guessing encoding: Heuristically guessing is okay, but obviously a good deal harder than just replacing bad data with the unicode 'replacement' glyph. But honestly, I don't _want_ this kind of mis-encoded data to be completely transparent -- I want it to do something to make the error visible (without stopping the app or data transformation process in its tracks), so catalogers can't possibly think that the data is just fine. If you use heuristics to guess, sometimes those heuristics will fail -- when they do, the catalogers will think there's something wrong with your logic. "But it works fine for all the other records that you say have the same problem, why can't it work fine for this one?" But this is partially a result of my general conclusions, from experience, about trying to heuristically 'autocorrect' bad marc data -- I try to do it as minimally as possible. It's too easy to get in a long battle with trying to make your heuristics better, instead of focusing on, you know, actually fixing the data.

Now, a place where I'd be willing to use heuristics -- a bulk process to try to actually fix the data in your ILS. Something that goes through all your marc and flags records that aren't legal for the encoding they claim to be. If you want to add heuristics there to try to guess what encoding they really are and automatically fix em, that doesn't seem a terrible idea to me. But working around the problem with heuristics at higher levels does; spend time on actually fixing the bad data instead. Bad marc data, including illegal char encodings, is a continual inconvenience: you work around it in your pymarc-based software, and eventually you'll have some other software in a different language that you have to duplicate your workarounds in.
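
A rough sketch of the "flag records that aren't legal for the encoding they claim" pass, in plain Python 3 with no MARC library. It assumes each record is available as raw bytes and only checks records whose leader position 9 is 'a' (UCS/Unicode); the function names are made up for illustration, and checking MARC-8 claims is left out.

```python
from typing import Iterable, List


def claims_utf8(record: bytes) -> bool:
    """True if leader position 9 is 'a', i.e. the record claims UCS/Unicode."""
    return record[9:10] == b"a"


def flag_suspect_records(records: Iterable[bytes]) -> List[int]:
    """Return indexes of records that claim UTF-8 but are not valid UTF-8.

    Hypothetical helper: records claiming MARC-8 are skipped, not validated.
    """
    suspects = []
    for index, record in enumerate(records):
        if not claims_utf8(record):
            continue
        try:
            record.decode("utf-8")  # strict decode: raises on illegal bytes
        except UnicodeDecodeError:
            suspects.append(index)
    return suspects
```

The output is just a list of positions to hand to whoever fixes the data; nothing is rewritten or guessed at this stage.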


On 3/8/2012 3:45 PM, Jonathan Rochkind wrote:

a) Mis-characterized MARC char encodings are common amongst many of our corpuses and ILS's. It is a common problem. It can be very inconvenient. Not only Marc8 that says it's UTF8 and vice versa, but something that says it's MARC8 or UTF8 but is actually neither.
b) While one solution would be having the marc tool pass the char stream through as is without complaining like Godmar suggested; and another solution would be trying to heuristically guess the 'real' encoding like Gabe suggests; personally I favor a different solution:
The thing that's encoding as unicode on the way out? Instead of raising on an invalid char, it should have the option of silently eating it, replacing it with either empty string or the unicode "replacement character" ( "used to replace an incoming character whose value is unknown or unrepresentable in Unicode" [http://www.fileformat.info/info/unicode/char/fffd/index.htm] )
I have worked with character encoding libraries before that have this option, replace messed up bytes with unicode replacement char. I don't know what's avail in Python though.
Jonathan
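
On the Python side of that question: the built-in decoder takes an `errors` argument, and `"replace"` substitutes U+FFFD for undecodable bytes while `"ignore"` drops them. A minimal sketch in Python 3; the sample bytes and helper names are illustrative only, and this is not something pymarc is confirmed to do.

```python
# ANSEL-style bytes for "Müller": 0xe8 is the umlaut prefix, placed before the "u".
SAMPLE = b"M\xe8uller"


def decode_visibly(data: bytes) -> str:
    """Decode as UTF-8, marking each bad byte with U+FFFD instead of raising."""
    return data.decode("utf-8", errors="replace")


def decode_silently(data: bytes) -> str:
    """Decode as UTF-8, dropping bad bytes (the 'empty string' option)."""
    return data.decode("utf-8", errors="ignore")


visible = decode_visibly(SAMPLE)
silent = decode_silently(SAMPLE)
```

The `"replace"` variant keeps the damage visible to catalogers; `"ignore"` makes the text look clean while quietly losing the diacritic.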

On 3/8/2012 3:19 PM, Gabriel Farrell wrote:
Sounds like what you do, Terry, and what we need in PyMARC, is
something like UnicodeDammit [0]. Actually handling all of these
esoteric encodings would be quite the chore, though.

I also used to think it would be cool if we could get MARC8
encoding/decoding into the Python standard library, but then I
realized I'd rather work on other stuff while MARC8 withers and dies.
[0]https://github.com/bdoms/beautifulsoup/blob/master/BeautifulSoup.py#L1753
On Thu, Mar 8, 2012 at 2:36 PM, Reese, Terry
<[email protected]>  wrote:
This is one of the reasons you really can't trust the information found in position 9. This is one of the reasons why when I wrote MarcEdit, I utilize a mixed process when working with data and determining character set -- a process that reads this byte and takes the information under advisement, but in the end treats it more as a suggestion and one part of a larger heuristic analysis of the record data to determine whether the information is in UTF8 or not. Fortunately, determining if a set of data is in UTF8 or something else is a fairly easy process. Determining the something else is much more difficult, but generally not necessary.
For that reason, if I was advising other people working on MARC processing libraries, I'd advocate having a process for recognizing that certain informational data may not be set correctly, and essentially utilize a compatibility process to read and correct them. Unfortunately, while the number of vendors and systems that set this encoding byte correctly has increased dramatically (it used to be pretty much no one), it's still so uneven that I generally consider this information unreliable.
--TR
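
One way to read "take position 9 under advisement" is sketched below in Python 3. This is an assumption-laden illustration, not MarcEdit's actual algorithm: the function name is invented, the record is assumed to be raw bytes, and the leader byte is consulted only when the data itself cannot settle the question.

```python
def guess_is_utf8(record: bytes) -> bool:
    """Guess whether a raw record is UTF-8, treating leader[9] as a hint only."""
    leader_says_utf8 = record[9:10] == b"a"

    # Pure ASCII is valid under either encoding, so the content decides
    # nothing and the leader byte is the only evidence left.
    if all(byte < 0x80 for byte in record):
        return leader_says_utf8

    # Non-ASCII bytes present: let the data decide and ignore the leader.
    try:
        record.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A run of non-UTF-8 bytes could in principle happen to form valid UTF-8, so a True result here is a strong hint rather than proof; a False result with non-ASCII data is definite.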

-----Original Message-----
From: Code for Libraries [mailto:[email protected]] On Behalf Of Godmar Back
Sent: Thursday, March 08, 2012 11:01 AM
To: [email protected]
Subject: Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records
On Thu, Mar 8, 2012 at 1:46 PM, Terray, James <[email protected]> wrote:
Hi Godmar,
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 9: ordinal not in range(128)

Having seen my fair share of these kinds of encoding errors in Python,
I can speculate (without seeing the pymarc source code, so please
don't hold me to this) that it's the Python code that's not set up to
handle the UTF-8 strings from your data source. In fact, the error
indicates it's using the default 'ascii' codec rather than 'utf-8'. If
it said "'utf-8' codec can't decode...", then I'd suspect a problem with the data.
If you were to send the full traceback (all the gobbledy-gook that
Python spews when it encounters an error) and the version of pymarc
you're using to the program's author(s), they may be able to help you out further.
My question is less about the Python error, which I understand, than about the MARC record causing the error and about how others deal with this issue (if it's a common issue, which I do not know.)
But, here's the long story from pymarc's perspective.
The record has leader[9] == 'a', but really, truly contains ANSEL-encoded data. When reading the record with a MARCReader(to_unicode=False) instance, the record reads ok since no decoding is attempted, but attempts at writing the record fail with the above error since pymarc attempts to utf8-encode the ANSEL-encoded string, which contains non-ascii chars such as 0xe8 (the ANSEL Umlaut prefix). It does so because leader[9] == 'a' (see [1]).
When reading the record with a MARCReader(to_unicode=True) instance, it'll throw an exception during marc_decode when trying to utf8-decode the ANSEL-encoded string. Rightly so.
I don't blame pymarc for this behavior; to me, the record looks wrong.

  - Godmar
(ps: that said, what pymarc does fails in different circumstances - from what I can see, pymarc shouldn't assume that it's ok to utf8-encode the field data if leader[9] is 'a'. For instance, this would double-encode correctly encoded Marc/Unicode records that were read with a MARCReader(to_unicode=False) instance. But that's a separate issue that is not my immediate concern. pymarc should probably remember if a record needs or does not need encoding when writing it, rather than consulting the leader[9] field.)
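
The "remember whether the record needs encoding" idea can be sketched without touching leader[9] at all. The helper below is hypothetical and is not pymarc code; it assumes Python 3, where the bytes/str distinction already records whether a value was decoded on the way in.

```python
from typing import Union


def field_bytes_for_writing(value: Union[bytes, str]) -> bytes:
    """Return the bytes to write for one field value (hypothetical helper)."""
    if isinstance(value, bytes):
        # Read without decoding: pass through untouched, never encode again.
        return value
    # Decoded to text on read: encode exactly once on the way out.
    return value.encode("utf-8")


already_encoded = "M\u00fcller".encode("utf-8")
passed_through = field_bytes_for_writing(already_encoded)
encoded_once = field_bytes_for_writing("M\u00fcller")
```

Both calls yield the same bytes, so a correctly encoded record read without decoding is not encoded a second time, whatever its leader says.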
[1] https://github.com/mbklein/pymarc/commit/ff312861096ecaa527d210836dbef904c24baee6
