[CODE4LIB] ruby-marc, better ruby 1.9 char encoding support, testers wanted

Jonathan Rochkind Thu, 19 Apr 2012 14:57:45 -0700

I have implemented fairly complete and robust support for character encodings in ruby-marc when reading 'binary' marc under ruby 1.9.

It's currently in a git branch, not yet released, and not yet in git master. https://github.com/ruby-marc/ruby-marc/tree/char_encodings

If anyone who uses this (or doesn't) has a chance to beta test it, it would be appreciated. One way to test: check out with git, switch to the 'char_encodings' branch, and `rake install` to install as a gem on your system. These changes should _only_ affect use under ruby 1.9, and only affect reading 'binary' (ISO 2709) marc.

The new functionality is pretty extensively covered by automated tests, but there are some weird and complex interactions that can occur depending on exactly what you're doing, so bugs are possible. It was somewhat more complicated than one might expect to implement a complete solution here, in part because we _do_ have international users who use ruby-marc with encodings that are neither MARC8 nor UTF8, and in fact non-MARC21.

If any of the other committers (or anyone else) wants to code review, you are welcome to.


POSSIBLE BACKWARDS INCOMPATIBILITY

Some previous 0.4.x versions, when running under ruby 1.9 only, would automatically _transcode_ non-unicode encodings to UTF-8 for you under the hood. The new version no longer does so automatically (although you can ask it to). It was not tenable to support that in a backwards-compatible way.
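To make the distinction concrete, here is a minimal plain-Ruby sketch of what "transcoding" means as opposed to just labelling bytes with an encoding. It uses only core String methods and is not ruby-marc's own code; the sample bytes and the choice of ISO-8859-1 are assumptions for illustration.

```ruby
# Bytes as they might come out of a binary file: "cafe" with an e-acute,
# encoded in ISO-8859-1 (0xE9 is the e-acute). Sample data, assumed.
raw = "caf\xE9".dup.force_encoding("BINARY")

# Labelling: tell ruby what encoding the bytes are in. No bytes change.
latin1 = raw.dup.force_encoding("ISO-8859-1")

# Transcoding: actually convert the bytes to another encoding.
utf8 = latin1.encode("UTF-8")

puts latin1.bytes.to_a.inspect  # four bytes, unchanged from the input
puts utf8.bytes.to_a.inspect    # e-acute is now the two bytes 0xC3 0xA9
puts utf8.encoding              # UTF-8
```

Code that relied on the old automatic behavior would need the second step to happen somewhere, either by asking the reader to do it or by doing it by hand along these lines.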

Everything else _ought_ to be backwards compatible with previous 0.4.x ruby-marc under ruby 1.9, while fixing many problems.


NEW FEATURES

All applying to ruby 1.9 only, and to reading binary MARC only.

* Do a pretty good job of setting encodings properly for your ruby environment, especially under standard UTF-8 usage.

* You _can_ and _do have to_ provide an argument for reading non-UTF8 encodings (but sadly there is no support for MARC8).

* You can ask MARC::Reader to transcode to a different encoding when loading marc.

* You can ask MARC::Reader to replace bytes that are illegal in the believed source encoding with a replacement character (or the empty string) to avoid ruby "invalid UTF-8 byte" exceptions later, and sanitize your input.
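As a rough illustration of the last item, the sketch below replaces bytes that are illegal in the believed source encoding, using only core ruby String methods. The helper name `replace_invalid_bytes` and the sample data are hypothetical and not part of ruby-marc; the actual reader options are described in the linked Reader documentation.

```ruby
# Hypothetical helper for illustration only; not ruby-marc's API.
# bytes:           raw bytes as read from a file
# source_encoding: the encoding the bytes are believed to be in
# replacement:     what to put in place of each illegal byte ("" to drop it)
def replace_invalid_bytes(bytes, source_encoding, replacement = "?")
  # Label a copy of the bytes with the believed encoding (no bytes change).
  str = bytes.dup.force_encoding(source_encoding)
  return str if str.valid_encoding?

  rep = replacement.encode(source_encoding)
  # Walk the string character by character; an illegal byte comes back as
  # a "character" that fails valid_encoding?, so swap it out.
  str.chars.map { |c| c.valid_encoding? ? c : rep }.join
end

# 0xE9 on its own is not legal UTF-8 (it is an e-acute in ISO-8859-1).
raw = "caf\xE9 au lait".dup.force_encoding("BINARY")

cleaned = replace_invalid_bytes(raw, "UTF-8")
puts cleaned
puts cleaned.valid_encoding?
```

Without a step like this, the bad byte stays in the string and tends to blow up much later, for example in a regexp match or when writing output, far from where the record was read.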


The new features are documented in inline comments; see:
http://rubydoc.info/github/ruby-marc/ruby-marc/MARC/Reader

I had trouble making the docs concise, sorry. I think I've been pounding my head against this stuff so much, realizing how complicated it ends up being, that I wasn't sure what to leave out.
