I have implemented fairly complete and robust support for character encodings in ruby-marc when reading 'binary' marc under ruby 1.9.

It's currently in a git branch, not yet released, and not yet in git master. https://github.com/ruby-marc/ruby-marc/tree/char_encodings

If anyone who uses this (or doesn't) has a chance to beta test it, it would be appreciated. One way to test: check out the repository with git, switch to the 'char_encodings' branch, and run `rake install` to install it as a gem on your system. These changes should _only_ affect use under ruby 1.9, and only affect reading 'binary' (ISO 2709) marc.

The new functionality is pretty extensively covered by automated tests, but there are some weird and complex interactions that can occur depending on exactly what you're doing, so bugs are possible. It was somewhat more complicated than one might expect to implement a complete solution here, in part because we _do_ have international users of ruby-marc whose records use encodings that are neither MARC8 nor UTF8, and that are in fact not MARC21 at all.

If any of the other committers (or anyone else) wants to code review, you are welcome to.

POSSIBLE BACKWARDS INCOMPAT

Some previous 0.4.x versions, when running under ruby 1.9 only, would automatically _transcode_ non-unicode encodings to UTF-8 for you under the hood. The new version no longer does so automatically (although you can ask it to). It was not tenable to support that backwards compatibly.
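To make the distinction concrete, here is a minimal sketch in plain ruby (no ruby-marc involved) of the difference between merely tagging a string with its encoding and actually transcoding it. The sample bytes and variable names are made up for illustration.

```ruby
# "café" as ISO-8859-1 bytes: the é is the single byte 0xE9.
raw = [0x63, 0x61, 0x66, 0xE9].pack("C*")   # tagged ASCII-8BIT (binary)

# Tagging: tell ruby what the bytes already are. No bytes change.
tagged = raw.dup.force_encoding("ISO-8859-1")

# Transcoding: produce new bytes representing the same characters in UTF-8.
utf8 = tagged.encode("UTF-8")

tagged.bytes.to_a   # => [99, 97, 102, 233]
utf8.bytes.to_a     # => [99, 97, 102, 195, 169]

# Mislabeling the original bytes as UTF-8 does not convert them;
# it just yields a string ruby considers invalid.
raw.dup.force_encoding("UTF-8").valid_encoding?   # => false
```

Transcoding is the second step; it is the part that is no longer done for you unless you ask for it.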

Everything else _ought_ to be backwards compatible with previous 0.4.x ruby-marc under ruby 1.9, while fixing many problems.

NEW FEATURES

All of these apply to ruby 1.9 only, and to reading binary MARC only.

* Do a pretty good job of setting encodings properly for your ruby environment, especially under standard UTF-8 usage.

* You _can_, and _have to_, provide an argument to read non-UTF8 encodings (but sadly there is no support for MARC8).

* You can ask MARC::Reader to transcode to a different encoding when loading marc.

* You can ask MARC::Reader to replace bytes that are illegal in the believed source encoding with a replacement character (or the empty string) to avoid ruby "invalid UTF-8 byte" exceptions later, and sanitize your input.
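As an illustration of what the last item means, here is a small plain-ruby sketch of replacing bytes that are illegal in a string's claimed encoding. The helper `replace_invalid_bytes` is hypothetical and written only for this example; it is not presented as ruby-marc's actual implementation.

```ruby
# Hypothetical helper, not part of ruby-marc: returns a copy of str in which
# every byte that is illegal in str's own encoding is swapped for replacement.
def replace_invalid_bytes(str, replacement = "\uFFFD")
  return str if str.valid_encoding?
  # each_char yields illegal bytes as separate one-byte "characters"
  # that report valid_encoding? == false.
  str.each_char.map { |c| c.valid_encoding? ? c : replacement }.join
end

# "a", then a byte that can never appear in UTF-8, then "b".
bad = [0x61, 0xFF, 0x62].pack("C*").force_encoding("UTF-8")

bad.valid_encoding?                # => false
replace_invalid_bytes(bad)         # => "a\uFFFDb"
replace_invalid_bytes(bad, "")     # => "ab"
```

Without a step like this, the bad byte sits quietly in the string until some later operation (a regexp match, for instance) raises on it.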

The new features are documented in inline comments; see:
http://rubydoc.info/github/ruby-marc/ruby-marc/MARC/Reader
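A rough sketch of how the features listed above might be requested from MARC::Reader follows. The option names (`:external_encoding`, `:internal_encoding`, `:invalid_replace`, `:replace`) are assumptions to check against the docs at the link above. The cp866 encoding, the file name, and the helpers `example_reader_options` and `each_marc_record` are made up for illustration.

```ruby
# Assumed option names; verify them against the MARC::Reader docs.
def example_reader_options
  {
    :external_encoding => "cp866",  # encoding the bytes in the file are believed to be in
    :internal_encoding => "utf-8",  # ask the reader to transcode to this on load
    :invalid_replace   => true,     # replace illegal bytes instead of failing later
    :replace           => ""        # use the empty string as the replacement
  }
end

# Hypothetical helper: yields each record read from the file at path.
def each_marc_record(path, options = example_reader_options)
  require 'marc' # the ruby-marc gem, loaded lazily so this sketch parses without it
  MARC::Reader.new(path, options).each { |record| yield record }
end

# Usage (needs the gem installed and a real file):
#   each_marc_record("records.mrc") { |record| puts record['245'] }
```

For a plain UTF-8 file under a UTF-8 environment, none of these options should be needed.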

I had trouble making the docs concise, sorry. I think I've been pounding my head against this stuff for so long, realizing how complicated it ends up being, that I wasn't sure what to leave out.
