Apologies for the top-post. Look into the NKF library that comes with Ruby 1.8.5 (later patchlevels). It has some awesome mixins into the String class that can help you normalize your strings to UTF-8. It will also set $KCODE appropriately, which is the variable that controls your default string encoding. This is vital. Ruby is also extremely tolerant of malformed UTF-8, something that we can probably thank Tim Bray for.
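A rough sketch of what that looks like, assuming the nkf and kconv libraries are loadable in your install. The Shift_JIS sample bytes are just an illustration I picked, not something from the feed in question:

```ruby
require 'nkf'    # the NKF module itself
require 'kconv'  # String mixins such as String#toutf8

# "日本" encoded as Shift_JIS, built from raw bytes so the encoding of
# this source file does not matter. (Illustrative sample data.)
sjis = [0x93, 0xFA, 0x96, 0x7B].pack('C*')

# Explicit conversion: -S declares the input as Shift_JIS,
# -w asks for UTF-8 output.
utf8 = NKF.nkf('-S -w', sjis)

# The same idea through the String mixin; here NKF has to guess the
# input encoding, so treat the result with some suspicion.
guessed = sjis.toutf8

p utf8.unpack('C*')
p guessed.unpack('C*')
```

Naming the input encoding explicitly is the safer of the two when you know where the bytes came from; the guessing form is handy for input of unknown origin but can guess wrong on short strings.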
Notes on Hpricot that may or may not still be relevant (I'm using it for a rather large project involving UTF-8 encoded HTML, and may be on an old version, 0.5.x):

- Hpricot will preserve formatting that is not XML compliant. Be aware of this; if necessary, normalize ahead of time and use the Hpricot::XML constructor. Libtidy does a decent job of the normalizing.
- Using Hpricot's built-in (non-Ruby) character set support is a good way to get nothing back.
- Passing any arguments to Hpricot's constructor (other than the content) is a good way to get malformed output back.

Really, at this point, if you need something robust and well-tested, LibXML2 is probably a better choice and has a DOM-compliant interface, but I don't believe the Ruby support is that great. Worth a look; if it had been an option when I started this project I'd have been all over it. If you're an API connoisseur, Hpricot is slightly better.

On May 8, 2007, at 6:22 PM, Eric Wilhelm wrote:

> # from Javan Makhmali
> # on Tuesday 08 May 2007 03:22 pm:
>
>> like ‘ in the title
>> and description are being mangled with strange multibyte characters
>
> That would be utf8.
>
>> Does anyone know why this happens
>
> An xml parser such as expat will output utf8 instead of named character
> entities for all characters which are not "<"=&lt; and "&"=&amp;. That
> might be configurable, but it is often dictated by the xml input. I'm
> not sure exactly what is under the hood of ruby's standard rss parser
> but it might well be expat.
>
>> and how I might fix / work around it?
>
> The best way to *properly* deal with it is to treat it as characters and
> not bytes, though that means your database layer, string objects, and
> output layer all need to understand characters to some extent (of
> course, low-byte ascii is a subset of utf8, so you could just flag
> anything loaded from bag-o-bytes storage as characters and generally be
> on your merry way.)
> If you're outputting to a browser, the doctype should be utf8, etc, etc.
>
> The improper way to deal with it is to strip them, though that can be
> difficult to do on the encoded end if all you have is bytes (you
> basically have to implement utf8 yourself :-) Alternatively, you could
> s/&[^;]+;/thbbt/g on the front-end or other similarly hackish
> workarounds.
>
> Have fun.
>
> --Eric
> --
> The opinions expressed in this e-mail were randomly generated by
> the computer and do not necessarily reflect the views of its owner.
> --Management
> ---------------------------------------------------
> http://scratchcomputing.com
> ---------------------------------------------------
> _______________________________________________
> PDXRuby mailing list
> [email protected]
> IRC: #pdx.rb on irc.freenode.net
> http://lists.pdxruby.org/mailman/listinfo/pdxruby

--
Erik Hollensbe
[EMAIL PROTECTED]
