Re: [PDX.rb] unwanted multibyte characters from html entities

Erik Hollensbe Tue, 08 May 2007 22:42:40 -0700

Apologies for the top-post.

Look into the NKF library that comes with ruby 1.8.5 (later
patchlevels). Together with kconv, which is built on top of it, you
get some awesome additions to the String class (toutf8 and friends)
that can help you normalize your strings to utf-8. It will also set
$KCODE appropriately, which is the variable that controls how ruby 1.8
interprets multibyte characters in strings and regexps. This is vital.
Ruby is also extremely tolerant of malformed UTF-8, something that we
can probably thank Tim Bray for.
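
A minimal sketch of that kind of normalization, assuming the input is
known to be Shift_JIS (NKF is a Japanese-oriented filter, so that is the
easiest case to show). The helper name sjis_to_utf8 is made up for
illustration, and the String#encode fallback is an assumption for newer
rubies (1.9 and later) where nkf may not be installed; it is not
something 1.8.5 offers.

```ruby
# NKF ships with ruby 1.8.x; on newer rubies it may be a separate gem,
# so fall back to String#encode (1.9+) if it cannot be loaded.
begin
  require 'nkf'
  HAVE_NKF = true
rescue LoadError
  HAVE_NKF = false
end

# Hypothetical helper: convert Shift_JIS bytes to a UTF-8 string.
def sjis_to_utf8(bytes)
  if HAVE_NKF
    # -S: input is Shift_JIS, -w: output UTF-8, -m0: no MIME decoding
    NKF.nkf('-S -w -m0', bytes)
  else
    # encode(dst, src) reads the bytes as src and transcodes to dst
    bytes.encode('UTF-8', 'Shift_JIS')
  end
end

# The characters U+65E5 U+672C U+8A9E ("nihongo") as raw Shift_JIS bytes
sjis = [0x93, 0xFA, 0x96, 0x7B, 0x8C, 0xEA].pack('C*')
utf8 = sjis_to_utf8(sjis)

# Print the code points of the converted string
puts utf8.unpack('U*').map { |c| 'U+%04X' % c }.join(' ')
```

When the source encoding is not known ahead of time, NKF can also guess
it, but guessing on short strings is less reliable than stating the
input encoding as done here.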


Notes on Hpricot that may or may not still be relevant (I'm using
it for a rather large project involving utf-8 encoded HTML, and may
be using an old version, 0.5.x):

  - Hpricot will preserve formatting that is not XML compliant. Be  
aware of this and attempt to normalize ahead of time if necessary and  
use the Hpricot::XML constructor. Libtidy does a decent job.
  - Using Hpricot's built-in (non-ruby) character set support is a
good way to get nothing back.
  - Passing any arguments to Hpricot's constructor (other than the  
content) is a good way to get malformed output back.

Really, at this point, if you need something robust and well-tested,
LibXML2 is probably a better choice and has a DOM-compliant
interface, but I don't believe the ruby support is that great. It's
worth a look; if it had been an option when I started this project,
I'd have been all over it. If you're an API connoisseur, Hpricot is
slightly better.
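
To make the "strange multibyte characters" from the quoted message
concrete: the entity &lsquo; stands for U+2018, which is three bytes in
UTF-8. The sketch below shows those bytes and, assuming something
downstream decodes them as Windows-1252 (a guess; the thread does not
say which charset is involved), what they would be displayed as. It
relies on String#encode, so it needs ruby 1.9 or later, not 1.8.5.

```ruby
# U+2018 LEFT SINGLE QUOTATION MARK is the character &lsquo; names.
lsquo = [0x2018].pack('U')        # one character, as a UTF-8 string

code_points = lsquo.unpack('U*')  # what a character-aware layer sees
raw_bytes   = lsquo.unpack('C*')  # what a byte-oriented layer sees

# Assumption: something downstream decodes the bytes as Windows-1252.
# encode(dst, src) reinterprets the bytes as src, then transcodes to dst.
misread = lsquo.encode('UTF-8', 'Windows-1252')

puts code_points.map { |c| 'U+%04X' % c }.join(' ')
puts raw_bytes.map { |b| '%02X' % b }.join(' ')
puts misread.unpack('U*').map { |c| 'U+%04X' % c }.join(' ')
```

One character turning into three is consistent with the symptom
described; if that is what is happening, the data is fine and the
mismatch is in how the output is being decoded.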

On May 8, 2007, at 6:22 PM, Eric Wilhelm wrote:

> # from Javan Makhmali
> # on Tuesday 08 May 2007 03:22 pm:
>
>> like &lsquo; in the title
>> and description are being mangled with strange multibyte characters
>
> That would be utf8.
>
>> Does anyone know why this happens
>
> An xml parser such as expat will output utf8 instead of named character
> entities for all characters which are not "<"=&lt; and "&"=&amp;.  That
> might be configurable, but it is often dictated by the xml input.  I'm
> not sure exactly what is under the hood of ruby's standard rss parser
> but it might well be expat.
>
>> and how I might fix / work around it?
>
> The best way to *properly* deal with it is to treat it as characters and
> not bytes, though that means your database layer, string objects, and
> output layer all need to understand characters to some extent (of
> course, low-byte ascii is a subset of utf8, so you could just flag
> anything loaded from bag-o-bytes storage as characters and generally be
> on your merry way.)  If you're outputting to a browser, the declared
> charset should be utf8, etc, etc.
>
> The improper way to deal with it is to strip them, though that can be
> difficult to do on the encoded end if all you have is bytes (you
> basically have to implement utf8 yourself :-)  Alternatively, you could
> s/&[^;]+;/thbbt/g on the front-end or other similarly hackish
> workarounds.
>
> Have fun.
>
> --Eric
> -- 
> The opinions expressed in this e-mail were randomly generated by
> the computer and do not necessarily reflect the views of its owner.
> --Management
> ---------------------------------------------------
>     http://scratchcomputing.com
> ---------------------------------------------------
> _______________________________________________
> PDXRuby mailing list
> [email protected]
> IRC: #pdx.rb on irc.freenode.net
> http://lists.pdxruby.org/mailman/listinfo/pdxruby

--
Erik Hollensbe
[EMAIL PROTECTED]


