Re: [PDX.rb] unwanted multibyte characters from html entities

Eric Wilhelm Tue, 08 May 2007 18:22:43 -0700

# from Javan Makhmali
# on Tuesday 08 May 2007 03:22 pm:

>like &lsquo; in the title  
>and description are being mangled with strange multibyte characters


That would be utf8.

>Does anyone know why this happens

An xml parser such as expat will decode character entities and output 
the resulting characters as utf8 rather than passing the named entities 
through; only "<" (&lt;) and "&" (&amp;) have to stay escaped in xml.  
That might be configurable, but it is often dictated by the xml input.  
I'm not sure exactly what is under the hood of ruby's standard rss 
parser, but it might well be expat.
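
To make the "strange multibyte characters" concrete, here is a rough 
sketch of what &lsquo; turns into once decoded.  It assumes a ruby with 
encoding-aware strings (1.9 or later), and the choice of Windows-1252 as 
the "wrong" encoding is only an illustration, not a claim about what 
your stack is actually doing.

```ruby
# The character that &lsquo; refers to: U+2018 LEFT SINGLE QUOTATION MARK.
lsquo = "\u2018"

# One character, but three bytes once encoded as utf8.
char_count = lsquo.length   # characters
utf8_bytes = lsquo.bytes    # the raw utf8 bytes

# Misreading those same three bytes as a single-byte encoding
# (Windows-1252 is an assumption here) yields three junk characters.
mangled = lsquo.dup.force_encoding("Windows-1252").encode("UTF-8")

puts "#{char_count} character, bytes: #{utf8_bytes.inspect}"
puts "misread as Windows-1252: #{mangled}"
```

If the junk in your titles looks like that last line, the bytes are 
probably fine and something downstream is just mislabeling them.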

>and how I might fix / work around it? 

The best way to *properly* deal with it is to treat it as characters and 
not bytes, though that means your database layer, string objects, and 
output layer all need to understand characters to some extent (of 
course, low-byte ascii is a subset of utf8, so you could just flag 
anything loaded from bag-o-bytes storage as characters and generally be 
on your merry way.)  If you're outputting to a browser, the declared 
charset (Content-Type header or meta tag) should be utf8, etc, etc.
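
A minimal sketch of the "flag it as characters" idea, again assuming 
ruby 1.9 or later.  The helper name as_utf8 is made up for this example, 
and the raw string just stands in for whatever your database layer hands 
back.

```ruby
# Hypothetical helper: tag bytes from bag-o-bytes storage as utf8 text.
# force_encoding only relabels the string; it does not change any bytes.
def as_utf8(raw)
  text = raw.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "not valid utf8" unless text.valid_encoding?
  text
end

# Stand-in for a value read from storage with no encoding attached.
raw = "\xE2\x80\x98Hi\xE2\x80\x99".b

title = as_utf8(raw)

puts raw.length     # counts bytes
puts title.length   # counts characters
```

Plain ascii passes through this unchanged, which is why flagging 
everything you load tends to be safe as long as the stored bytes really 
are utf8.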

The improper way to deal with it is to strip them, though that can be 
difficult to do on the encoded end if all you have is bytes (you 
basically have to implement utf8 yourself :-)  Alternatively, you could 
s/&[^;]+;/thbbt/g on the front-end, or use other similarly hackish 
workarounds.
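
For completeness, the hackish routes might look roughly like this in 
ruby (1.9 or later assumed).  Both method names and the PLACEHOLDER 
constant are invented for this sketch.

```ruby
# Hypothetical replacement text for anything we throw away.
PLACEHOLDER = "?"

# Front-end hack: replace anything that looks like an entity before the
# parser sees it.  Note this also clobbers &amp; and &lt;, and a stray
# "&" followed by a later ";" will swallow everything in between.
def squash_entities(markup)
  markup.gsub(/&[^;]+;/, PLACEHOLDER)
end

# Back-end hack: drop every non-ascii character from decoded utf8 text.
# This only works because the string knows it holds characters.
def strip_non_ascii(text)
  text.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")
end

puts squash_entities("&lsquo;Hi&rsquo;")
puts strip_non_ascii("\u2018Hi\u2019")
```

Either one loses information, so treat them as a last resort.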

Have fun.

--Eric
-- 
The opinions expressed in this e-mail were randomly generated by
the computer and do not necessarily reflect the views of its owner.
--Management
---------------------------------------------------
    http://scratchcomputing.com
---------------------------------------------------
_______________________________________________
PDXRuby mailing list
[email protected]
IRC: #pdx.rb on irc.freenode.net
http://lists.pdxruby.org/mailman/listinfo/pdxruby
