# from Javan Makhmali
# on Tuesday 08 May 2007 03:22 pm:
>like ‘ in the title
>and description are being mangled with strange multibyte characters
That would be utf8.
>Does anyone know why this happens
An xml parser such as expat will output utf8 instead of named character
entities for all characters which are not "&lt;" (<) and "&amp;" (&). That
might be configurable, but it is often dictated by the xml input. I'm
not sure exactly what is under the hood of ruby's standard rss parser
but it might well be expat.
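
To see the effect in isolation, here is a minimal sketch using REXML (chosen only because it ships with Ruby; I'm not claiming it is what the rss library uses in your setup). The feed fragment is made up for illustration. The parser hands back the character itself, encoded as utf8, rather than the entity text:

```ruby
require 'rexml/document'

# Hypothetical feed fragment: the title carries numeric character
# references for curly quotes, plus an escaped ampersand.
xml = '<item><title>&#8216;Hello&#8217; &amp; goodbye</title></item>'

doc   = REXML::Document.new(xml)
title = doc.root.elements['title'].text

# The references are gone; the string now holds the real characters.
puts title
puts title.encoding
# Each curly quote is one character but three bytes in utf8, which is
# where the "strange multibyte characters" come from when something
# downstream treats the string as single-byte text.
puts "#{title.length} characters, #{title.bytesize} bytes"
```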
>and how I might fix / work around it?
The best way to *properly* deal with it is to treat it as characters and
not bytes, though that means your database layer, string objects, and
output layer all need to understand characters to some extent (of
course, low-byte ascii is a subset of utf8, so you could just flag
anything loaded from bag-o-bytes storage as characters and generally be
on your merry way.) If you're outputting to a browser, the declared
charset (Content-Type header or meta tag) should be utf8, etc, etc.
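
A rough sketch of the "flag the bytes as characters" idea, assuming a Ruby with encoding-aware strings (1.9 or later). The byte string below is a stand-in for whatever your storage layer hands back; the variable names are only illustrative:

```ruby
# Bytes as they might come back from storage that does not track
# encodings: a binary (ASCII-8BIT) string holding utf8-encoded text.
raw = "\xE2\x80\x98quoted\xE2\x80\x99".b

# Flag the same bytes as utf8. Nothing is converted, only relabeled.
text = raw.dup.force_encoding(Encoding::UTF_8)

# Relabeling does not check anything, so verify before trusting it.
raise ArgumentError, 'stored bytes are not valid utf8' unless text.valid_encoding?

puts "#{raw.length} bytes as binary"
puts "#{text.length} characters as utf8"
```

Doing this once at the boundary (right after loading) means the rest of the code can work in characters and never has to think about bytes.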
The improper way to deal with it is to strip them, though that can be
difficult to do on the encoded end if all you have is bytes (you
basically have to implement utf8 yourself :-) Alternatively, you could
s/&[^;]+;/thbbt/g on the front-end or other similarly hackish
workarounds.
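
For completeness, the hackish routes might look something like the sketch below. Both method names are made up for illustration. The first applies the substitution above to markup before it is parsed; the second drops anything outside ascii from an already-decoded utf8 string, letting Ruby's regex engine do the character handling so you don't have to walk the bytes yourself:

```ruby
# Replace every entity-looking run in raw markup. Crude: this also eats
# "&amp;", and a bare "&" followed later by ";" swallows the text between.
def strip_entities(markup, replacement = '')
  markup.gsub(/&[^;]+;/, replacement)
end

# Drop every non-ascii character from a valid utf8 string.
def strip_non_ascii(text)
  text.gsub(/[^\x00-\x7F]/, '')
end

puts strip_entities('&#8216;Hi&#8217; there')
puts strip_entities('&#8216;Hi&#8217; there', 'thbbt')
puts strip_non_ascii("\u2018Hi\u2019 there")
```

Either way you lose the quotes entirely, which is why I'd only reach for this if fixing the encoding end to end really isn't an option.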
Have fun.
--Eric
--
The opinions expressed in this e-mail were randomly generated by
the computer and do not necessarily reflect the views of its owner.
--Management
---------------------------------------------------
http://scratchcomputing.com
---------------------------------------------------