> I use the following code to fetch an RSS feed and parse it, but the code
> occasionally has issues with gb2312 or big-5 encoded feeds, and fails
> to parse them. At other times it seems to work fine. Any thoughts?
> Maybe SimpleXMLElement is simply not meant for other language encodings...
I normalize to UTF-8 before passing the feed to SimpleXML, and that
seems to work. For character set conversions I run both
mb_convert_encoding and iconv and compare the results to make sure they
match. There are two exceptions: for gb2312 and euc-kr I use
mb_convert_encoding only, and for windows-1256 and windows-1254 I use
iconv only. [1] shows my code.
HTH,
Darren
[1]:
$s_mb=false;
if($encoding=='gb2312' || $encoding=='euc-kr'){
    //iconv does not cope with certain characters very well, so just use mbstring
    $s_iconv=$s_mb=mb_convert_encoding($s,'UTF-8',$encoding);
}
if($s_mb===false){
    $s_iconv=iconv($encoding,'UTF-8',$s);
    //Handle encodings not supported by the mbstring extension
    if($encoding=='windows-1256' || $encoding=='windows-1254')$s_mb=$s_iconv;
    else $s_mb=mb_convert_encoding($s,'UTF-8',$encoding);
}
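
The snippet stops once $s_iconv and $s_mb are filled in. Here is a rough
sketch of the compare-and-parse step that follows. It is not my exact
code: feed_to_simplexml() is a made-up name for illustration, and the
demo uses a tiny ISO-8859-1 document rather than a real feed. It assumes
the iconv, mbstring and SimpleXML extensions are loaded. One detail
worth noting is that after conversion the XML declaration still names
the old encoding, and the parser trusts that label, so the sketch
rewrites it to UTF-8 before parsing.

```php
<?php
// Hypothetical helper (name and signature are illustrative only).
// Takes the two conversion results and returns a SimpleXMLElement,
// or false if a conversion failed or the two results disagree.
function feed_to_simplexml($s_iconv, $s_mb) {
    if ($s_iconv === false || $s_mb === false || $s_iconv !== $s_mb) {
        return false;
    }
    // The bytes are UTF-8 now, so make the XML declaration say so.
    // If there is no encoding="..." attribute, the string is unchanged.
    $xml = preg_replace(
        '/^(<\?xml[^>]*encoding=["\'])[^"\']+/i',
        '${1}UTF-8',
        $s_mb,
        1
    );
    return simplexml_load_string($xml);
}

// Demo input: "\xE9" is e-acute in ISO-8859-1.
$encoding = 'ISO-8859-1';
$s = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
   . "<rss><title>caf\xE9</title></rss>";

$s_iconv = iconv($encoding, 'UTF-8', $s);
$s_mb = mb_convert_encoding($s, 'UTF-8', $encoding);

$feed = feed_to_simplexml($s_iconv, $s_mb);
// Compare against false explicitly: an element without children can
// evaluate as falsy even though parsing succeeded.
$title = ($feed === false) ? null : (string)$feed->title;
```

Returning false on a mismatch is just one possible policy; logging the
feed URL and falling back to one of the two results would also be
reasonable.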
--
Darren Cook, Software Researcher/Developer
http://dcook.org/gobet/ (Shodan Go Bet - who will win?)
http://dcook.org/work/ (About me and my work)
http://dcook.org/blogs.html (My blogs and articles)
--
PHP Unicode & I18N Mailing List (http://www.php.net/)