> I use the following code to fetch an RSS feed and parse it, but the
> code occasionally has issues with gb2312- or big-5-encoded feeds and
> fails to parse them. At other times it appears to work fine. Any thoughts?
> Maybe SimpleXMLElement is simply not meant for other language encodings...

I normalize the data to UTF-8 before passing it to SimpleXML, and that
seems to work fine.
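
As a rough illustration of that normalization step (a minimal sketch,
not code from this post): the function name parse_feed_as_utf8 is
hypothetical, and $encoding is assumed to be known already, e.g. from
the HTTP header or the XML declaration. The sketch also rewrites the
encoding attribute in the XML declaration, on the assumption that the
parser would otherwise try to convert the already-converted bytes again.

```php
<?php
// Hypothetical helper: convert a feed to UTF-8, then parse it.
// Assumes $encoding names the feed's real source encoding.
function parse_feed_as_utf8($xml, $encoding)
{
    $utf8 = mb_convert_encoding($xml, 'UTF-8', $encoding);

    // Point the XML declaration at UTF-8 so it matches the converted bytes.
    // Assumes the declaration, if present, is at the very start (no BOM).
    $utf8 = preg_replace(
        '/^(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2/i',
        '$1$2UTF-8$2',
        $utf8,
        1
    );

    return simplexml_load_string($utf8);   // false on parse failure
}
```

Called as parse_feed_as_utf8($raw, 'gb2312'), it returns a
SimpleXMLElement, or false if the XML still fails to parse.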

For character set conversions I use both mb_convert_encoding and iconv
and compare the results to make sure they match. However, for gb2312
and euc-kr I use mb_convert_encoding only, and for windows-1256 and
windows-1254 I use iconv only. [1] shows my code.

HTH,

Darren


[1]:
$s_mb=false;
if($encoding=='gb2312' || $encoding=='euc-kr'){
    //iconv is not coping with certain characters very well, so just use mbstring
    $s_iconv=$s_mb=mb_convert_encoding($s,'UTF-8',$encoding);
    }

if($s_mb===false){
    $s_iconv=iconv($encoding,'UTF-8',$s);
    //Handle encodings not supported by the mbstring extension
    if($encoding=='windows-1256' || $encoding=='windows-1254')$s_mb=$s_iconv;
    else $s_mb=mb_convert_encoding($s,'UTF-8',$encoding);
    }
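
The snippet in [1] fills $s_mb and $s_iconv but does not show the
comparison itself. A minimal sketch of that check might look like the
following. The function name to_utf8_checked and the choice to throw
an exception on a mismatch are assumptions for illustration, not part
of the code above.

```php
<?php
// Hypothetical helper: convert with both extensions and insist they agree.
// Only meant for encodings that both mbstring and iconv support.
function to_utf8_checked($s, $encoding)
{
    $s_mb    = mb_convert_encoding($s, 'UTF-8', $encoding);
    $s_iconv = iconv($encoding, 'UTF-8', $s);   // returns false on failure

    // Assumed policy: treat any failure or disagreement as an error.
    if ($s_iconv === false || $s_mb !== $s_iconv) {
        throw new RuntimeException("mbstring and iconv disagree for $encoding");
    }
    return $s_mb;
}
```

For the encodings special-cased in [1] this check would be skipped,
since only one of the two converters is used there.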


-- 
Darren Cook, Software Researcher/Developer

http://dcook.org/gobet/  (Shodan Go Bet - who will win?)
http://dcook.org/work/ (About me and my work)
http://dcook.org/blogs.html (My blogs and articles)

-- 
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
