On 4/04/2010 9:34 AM, Darren Cook wrote:
>> I use the following code to get rss and parse it, but the code
>> occasionally have issues with gb2312 or big-5 encoded feeds, and fails
>> to parse them. However other times may appear just okay. Any thoughts?
>> Maybe SimpleXMLElement is simply not meant for other language encodings...
> I normalize to UTF-8 before giving to SimpleXML, and it seems okay.
>
> For character set conversions I use both mb_convert_encoding and iconv
> and compare to make sure they gave the same result. However for gb2312
> and euc-kr I use mb_convert_encoding only; and for windows-1256 and
> windows-1254 I use iconv only. [1] shows my code.
>

We regularly use Chinese RSS feeds, and convert them to Unicode before
we process or use them.

One problem I've noticed with gb2312 and Big5 feeds is that the web
developers don't necessarily declare their encodings correctly.

Some gb2312 pages are actually GBK, which is a superset of gb2312. And
there are a number of versions of Big5, including at least two different
supersets: Big5-HKSCS and Big5-2003.

Tools like iconv will fail to convert content if they detect characters
outside the Big5 or gb2312 character ranges, i.e. what the conversion
tools identify as invalid character sequences.

Part of the problem is that modern web browsers always treat GB2312 as
GBK, so incorrectly labelled HTML will work as is, but some of the tools
that parse XML are less forgiving of developer errors.

> HTH,
>
> Darren
>
>
> [1]:
> $s_mb=false;
> if($encoding=='gb2312' || $encoding=='euc-kr'){
>   //iconv is not coping with certain characters very well, so just use mbstring
>   $s_iconv=$s_mb=mb_convert_encoding($s,'UTF-8',$encoding);
> }
>
> if($s_mb===false){
>   $s_iconv=iconv($encoding,'UTF-8',$s);
>   //Handle encodings not supported by mb_string extension
>   if($encoding=='windows-1256' || $encoding=='windows-1254')$s_mb=$s_iconv;
>   else $s_mb=mb_convert_encoding($s,'UTF-8',$encoding);
> }
>
>

--
Andrew Cunningham
Senior Project Manager, Research and Development
Vicnet
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000

Ph: +61-3-8664-7430
Fax: +61-3-9639-2175

Email: andr...@vicnet.net.au
Alt email: lang.supp...@gmail.com

http://home.vicnet.net.au/~andrewc/
http://www.openroad.net.au
http://www.vicnet.net.au
http://www.slv.vic.gov.au
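
A minimal sketch of the fallback idea described above: try the charset the
feed declares, and if iconv rejects the input, retry with a superset. The
function names widen_charset() and feed_to_utf8() are hypothetical, and the
charset names 'GBK' and 'BIG5-HKSCS' are assumed to be known to the local
iconv build, which varies between implementations.

```php
<?php
// Hypothetical helper: map a declared charset label to a superset that is
// more likely to match what the feed really contains.
function widen_charset($declared)
{
    $supersets = array(
        'gb2312' => 'GBK',        // some "gb2312" pages are really GBK
        'big5'   => 'BIG5-HKSCS', // one of the Big5 supersets (assumed choice)
    );
    $key = strtolower(trim($declared));
    return isset($supersets[$key]) ? $supersets[$key] : $declared;
}

// Hypothetical helper: try the declared charset first, then the wider one.
// Returns the UTF-8 string, or false if every attempt fails.
function feed_to_utf8($s, $declared)
{
    $candidates = array_unique(array($declared, widen_charset($declared)));
    foreach ($candidates as $charset) {
        // @ suppresses the notice/warning iconv raises on input it rejects
        $out = @iconv($charset, 'UTF-8', $s);
        if ($out !== false) {
            return $out;
        }
    }
    return false;
}
```

Called as feed_to_utf8($raw, 'gb2312'), this returns false rather than a
partial string when neither charset fits, so the caller can skip or log the
feed. The XML declaration in the converted text still names the old charset,
so it would probably need rewriting to UTF-8 before the string goes to
SimpleXML.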
--
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php