On 4/04/2010 9:34 AM, Darren Cook wrote:
>> I use the following code to fetch and parse RSS, but it occasionally
>> has issues with gb2312- or big-5-encoded feeds and fails to parse
>> them, while at other times they parse fine. Any thoughts? Maybe
>> SimpleXMLElement is simply not meant for other language encodings...



> I normalize to UTF-8 before giving it to SimpleXML, and it seems okay.
> 
> For character set conversions I use both mb_convert_encoding and iconv
> and compare to make sure they gave the same result. However for gb2312
> and euc-kr I use mb_convert_encoding only; and for windows-1256 and
> windows-1254 I use iconv only. [1] shows my code.
> 

We regularly use Chinese RSS feeds, and convert them to Unicode before
we process or use them. One problem I've noticed with gb2312 and Big5
feeds is that the web developers don't necessarily declare their
encodings correctly.

Some pages labelled gb2312 are actually GBK, which is a superset of
GB2312.

There are also a number of versions of Big5, including at least two
different supersets: Big5-HKSCS and Big5-2003.

Tools like iconv will fail to convert content if they detect characters
outside the Big5 or gb2312 character ranges, because the conversion
tools treat those bytes as invalid character sequences.
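
A minimal sketch of one way to work around this: try the declared
encoding first, and only if that fails retry with the superset. The
helper names superset_for() and to_utf8_with_fallback() are made up for
illustration, and the sketch assumes the local iconv build knows GBK and
BIG5-HKSCS (glibc does; other builds may not).

```php
<?php
// Hypothetical helper: map a declared charset label to a superset that
// mislabelled content is likely to be in. The two entries are the cases
// discussed above; extend as needed.
function superset_for(string $declared): string
{
    $map = [
        'gb2312' => 'GBK',
        'big5'   => 'BIG5-HKSCS',
    ];
    $key = strtolower(trim($declared));
    return $map[$key] ?? $declared;
}

// Hypothetical helper: convert $s to UTF-8, trying the declared encoding
// first and then its superset. Returns the UTF-8 string, or false if
// every attempt fails.
function to_utf8_with_fallback(string $s, string $declared)
{
    $candidates = array_unique([$declared, superset_for($declared)]);
    foreach ($candidates as $enc) {
        // iconv() returns false (and raises a notice or warning) when it
        // meets an invalid sequence or an unknown charset name.
        $out = @iconv($enc, 'UTF-8', $s);
        if ($out !== false) {
            return $out;
        }
    }
    return false;
}
```

Trying the declared encoding first keeps correctly labelled feeds on
their normal path; the superset is only consulted when the strict
conversion has already failed.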

Part of the problem is that modern web browsers always treat GB2312 as
GBK, so incorrectly labelled HTML will work as is, but some of the tools
that parse XML are less forgiving of developer errors.
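
A rough sketch of normalizing a feed to UTF-8 before handing it to
SimpleXML. All three function names are hypothetical, and treating a
gb2312 label as GBK is an assumption borrowed from the browser behaviour
described above, not something SimpleXML does itself. The XML
declaration is rewritten so that it agrees with the converted bytes.

```php
<?php
// Hypothetical helper: return the encoding named in the XML declaration,
// or null if the document does not declare one.
function declared_xml_encoding(string $xml): ?string
{
    $re = '/^\s*<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']/';
    return preg_match($re, $xml, $m) ? $m[1] : null;
}

// Hypothetical helper: change the declared encoding to UTF-8.
function relabel_as_utf8(string $xml): string
{
    $re = '/^(\s*<\?xml[^>]*\bencoding\s*=\s*["\'])[A-Za-z0-9._-]+(["\'])/';
    return preg_replace($re, '${1}UTF-8${2}', $xml, 1) ?? $xml;
}

// Hypothetical helper: convert the raw feed to UTF-8 and parse it.
// Returns a SimpleXMLElement, or false if conversion or parsing fails.
function load_feed_as_utf8(string $raw)
{
    // No declaration means the XML default of UTF-8 is assumed here.
    $enc = declared_xml_encoding($raw) ?? 'UTF-8';
    if (strtolower($enc) === 'gb2312') {
        $enc = 'GBK'; // assumption: decode gb2312 the way browsers do
    }
    $utf8 = @iconv($enc, 'UTF-8', $raw);
    if ($utf8 === false) {
        return false;
    }
    return simplexml_load_string(relabel_as_utf8($utf8));
}
```

For example, load_feed_as_utf8(file_get_contents($url)) would return a
SimpleXMLElement whose text nodes are UTF-8, or false if the bytes could
not be converted.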

> HTH,
> 
> Darren
> 
> 
> [1]:
> $s_mb=false;
> if($encoding=='gb2312' || $encoding=='euc-kr'){
>     //iconv does not cope well with certain characters, so just use mbstring
>     $s_iconv=$s_mb=mb_convert_encoding($s,'UTF-8',$encoding);
>     }
> 
> if($s_mb===false){
>     $s_iconv=iconv($encoding,'UTF-8',$s);
>     //Handle encodings not supported by the mbstring extension
>     if($encoding=='windows-1256' || $encoding=='windows-1254')$s_mb=$s_iconv;
>     else $s_mb=mb_convert_encoding($s,'UTF-8',$encoding);
>     }
> 
> 

-- 
Andrew Cunningham
Senior Project Manager, Research and Development
Vicnet
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000

Ph: +61-3-8664-7430
Fax: +61-3-9639-2175

Email: andr...@vicnet.net.au
Alt email: lang.supp...@gmail.com

http://home.vicnet.net.au/~andrewc/
http://www.openroad.net.au
http://www.vicnet.net.au
http://www.slv.vic.gov.au

-- 
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
