On Fri, 13 Apr 2012 11:56:39 +0200, Kang-Hao (Kenny) Lu <[email protected]> wrote:

(12/04/12 17:09), Yuan Chao wrote:

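There have always been two questions here that are closely related, but not directly dependent on each other:

1. How should a Taiwanese-locale browser (zh-TW) handle <meta charset="big5">?

A. Keep the status quo (CP950?)

B. Decode as 'big5-uao' (Firefox)

C. Decode as 'big5-hkscs'

... among other options


2. Which decoding mapping lets Taiwanese users see the most correct content?


In any case, I think question 2 is a fairly scientific, archaeological question, and I think it would be good for question 1 to use the answer to question 2. For example, I think <meta charset="big5"> should at least decode the intersection of 'big5-uao' and 'big5-hkscs', which at least includes hiragana and katakana.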
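
A minimal sketch of what "the intersection" of two decoding tables could mean, assuming each table is represented as a dict from a two-byte code to a character. The function name `common_mapping` and the table entries below are made-up placeholders for illustration, not real 'big5-uao' or 'big5-hkscs' data.

```python
from typing import Dict

Mapping = Dict[int, str]


def common_mapping(a: Mapping, b: Mapping) -> Mapping:
    """Return the entries on which both tables agree (same code, same character)."""
    return {code: ch for code, ch in a.items() if b.get(code) == ch}


# Placeholder tables: the codes are NOT real Big5 code points.
toy_uao: Mapping = {0x0001: "あ", 0x0002: "ア", 0x0003: "X"}
toy_hkscs: Mapping = {0x0001: "あ", 0x0002: "ア", 0x0003: "Y", 0x0004: "Z"}

shared = common_mapping(toy_uao, toy_hkscs)
```

Only codes that both tables map to the same character survive, so a decoder built from `shared` never contradicts either table.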

Thanks Kenny, you summarized it well. I also think question 2 is the key one, so I did another round of research...

In English, since the methods used will also be of interest to Anne van Kesteren and possibly others.

My goal was to find a large, representative sample of Big5 usage in Taiwan. Alexa's top million sites [1] lists 2951 .tw sites. Running "site:example.com.tw" searches for each of those through the Bing API [2] generated a list of ~120k URLs.[3] ~116k of those were successfully fetched using a Python script.[4] Another script [5] identified ~38k of them as labeled Big5 and decoded them using the spec algorithm to collect statistics. A final script [6] kept the ~36k pages with low error rates, to exclude misencoded pages; that is as close to a random sample of Taiwanese Big5 pages as I can get.
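
The low-error-rate filter can be sketched roughly as follows. This is not the code in [6]; it assumes Python's built-in `big5hkscs` codec as a stand-in for the spec algorithm, counts U+FFFD replacement characters as errors, and uses a hypothetical threshold of 1%.

```python
from typing import Dict

REPLACEMENT = "\ufffd"


def error_rate(raw: bytes, encoding: str = "big5hkscs") -> float:
    """Fraction of decoded characters that are U+FFFD replacement characters."""
    text = raw.decode(encoding, errors="replace")
    if not text:
        return 0.0
    return text.count(REPLACEMENT) / len(text)


def low_error_pages(pages: Dict[str, bytes], threshold: float = 0.01) -> Dict[str, bytes]:
    """Keep only pages whose error rate is at or below the (assumed) threshold."""
    return {url: raw for url, raw in pages.items() if error_rate(raw) <= threshold}
```

Pages that are labeled Big5 but are really in some other encoding tend to produce many replacement characters, so they fall above the threshold and are dropped.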

The same script identified the pages that would yield different results with the spec mapping (~HKSCS) and the Firefox mapping (~UAO), finding 294 such pages. Manually removing obviously misencoded nonsense left 190, which will need more analysis.[7] My initial impression is that a lot of these pages are likely to be garbage, but there are some which are obviously Big5-UAO...
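
Finding pages that decode differently under two mappings can be sketched like this. Again this is not the code in [6]: Python ships no Big5-UAO codec, so the decoders are passed in as plain callables, and the `big5` / `big5hkscs` pair in the usage lines is only a stand-in for the two mappings compared above. The names `codec_decoder` and `pages_that_differ` are hypothetical.

```python
from typing import Callable, Dict, List

Decoder = Callable[[bytes], str]


def codec_decoder(name: str) -> Decoder:
    """Wrap a Python codec name as a decoder that replaces undecodable bytes."""
    return lambda raw: raw.decode(name, errors="replace")


def pages_that_differ(pages: Dict[str, bytes], a: Decoder, b: Decoder) -> List[str]:
    """Return the sorted URLs whose bytes decode to different text under a and b."""
    return sorted(url for url, raw in pages.items() if a(raw) != b(raw))


# Usage with stand-in codecs (assumption: not the spec or Firefox tables).
sample = {"http://example.com.tw/": b"plain ASCII page"}
differing = pages_that_differ(sample, codec_decoder("big5"), codec_decoder("big5hkscs"))
```

A page that uses only bytes on which both mappings agree never shows up in the result, which is why only a few hundred of the ~36k pages needed a manual look.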

[1] http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
[2] https://gitorious.org/whatwg/big5/blobs/master/tw-urls.py
[3] https://gitorious.org/whatwg/big5/blobs/master/tw-urls.txt
[4] https://gitorious.org/whatwg/big5/blobs/master/get-urls.py
[5] https://gitorious.org/whatwg/big5/blobs/master/tw-json.py
[6] https://gitorious.org/whatwg/big5/blobs/master/tw-analyze.py
[7] https://gitorious.org/whatwg/big5/blobs/master/big5-hkscs-vs-uao.txt

--
Philip Jägenstedt
Core Developer
Opera Software
