On Fri, 13 Apr 2012 11:56:39 +0200, Kang-Hao (Kenny) Lu
<[email protected]> wrote:
(12/04/12 17:09), Yuan Chao wrote:
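There have always been two questions here that are closely related but
not the same:
1. How should a Taiwanese-locale browser (zh-TW) actually handle
<meta charset="big5">?
A. Keep the status quo (CP950?)
B. Decode using 'big5-uao' (Firefox)
C. Decode using 'big5-hkscs'
... among other options
2. Which decoding mapping lets Taiwanese users see the most content
correctly?
Either way, I think question 2 is a fairly scientific archaeological
question, and I think using the answer to question 2 as the answer to
question 1 should work well. For example, I think <meta charset="big5">
should at least decode the intersection of 'big5-uao' and 'big5-hkscs',
which at least includes hiragana and katakana.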
Thanks Kenny, you summarized it well. I also think question 2 is the most
important one, so I did another round of research...
In English, since the methods used will also be of interest to Anne van
Kesteren and possibly others.
My goal was to find a big and representative sample of Big5 usage in
Taiwan. Alexa's top million sites [1] lists 2951 .tw sites. Running
"site:example.com.tw" searches for each of those through the Bing API [2]
generated a list of ~120k URLs.[3] ~116k of those were successfully
fetched using a Python script.[4] Another script [5] identified ~38k of
them labeled as Big5 and decoded them using the spec algorithm to collect
statistics. A final script [6] kept the ~36k pages with low error rates,
excluding likely misencodings, which is as close to a random sample of
Taiwanese Big5 pages as I can get.
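For readers who want to try the filtering step themselves, here is a
minimal sketch of the idea, not the actual tw-json.py or tw-analyze.py.
It assumes Python's built-in 'big5hkscs' codec as a stand-in for the spec
algorithm, and the names is_labeled_big5, error_rate, low_error_pages and
the MAX_ERROR_RATE cutoff are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical cutoff; the threshold actually used is not stated in this mail.
MAX_ERROR_RATE = 0.01

# Rough check for a Big5 label in a <meta> element. The lookahead rejects
# longer labels such as "big5-hkscs".
META_BIG5 = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?big5(?![-\w])", re.IGNORECASE
)


def is_labeled_big5(raw):
    """Return True if the page bytes carry a <meta> charset label of big5."""
    return META_BIG5.search(raw) is not None


def error_rate(raw, codec="big5hkscs"):
    """Fraction of decoded characters that are U+FFFD replacements.

    Assumption: Python's 'big5hkscs' codec only approximates the spec
    mapping, so the numbers will not match the original statistics.
    """
    text = raw.decode(codec, errors="replace")
    if not text:
        return 0.0
    return text.count("\ufffd") / len(text)


def low_error_pages(pages):
    """Keep URLs of pages labeled Big5 whose error rate is under the cutoff.

    `pages` maps URL -> raw response bytes.
    """
    return [
        url
        for url, raw in pages.items()
        if is_labeled_big5(raw) and error_rate(raw) <= MAX_ERROR_RATE
    ]
```

Counting replacement characters is only one possible error measure; pages
that decode cleanly but into nonsense would still pass this check.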
The same script identified the pages that would yield different results
with the spec mapping (~HKSCS) and the Firefox mapping (~UAO), finding 294
such pages. Manually removing obviously misencoded nonsense left 190, which
will need more analysis.[7] My initial impression is that a lot of these
pages are likely to be garbage, but there are some which are obviously
Big5-UAO...
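The comparison of two mappings can be sketched roughly as follows. This
is an illustration under stated assumptions, not the script in [6]:
Python ships no Big5-UAO codec, so 'cp950' is used here purely as a
placeholder for the second mapping, and the function names are
hypothetical.

```python
def decodings_differ(raw, codec_a="big5hkscs", codec_b="cp950"):
    """Return True if the two codecs decode `raw` to different text.

    Assumption: 'big5hkscs' approximates the spec mapping and 'cp950' is
    only a placeholder; a real comparison would need a Big5-UAO table.
    """
    text_a = raw.decode(codec_a, errors="replace")
    text_b = raw.decode(codec_b, errors="replace")
    return text_a != text_b


def pages_needing_review(pages, codec_a="big5hkscs", codec_b="cp950"):
    """Return the sorted URLs whose bytes decode differently.

    `pages` maps URL -> raw response bytes.
    """
    return sorted(
        url
        for url, raw in pages.items()
        if decodings_differ(raw, codec_a, codec_b)
    )
```

Pages made up only of ASCII and common Big5 characters decode identically
under both codecs and drop out, leaving the small set that needs manual
inspection.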
[1] http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
[2] https://gitorious.org/whatwg/big5/blobs/master/tw-urls.py
[3] https://gitorious.org/whatwg/big5/blobs/master/tw-urls.txt
[4] https://gitorious.org/whatwg/big5/blobs/master/get-urls.py
[5] https://gitorious.org/whatwg/big5/blobs/master/tw-json.py
[6] https://gitorious.org/whatwg/big5/blobs/master/tw-analyze.py
[7] https://gitorious.org/whatwg/big5/blobs/master/big5-hkscs-vs-uao.txt
--
Philip Jägenstedt
Core Developer
Opera Software