Hi.

Having removed the BOMs, I did a quick analysis of the various
encodings declared in the XML files.

57,020 files examined in PHPDOC and PEARDOC - all languages.

No files with BOM

ISO-8859-1 : 20,761 files
ISO-8859-2 : 3,185 files
ISO-8859-7 : 428 files
ISO-8859-8 : 2 files
WINDOWS-1255 : 194 files
BIG5 : 83 files
GB2312 : 887 files

That's 25,540 files which are not marked as encoded with UTF-8.
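
For anyone wanting to reproduce the tally, here is a minimal sketch in Python. It is not the script I actually ran; the names declared_encoding and tally_encodings are made up for illustration. It assumes the encoding is declared in an XML declaration at the very start of each file, and it treats a missing declaration as UTF-8, which is the XML default when there is no BOM.

```python
import re
from collections import Counter

# Matches the encoding pseudo-attribute of an XML declaration at the start
# of the file. Assumption: BOMs are already stripped, so "<?xml" comes first.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']'
)


def declared_encoding(data: bytes, default: str = "UTF-8") -> str:
    """Return the declared encoding in upper case, or `default` if none."""
    match = _XML_DECL.match(data)
    if match is None:
        return default
    return match.group(1).decode("ascii").upper()


def tally_encodings(blobs) -> Counter:
    """Count declared encodings over an iterable of raw file contents."""
    return Counter(declared_encoding(blob) for blob in blobs)
```

Feeding it the raw bytes of every XML file under the checkout should give per-encoding counts comparable to the list above.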


What should the encoding be and what is the impact of it NOT being UTF-8?



The next analysis I did was to see how many files' content was only
[\x00-\x7f]. This shows the files which would need no content changes
to be valid UTF-8.

ISO-8859-1 : 11,509 files
ISO-8859-2 : 228 files
ISO-8859-7 : 1 file
ISO-8859-8 : 1 file
WINDOWS-1255 : 27 files
BIG5 : 18 files
GB2312 : 5 files

That's 11,789 files which can safely be retagged as UTF-8 without any
problems, as their content is ASCII only and ASCII bytes are identical
in UTF-8.
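
As a rough sketch of what the retagging would involve (again in Python, with hypothetical helper names, and not the actual commit script): check that the file has no byte outside [\x00-\x7f], and only then rewrite the encoding in the XML declaration. Files with any high byte are returned untouched, since those need a real conversion rather than a relabel.

```python
import re

# Any byte outside the 7-bit ASCII range.
_NON_ASCII = re.compile(rb'[^\x00-\x7f]')

# Captures the text around the encoding name in a leading XML declaration,
# so that only the name itself is replaced.
_ENCODING = re.compile(
    rb'(^<\?xml[^>]*?encoding\s*=\s*["\'])[A-Za-z0-9._-]+(["\'])'
)


def is_ascii_only(data: bytes) -> bool:
    """True if every byte is in the range 0x00-0x7f."""
    return _NON_ASCII.search(data) is None


def retag_as_utf8(data: bytes) -> bytes:
    """Relabel an ASCII-only file as UTF-8; leave any other file unchanged."""
    if not is_ascii_only(data):
        return data
    return _ENCODING.sub(rb'\g<1>UTF-8\g<2>', data, count=1)
```

Because only the declaration changes, the resulting diff for each of these files should be a single line.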

That leaves 13,751 files which contain non-ASCII content and are not
marked as UTF-8.


Shall I commit the ASCII -> UTF-8 change?

(Running for cover ... )



-- 
-----
Richard Quadling
"Standing on the shoulders of some very clever giants!"
EE : http://www.experts-exchange.com/M_248814.html
Zend Certified Engineer : http://zend.com/zce.php?c=ZEND002498&r=213474731
ZOPA : http://uk.zopa.com/member/RQuadling
