Re: [PHP-DOC] HTML entities to UTF-8.

Richard Quadling Fri, 05 Aug 2011 08:12:59 -0700

On 5 August 2011 15:37, Ferenc Kovacs <[email protected]> wrote:
> On Fri, Aug 5, 2011 at 2:10 PM, Richard Quadling <[email protected]> wrote:
>> On 5 August 2011 13:04, Ferenc Kovacs <[email protected]> wrote:
>>> On Fri, Aug 5, 2011 at 1:43 PM, Richard Quadling <[email protected]> 
>>> wrote:
>>>> Hello all.
>>>>
>>>> During the last week, I've been converting the HTML Entities in phpdoc
>>>> to their Unicode counterparts, in connection with
>>>> http://news.php.net/php.doc.cvs/8536
>>>>
>>>> "Remove html entities (the english translation no longer uses any.. if
>>>> this breaks translations then they should folow the english one, or if
>>>> to much work, we can revert this commit)"
>>>>
>>>>
>>>> In examining the translations, there are a significant number of files
>>>> NOT encoded using UTF-8.
>>>>
>>>> As such, embedding a UTF-8 character in these files will produce garbage.
>>>>
>>>> As an English-only speaker, I am not confident that my conversion from
>>>> ISO encoding to UTF-8 encoding is accurate, and I have no
>>>> realistic way to check.
>>>>
>>>> So, here is a list of all the files requiring someone with the
>>>> language skills to look at them and manually convert them.
>>>>
>>>>
>>>> If someone has a routine that can convert ISO encoded XML to UTF-8
>>>> accurately, then I can apply that and then process the entities.
>>>>
>>>>
>>>> cs/bookinfo.xml
>>>> cs/faq/generanl.xml
>>>> cs/reference/strings/functions/get-html-translation-table.xml
>>>>
>>>> hk/variables.xml
>>>>
>>>> hu/bookinfo.xml
>>>> hu/language/control-structures.xml
>>>> hu/reference/image/functions/imagearc.xml
>>>> hu/reference/mbstring/functions/mb-strtoupper.xml
>>>> hu/reference/recode/functions/recode-string.xml
>>>
>>> hi Richard
>>>
>>> I will fix it for the hungarian files.
>>>
>>>
>>> --
>>> Ferenc Kovács
>>> @Tyr43l - http://tyrael.hu
>>>
>>
>> If you could detail what you do in terms of re-encoding, then I'm
>> quite happy to rely on that process for the other files.
>>
>> At some stage, converting all the encoded files to UTF-8 would be a
>> nice step, but a significant one. If/when that is
>> undertaken, I'd suggest adding a pre-commit hook to reject non-UTF-8
>> encoded XML files from phpdoc.
>>
>> --
>> Richard Quadling
>> Twitter : EE : Zend : PHPDoc
>> @RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY : bit.ly/lFnVea
>>
>
> Hi Richard,
>
> first of all, you have to figure out what encoding the files are
> using, then re-encode the non-utf8 files from that encoding to utf-8.
>
> for getting a list of the non utf-8 files, I've used something like this:
> find ./hu/trunk/ -type f|grep -v '\.svn'|xargs file -i|grep -v 'charset=utf-8'
>
> as I guessed, most of the hungarian files are encoded with iso-8859-2,
> some of them were us-ascii, but I removed those files as they were just
> copied from the en without any modification.
> find ./hu/trunk/ -type f|grep -v '\.svn'|xargs file -i|grep
> 'us-ascii'|cut -f1 -d ':'|xargs svn rm
>
> with the us-ascii files removed, I was left with iso-8859 files
> only (iso-8859-2 to be precise; the file tool can't tell you
> that, but you can tell from the language and the xml encoding
> attribute).
>
> so I converted all of those files to utf-8:
> find ./hu/trunk/ -type f|grep -v '\.svn'|xargs file -i|grep
> 'charset=iso-8859-1'|cut -f1 -d ':'|xargs recode iso-8859-2..utf-8
>
> and replaced the iso-8859-2 occurrences in the files:
> find ./hu/trunk/ -type f|grep -v '\.svn'|xargs sed -i
> -e "s/\(encoding=[\'\"]\)iso-8859-2/\1utf-8/gI"
>
> it should be noted that there are some documentation files where this
> expression would match and replace unintended stuff, like in
> reference/xsl/examples.xml, so a better approach would be to parse the
> files as xml documents, and change the encoding attribute.
>
> --
> Ferenc Kovács
> @Tyr43l - http://tyrael.hu
>


If I tell you I'm on Windows ...


For any XML file that contains only ASCII bytes (0x00-0x7F), I've already
changed the XML encoding declaration to UTF-8, as there is no difference
in the byte values.
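
As a rough illustration of that check, here is a minimal sketch in Python
(used instead of PHP purely for brevity). The function name is_ascii_only
is made up for this example and is not part of phpdoc or any tool mentioned
in this thread.

```python
def is_ascii_only(data: bytes) -> bool:
    """Return True if every byte is in the 7-bit ASCII range (0x00-0x7F)."""
    return all(b < 0x80 for b in data)


sample = b'<?xml version="1.0" encoding="iso-8859-2"?>\n<para>plain text</para>\n'
# ASCII-only bytes decode to the same text under either encoding,
# so only the declaration needs to change for such a file.
assert is_ascii_only(sample)
assert sample.decode("iso-8859-2") == sample.decode("utf-8")
# A single accented byte (0xE1 is "a" with acute in ISO-8859-2) makes it non-ASCII.
assert not is_ascii_only(b"Kov\xe1cs")
```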


On the assumption that the XML encoding="" value is accurate, would
using mb_convert_encoding() be enough?

Find files NOT UTF-8, read XML encoding, use mb_convert_encoding() to
convert file and save.
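
To make those steps concrete, here is a minimal sketch of the same idea in
Python rather than PHP. It is only an outline under stated assumptions: the
declared encoding is trusted, the <?xml ?> declaration sits at the very
start of the file with no byte order mark, and the names convert_to_utf8
and DECL_RE are invented for this example.

```python
import re
from pathlib import Path

# Matches the encoding pseudo-attribute of an XML declaration at the start of the file.
DECL_RE = re.compile(rb'^(<\?xml[^>]*?encoding=["\'])([A-Za-z0-9._-]+)(["\'])')


def convert_to_utf8(path: Path) -> bool:
    """Re-encode one XML file to UTF-8, trusting its declared encoding.

    Returns True if the file was rewritten, False if it was left alone.
    """
    raw = path.read_bytes()
    match = DECL_RE.match(raw)
    if match is None:
        return False  # no declaration found: do not guess
    declared = match.group(2).decode("ascii")
    if declared.lower() in ("utf-8", "utf8"):
        return False  # already declared as UTF-8
    # Raises UnicodeDecodeError if the bytes do not fit the declared encoding.
    text = raw.decode(declared)
    # Rewrite only the declaration, not any later mention of the encoding name.
    prefix_len = len(match.group(0))  # the matched prefix is pure ASCII
    new_prefix = (match.group(1) + b"utf-8" + match.group(3)).decode("ascii")
    path.write_bytes((new_prefix + text[prefix_len:]).encode("utf-8"))
    return True
```

It could be run over one translation with something like
`for p in Path("hu").rglob("*.xml"): convert_to_utf8(p)`, where "hu" is just
the directory name from the file list earlier in the thread. Reading and
writing raw bytes leaves line endings untouched, which matters on Windows.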

If that works, then there are 6913 XML files in phpdoc translations
NOT UTF-8 (3521 ISO-8859-1, 2901 ISO-8859-2, 425 ISO-8859-7, 65 BIG5
and 1 ISO-8859-8).

Having done this, when I toggle between the two versions (ISO encoded and
UTF-8 encoded), my editor doesn't seem to show any differences.

If I do a full file comparison (which is encoding aware), my editor
says the only difference is in the <?xml > line due to the encoding.

Running a diff shows the entire file to be different (as expected).

Is that what you'd expect?
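
One way to double-check this without relying on the editor would be to
decode both versions and compare them as text. The following Python sketch
does that; same_text is a hypothetical helper, and it assumes the <?xml ?>
declaration is the first line of each file.

```python
def same_text(original: bytes, original_encoding: str, converted: bytes) -> bool:
    """Compare an ISO-encoded file with its UTF-8 conversion as text, not bytes."""
    old_lines = original.decode(original_encoding).splitlines()
    new_lines = converted.decode("utf-8").splitlines()
    # Skip the first line, where the <?xml ...?> declaration is expected to differ.
    return old_lines[1:] == new_lines[1:]


old = '<?xml version="1.0" encoding="iso-8859-2"?>\n<para>Kovács</para>\n'.encode("iso-8859-2")
new = '<?xml version="1.0" encoding="utf-8"?>\n<para>Kovács</para>\n'.encode("utf-8")
assert old != new                         # the raw bytes differ
assert same_text(old, "iso-8859-2", new)  # the decoded text matches
```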

-- 
Richard Quadling
Twitter : EE : Zend : PHPDoc
@RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY : bit.ly/lFnVea
