On 5 August 2011 15:37, Ferenc Kovacs wrote:
> On Fri, Aug 5, 2011 at 2:10 PM, Richard Quadling wrote:
>> On 5 August 2011 13:04, Ferenc Kovacs wrote:
>>> On Fri, Aug 5, 2011 at 1:43 PM, Richard Quadling wrote:
>>>> Hello all.
>>>>
>>>> During the last week, I've been converting the HTML Entities in phpdoc
>>>> to their Unicode counterparts, in connection to
>>>> http://news.php.net/php.doc.cvs/8536
>>>>
>>>> "Remove html entities (the english translation no longer uses any.. if
>>>> this breaks translations then they should folow the english one, or if
>>>> to much work, we can revert this commit)"
>>>>
>>>> In examining the translations, there are a significant number of files
>>>> NOT encoded using UTF-8.
>>>>
>>>> As such, embedding a UTF-8 character in these files will produce garbage.
>>>>
>>>> As an English only speaker, I am not confident that my conversion from
>>>> ISO encoding to UTF-8 encoding is accurate, and I have no
>>>> realistic way to check.
>>>>
>>>> So, here is a list of all the files requiring someone with the
>>>> language skills to look at them and manually convert them.
>>>>
>>>> If someone has a routine that can convert ISO encoded XML to UTF-8
>>>> accurately, then I can apply that and then process the entities.
>>>>
>>>> cs/bookinfo.xml
>>>> cs/faq/generanl.xml
>>>> cs/reference/strings/functions/get-html-translation-table.xml
>>>>
>>>> hk/variables.xml
>>>>
>>>> hu/bookinfo.xml
>>>> hu/language/control-structures.xml
>>>> hu/reference/image/functions/imagearc.xml
>>>> hu/reference/mbstring/functions/mb-strtoupper.xml
>>>> hu/reference/recode/functions/recode-string.xml
>>>
>>> hi Richard
>>>
>>> I will fix it for the hungarian files.
>>>
>>> --
>>> Ferenc Kovács
>>> @Tyr43l - http://tyrael.hu
>>>
>>
>> If you could detail what you do in terms of re-encoding, then I'm
>> quite happy to rely on that process for the other files.
>>
>> At some stage, converting all the encoded files to UTF-8 would be a
>> nice step, but that is a significant step. If/when that was
>> undertaken, I'd suggest adding a pre-commit hook to reject non UTF-8
>> encoded XML files from phpdoc.
>>
>> --
>> Richard Quadling
>> Twitter : EE : Zend : PHPDoc
>> @RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY : bit.ly/lFnVea
>>
>
> Hi Richard,
>
> first of all, you have to figure out what encoding the files are
> using, then re-encode the non-utf8 files from that encoding to utf-8.
>
> for getting a list of the non utf-8 files, I've used something like this:
> find ./hu/trunk/ -type f|grep -v '\.svn'|xargs file -i|grep -v 'charset=utf-8'
>
> as I guessed, most of the hungarian files are encoded with iso-8859-2,
> some of them were us-ascii, but I removed those files as they were just
> copied from the en without any modification.
> find ./hu/trunk/ -type f|grep -v '\.svn'|xargs file -i|grep 'us-ascii'|cut -f1 -d ':'|xargs svn rm
>
> with having the us-ascii files removed, I was left with iso-8859 files
> only (which was iso-8859-2 to be correct, the file tool can't tell you
> that, but you can know from the language and the xml encoding
> attribute).
>
> so I converted all of those files to utf-8:
> find ./hu/trunk/ -type f|grep -v '\.svn'|xargs file -i|grep 'charset=iso-8859-1'|cut -f1 -d ':'|xargs recode iso-8859-2..utf-8
>
> and replaced the iso-8859-2 occurrences in the files:
> find ./en/trunk/ -type f|grep -v '\.svn'|cut -f1 -d ':'|xargs sed -i -e "s/\(encoding=[\'\"]\)iso-8859-1/\1utf-8/gI"
>
> it should be noted that there is some documentation where this
> expression would match and replace unintended stuff, like in
> reference/xsl/examples.xml so a better approach would be to parse the
> files as xml documents, and change the encoding attribute.
>
> --
> Ferenc Kovács
> @Tyr43l - http://tyrael.hu
>

If I tell you I'm on Windows ...

For any XML file that contains only ASCII (0x20-0x7F), I've already
changed the xml encoding to UTF-8, as there is no difference in the
byte values.

On the assumption that the XML encoding="" value is accurate, would
using mb_convert_encoding() be enough?

Find files NOT UTF-8, read XML encoding, use mb_convert_encoding() to
convert file and save.

If that works, then there are 6913 XML files in phpdoc translations
NOT UTF-8 (3521 ISO-8859-1, 2901 ISO-8859-2, 425 ISO-8859-7, 65 BIG5
and 1 ISO-8859-8).

In doing this, toggling between the two versions (ISO encoded and
UTF-8 encoded), my editor doesn't seem to show any differences. If I
do a full file comparison (which is encoding aware), my editor says
the only difference is in the <?xml > line due to the encoding.

Running a diff shows the entire file to be different (as expected).

Is that what you'd expect?

--
Richard Quadling
Twitter : EE : Zend : PHPDoc
@RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY : bit.ly/lFnVea
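
The "read the XML encoding, convert, save" routine proposed above could be sketched roughly as follows. This is a Python illustration, not the PHP mb_convert_encoding() call mentioned in the message. The function name convert_to_utf8 and the regular expression are assumptions made for this example. Like the proposal, it trusts the encoding named in the XML declaration and does not try to detect the real one.

```python
import re
from pathlib import Path

# Hypothetical pattern: captures the text up to the opening quote of the
# encoding pseudo-attribute, the encoding name, and the closing quote.
XML_DECL = re.compile(
    rb'(<\?xml\b[^>]*?\bencoding\s*=\s*["\'])([A-Za-z][A-Za-z0-9._-]*)(["\'])'
)


def convert_to_utf8(path: Path) -> bool:
    """Re-encode one XML file to UTF-8, trusting its declared encoding.

    Returns True if the file was rewritten, False if it was left alone
    (no encoding declaration at the start, or already declared as UTF-8).
    """
    raw = path.read_bytes()
    match = XML_DECL.match(raw)
    if match is None:
        return False
    declared = match.group(2).decode("ascii")
    if declared.lower() in ("utf-8", "utf8"):
        return False
    # Strict decoding: raises UnicodeDecodeError if the bytes are not valid
    # in the declared encoding, and LookupError if the name is unknown.
    text = raw[match.end():].decode(declared)
    # Only the declaration is edited, so other mentions of an encoding
    # name elsewhere in the document are left untouched.
    header = match.group(1) + b"utf-8" + match.group(3)
    path.write_bytes(header + text.encode("utf-8"))
    return True
```

It could be applied to a checkout with something like `for p in Path("hu").rglob("*.xml"): convert_to_utf8(p)`, where the directory name is only an example. Because only the declaration at the start of the file is rewritten, this avoids the unintended replacements that a global search-and-replace on the encoding name can cause.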

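
On the editor-versus-diff observation: a character outside ASCII occupies one byte in the ISO-8859 family and two or more bytes in UTF-8, so a byte-oriented diff will report a change on every line containing such a character, while an encoding-aware comparison sees the same text. A minimal check along those lines, assuming the source encoding is known, might look like this. The function name same_text_after_conversion is made up for the example.

```python
import re

# Matches an XML declaration at the very start of the decoded text.
XML_DECL_LINE = re.compile(r'^<\?xml[^>]*\?>')


def same_text_after_conversion(original: bytes, source_encoding: str,
                               converted: bytes) -> bool:
    """Return True if both byte strings decode to the same characters,
    ignoring the XML declaration (which is expected to differ)."""
    before = XML_DECL_LINE.sub("", original.decode(source_encoding), count=1)
    after = XML_DECL_LINE.sub("", converted.decode("utf-8"), count=1)
    return before == after
```

A True result only shows that the conversion was lossless for the encoding that was assumed. It cannot show that the assumed encoding was the right one, which still needs someone who reads the language.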