#28654 [Ana->Asn]: Possible bug in utf8_encode (bit operations)

sniper Sun, 11 Jul 2004 12:34:24 -0700

 ID:               28654
 Updated by:       [EMAIL PROTECTED]
 Reported By:      [EMAIL PROTECTED]
-Status:           Analyzed
+Status:           Assigned
 Bug Type:         *Languages/Translation
 Operating System: WinXP
 PHP Version:      4.3.4
-Assigned To:      
+Assigned To:      moriyoshi
 New Comment:


Moriyoshi: Was that last comment a statement that this is a bug in PHP,
or what? Is this a verified bug? Can you fix it if it is? (If it's not a
bug -> bogus.)



Previous Comments:
------------------------------------------------------------------------

[2004-06-14 21:00:29] [EMAIL PROTECTED]

It looks like you are trying to convert between code page 1252
and UTF-8.

http://www.microsoft.com/globaldev/reference/sbcs/1252.htm

Apart from mbstring, most iconv() implementations also
support CP1252 (a.k.a. IBM1252).

HTH
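
A minimal sketch of the iconv() route suggested above. The helper name
cp1252_to_utf8() is made up for illustration, and the sketch assumes the
iconv extension is loaded and that its implementation knows the name
'CP1252'.

```php
<?php
// Hypothetical helper (not part of PHP): convert CP1252 bytes to UTF-8.
// Assumes the iconv extension is available.
function cp1252_to_utf8($str)
{
    // iconv(from, to, string): read the input as CP1252, emit UTF-8.
    return iconv('CP1252', 'UTF-8', $str);
}

// Byte 147 (0x93) is the left double quotation mark in CP1252.
echo bin2hex(cp1252_to_utf8("\x93")), "\n"; // e2809c, i.e. U+201C
```

With mbstring, mb_convert_encoding($str, 'UTF-8', 'Windows-1252') should
give the same result. Bytes that CP1252 leaves undefined (129, for
example) may make iconv() fail, depending on the implementation.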


------------------------------------------------------------------------

[2004-06-10 00:11:14] [EMAIL PROTECTED]

Hm, which character encoding (ISO standard) am I using (German, Win32)
when I copy and paste Word text into a textarea and post it to a PHP
script? Is it possible to solve my problem by converting my character
encoding to iso-8859-1 with the mb_* functions?

------------------------------------------------------------------------

[2004-06-08 09:38:23] [EMAIL PROTECTED]

utf8_encode only deals with iso-8859-1, which does not define printable
characters in the range from 128 to 159. It should probably just replace
those characters with a question mark, though, as that's how invalid
characters are usually converted.
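
To illustrate the point, here is a rough sketch of the bit operations
behind an ISO-8859-1 to UTF-8 conversion. The function name
latin1_to_utf8() is hypothetical; it is meant to mirror what
utf8_encode() does, not to reproduce PHP's actual source.

```php
<?php
// Hypothetical re-implementation for illustration only.
// Every input byte is treated as the code point of the same number.
function latin1_to_utf8($str)
{
    $out = '';
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($str[$i]);
        if ($b < 0x80) {
            // ASCII is unchanged in UTF-8.
            $out .= chr($b);
        } else {
            // 0x80-0xFF become two bytes: 110000xx 10xxxxxx.
            $out .= chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
        }
    }
    return $out;
}

// Byte 147 (0x93) becomes U+0093, a control character, not a quote.
echo bin2hex(latin1_to_utf8("\x93")), "\n"; // c293
```

Because the byte value is used directly as the code point, a CP1252
quotation mark at 147 can never come out as U+201C this way.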

------------------------------------------------------------------------

[2004-06-06 22:55:32] [EMAIL PROTECTED]

Description:
------------
Hi!

I'm currently developing a nice script that generates OpenOffice SXW
files by filling content.xml (which is UTF-8 encoded) with database
content. While doing this I found that utf8_encode(chr(147)) returns
the bytes 0xC2 0x93, i.e. U+0093, a control character. When I checked
the whole result in OpenOffice, that character is displayed as a square
(character unknown?!). So I ran some tests with UTF-8 conversion
(including the mb_* functions) and found that the characters between
128 and 160 returned by utf8_encode() don't seem to match the standard.
As mentioned above, character 147 is returned as U+0093 but should be
'“' (which is what you get when using UltraEdit for the conversion).

Can anyone give me an explanation here?

I'm not familiar with this UTF-8 / bit-conversion stuff, but I don't
think PHP does what it's supposed to do here. As a first workaround I
simply coded a custom_utf8_encode() that uses its own character map to
override this misbehaviour (see below). Can someone help me out with
this strange bug?!

Regards
Bjoern Kraus


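function custom_utf8_encode($str)
{
    // CP1252 characters 128-159 as UTF-8 strings (save this file as UTF-8).
    // 129, 141, 143, 144 and 157 are undefined in CP1252; they are mapped
    // to the UTF-8 form of the matching C1 control character.
    $chrMap = array(128 => '€', 129 => "\xC2\x81", 130 => '‚', 131 => 'ƒ',
                    132 => '„', 133 => '…', 134 => '†', 135 => '‡',
                    136 => 'ˆ', 137 => '‰', 138 => 'Š', 139 => '‹',
                    140 => 'Œ', 141 => "\xC2\x8D", 142 => 'Ž', 143 => "\xC2\x8F",
                    144 => "\xC2\x90", 145 => '‘', 146 => '’', 147 => '“',
                    148 => '”', 149 => '•', 150 => '–', 151 => '—',
                    152 => '˜', 153 => '™', 154 => 'š', 155 => '›',
                    156 => 'œ', 157 => "\xC2\x9D", 158 => 'ž', 159 => 'Ÿ');

    $newStr = '';
    $len = strlen($str);

    for ($i = 0; $i < $len; $i++) {
        $chrVal = ord($str[$i]);
        if ($chrVal > 127 && $chrVal < 160) {
            // CP1252-specific range: use the lookup table.
            $newStr .= $chrMap[$chrVal];
        } else {
            // Everything else is the same in ISO-8859-1.
            $newStr .= utf8_encode($str[$i]);
        }
    }

    return $newStr;
}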




------------------------------------------------------------------------


-- 
Edit this bug report at http://bugs.php.net/?id=28654&edit=1
