ID:               30549
 User updated by:  david at davidheath dot org
 Reported By:      david at davidheath dot org
 Status:           Open
 Bug Type:         mbstring related
 Operating System: linux
 PHP Version:      4.3.9
 New Comment:

oops, minor bug in that script. Line 35 should read:

            printf("  incorrect mapping of char 0x%x: got 0x%x, expected 0x%x\n", $fromChar, $unicodeCharNumber[''], $expectChar);

Corrected version of script for your cut+paste convenience:

<?php

testMapping('ISO-8859-7',
            array(
                0xa4=>0x20ac,
                0xa5=>0x20af,
                0xaa=>0x37a)
            );

testMapping('ISO-8859-8',
            array(
                0xaf=>0xaf,
                0xfd=>0x200e,
                0xfe=>0x200f)
            );

testMapping('ISO-8859-10',
            array(
                0xa4=>0x12a
                )
            );

function testMapping($targetEncoding, $map) {
    print "Encoding: $targetEncoding\n";

    foreach($map as $fromChar=>$toChar) {
        $expectChar = $toChar;

        // convert to UCS-4, which represents every possible unicode
        // char as a single fixed width 32bit value
        $unicodeChar=mb_convert_encoding(chr($fromChar), 'UCS-4LE', $targetEncoding);
        $unicodeCharNumber = unpack('L', $unicodeChar);
        
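        // report a mismatch, except where the char is unmapped on both
        // sides (assumed intent: expected 0 and mbstring returned '?', 0x3f)
        if ($expectChar!=$unicodeCharNumber[''] and ($expectChar!=0 or $unicodeCharNumber['']!=0x3f)) {
            printf("  incorrect mapping of char 0x%x: got 0x%x, expected 0x%x\n", $fromChar, $unicodeCharNumber[''], $expectChar);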
        }
    }
}
?>

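A note on the unpack() call in the script above: format 'L' uses the
machine's byte order, and the script reads the value under the key '',
which evidently worked on 4.3.9 but is not what current PHP versions
return for an unnamed element (they use 1). A minimal sketch of a more
portable decode follows. The helper name codePointFromUcs4le and the key
name 'code' are made up for illustration and are not part of the posted
script.

```php
<?php
// Hypothetical helper (not part of the posted script): decode one
// UCS-4LE character (4 bytes) into its integer code point.
function codePointFromUcs4le($bytes) {
    // 'V' is an unsigned 32-bit value, always little-endian, which
    // matches the 'UCS-4LE' target regardless of the host byte order.
    // Naming the element ('code') avoids relying on the default key.
    $parts = unpack('Vcode', $bytes);
    return $parts['code'];
}

// Usage, only if the mbstring extension is loaded:
if (function_exists('mb_convert_encoding')) {
    $ucs4 = mb_convert_encoding(chr(0xa4), 'UCS-4LE', 'ISO-8859-7');
    printf("0xa4 -> 0x%x\n", codePointFromUcs4le($ucs4));
}
```

With this helper, testMapping() could compare $expectChar against the
returned integer directly instead of indexing the unpack() result.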

Previous Comments:
------------------------------------------------------------------------

[2004-10-25 13:25:30] david at davidheath dot org

Hi Derick,

OK, I included the charset map parsing code so that you could see that
I was deriving the mappings directly from the Unicode mapping files.

Anyway, here is a lean-and-mean version:

<?php

testMapping('ISO-8859-7',
            array(
                0xa4=>0x20ac,
                0xa5=>0x20af,
                0xaa=>0x37a)
            );

testMapping('ISO-8859-8',
            array(
                0xaf=>0xaf,
                0xfd=>0x200e,
                0xfe=>0x200f)
            );

testMapping('ISO-8859-10',
            array(
                0xa4=>0x12a
                )
            );

function testMapping($targetEncoding, $map) {
    print "Encoding: $targetEncoding\n";

    foreach($map as $fromChar=>$toChar) {
        $expectChar = $toChar;

        // convert to UCS-4, which represents every possible unicode
        // char as a single fixed width 32bit value
        $unicodeChar=mb_convert_encoding(chr($fromChar), 'UCS-4LE', $targetEncoding);
        $unicodeCharNumber = unpack('L', $unicodeChar);
        
        if ($expectChar!=$unicodeCharNumber[''] and ($expectChar!=0 and $unicodeCharNumber!=0x3f)) {
            printf("  incorrect mapping of char 0x%x: got 0x%x, expected 0x%x\n", $char, $unicodeCharNumber[''], $expectChar);
        }
    }
}
?>

------------------------------------------------------------------------

[2004-10-25 10:33:42] [EMAIL PROTECTED]

Hello David,

can you please make a *short* script that shows that the warnings are
wrong? It takes quite some time to figure out exactly what your script
is doing.

regards,
Derick

------------------------------------------------------------------------

[2004-10-25 09:53:55] david at davidheath dot org

Description:
------------
mbstring appears to map some characters incorrectly for the following
ISO-8859 charsets:

Encoding: ISO-8859-7
  incorrect mapping of char 0xa4: got 0x3f, expected 0x20ac
  incorrect mapping of char 0xa5: got 0x3f, expected 0x20af
  incorrect mapping of char 0xaa: got 0x3f, expected 0x37a
Encoding: ISO-8859-8
  incorrect mapping of char 0xaf: got 0x203e, expected 0xaf
  incorrect mapping of char 0xfd: got 0x3f, expected 0x200e
  incorrect mapping of char 0xfe: got 0x3f, expected 0x200f
Encoding: ISO-8859-10
  incorrect mapping of char 0xa4: got 0x124, expected 0x12a

This is based on the mappings provided at
ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/ on 25th Oct 2004. 

Note, there are undated comments in the "Version history" for the above
files, as follows:

8859-7:
#       2.0 version updates 1.0 version by adding mappings for the
#       three newly added characters 0xA4, 0xA5, 0xAA.

8859-8:
#       1.1 version updates to the published 8859-8:1999, correcting
#          the mapping of 0xAF and adding mappings for LRM and RLM.

8859-10:
#       1.1 corrected mistake in mapping of 0xA4

So I guess these mappings have changed since mbstring was first
written. I'm not sure whether changing the mappings would cause a
backward-compatibility problem.

Thanks

Dave


Reproduce code:
---------------
Code for this test is available at:

http://davidheath.org/mbstring/mbstring_test.tar.bz2
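
For readers without the tarball: the Unicode mapping files referenced
above list one mapping per line, as the charset byte in hex, the Unicode
code point in hex, and a '#' comment with the character name. The
following is a minimal sketch of turning such text into the arrays that
testMapping() takes. It is not the code from the tarball, and
parseMappingFile is a hypothetical name.

```php
<?php
// Hypothetical helper: parse the text of a Unicode.org ISO8859 mapping
// file into array(charsetByte => unicodeCodePoint).
function parseMappingFile($text) {
    $map = array();
    foreach (preg_split('/\r\n|\r|\n/', $text) as $line) {
        // Data lines look like "0xA4<TAB>0x20AC<TAB>#<TAB>EURO SIGN".
        // Comment lines (starting with '#') and blank lines do not match.
        if (preg_match('/^0x([0-9A-Fa-f]+)\s+0x([0-9A-Fa-f]+)/', $line, $m)) {
            $map[hexdec($m[1])] = hexdec($m[2]);
        }
    }
    return $map;
}

// Sample input in the mapping-file layout (three lines from 8859-7).
$sample = "#\tISO/IEC 8859-7 to Unicode\n"
        . "0xA4\t0x20AC\t#\tEURO SIGN\n"
        . "0xA5\t0x20AF\t#\tDRACHMA SIGN\n"
        . "0xAA\t0x037A\t#\tGREEK YPOGEGRAMMENI\n";

$map = parseMappingFile($sample);
foreach ($map as $fromChar => $toChar) {
    printf("0x%x => 0x%x\n", $fromChar, $toChar);
}
```

The resulting array can be passed as the second argument of
testMapping() in the short script.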


Expected result:
----------------
Mappings as stated "expected xxx" above.

Actual result:
--------------
Mappings as stated "got xxx" above.


------------------------------------------------------------------------


-- 
Edit this bug report at http://bugs.php.net/?id=30549&edit=1
