Edit report at https://bugs.php.net/bug.php?id=62466&edit=1

 ID:                 62466
 Updated by:         [email protected]
 Reported by:        ed at grooveshark dot com
 Summary:            levenshtein returns bytes different, not characters
                     different
-Status:             Open
+Status:             Not a bug
 Type:               Bug
 Package:            I18N and L10N related
 PHP Version:        5.4.4
 Block user comment: N
 Private report:     N

 New Comment:

PHP strings are byte strings, and are not Unicode aware. This generally extends
to string functions unless documented otherwise.
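
As an illustration of the byte-oriented behaviour, here is a minimal sketch. It
assumes the two characters from the test script are UTF-8 encoded, and writes
them as byte escapes so the result does not depend on the encoding of the
source file. The variable names are illustrative only.

```php
<?php
// Assumption: UTF-8 input, spelled out byte by byte.
$a = "\xE6\x97\xA5"; // U+65E5, the first character in the test script
$b = "\xE8\xAA\x9E"; // U+8A9E, the second character in the test script

printf("%d\n", strlen($a));          // 3: strlen() counts bytes
printf("%s\n", bin2hex($a));         // e697a5
printf("%s\n", bin2hex($b));         // e8aa9e
printf("%d\n", levenshtein($a, $b)); // 3: all three bytes differ
```

Each character occupies three bytes and no byte is shared between the two, so
a byte-wise edit distance of 3 is consistent with the reported output.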


Previous Comments:
------------------------------------------------------------------------
[2012-07-02 19:22:32] ed at grooveshark dot com

Description:
------------
The PHP levenshtein() function, documented here:

http://php.net/manual/en/function.levenshtein.php

does not perform as stated with Unicode characters that take more than 1 byte
to encode. The code sample below prints a difference of 3 when it should be 1.
The two Japanese characters in the sample were picked at random and each take
3 bytes in UTF-8. The same behavior can be seen when comparing an ASCII single
quote to a Unicode right single quotation mark, which takes 3 bytes in UTF-8
versus 1 byte for the ASCII character.

Test script:
---------------
<?php
printf("%d\n", levenshtein("日", "語"));
?>




Expected result:
----------------
Expected Output: 1

Actual result:
--------------
Actual Output:   3
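
For reference, a character-level distance can be computed in userland. The
sketch below is one possible approach, not something PHP provides:
utf8_levenshtein() is a hypothetical helper name, and it assumes valid UTF-8
input and a PCRE build with UTF-8 support. It splits each string into code
points and runs the standard two-row dynamic-programming algorithm with unit
costs.

```php
<?php
// Hypothetical helper, not part of PHP: Levenshtein distance counted in
// code points rather than bytes. Assumes valid UTF-8 input.
function utf8_levenshtein($s, $t)
{
    // The /u modifier makes PCRE treat the subject as UTF-8, so splitting
    // on the empty pattern yields one array element per code point.
    $a = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    $b = preg_split('//u', $t, -1, PREG_SPLIT_NO_EMPTY);
    if ($a === false || $b === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    $m = count($a);
    $n = count($b);

    // $prev[$j] is the distance between the first $i - 1 code points of $s
    // and the first $j code points of $t.
    $prev = range(0, $n);
    for ($i = 1; $i <= $m; $i++) {
        $curr = array($i);
        for ($j = 1; $j <= $n; $j++) {
            $cost = ($a[$i - 1] === $b[$j - 1]) ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,         // deletion
                $curr[$j - 1] + 1,     // insertion
                $prev[$j - 1] + $cost  // substitution or match
            );
        }
        $prev = $curr;
    }
    return $prev[$n];
}

// The two characters from the test script, as UTF-8 byte escapes.
printf("%d\n", utf8_levenshtein("\xE6\x97\xA5", "\xE8\xAA\x9E")); // 1
// ASCII single quote versus U+2019 right single quotation mark.
printf("%d\n", utf8_levenshtein("'", "\xE2\x80\x99"));            // 1
```

This counts code points, not user-perceived characters, so a base letter
followed by a combining mark still counts as two units.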


------------------------------------------------------------------------


