Edit report at https://bugs.php.net/bug.php?id=62466&edit=1
ID:                 62466
Updated by:         [email protected]
Reported by:        ed at grooveshark dot com
Summary:            levenshtein returns bytes different, not characters different
-Status:            Open
+Status:            Not a bug
Type:               Bug
Package:            I18N and L10N related
PHP Version:        5.4.4
Block user comment: N
Private report:     N

New Comment:

PHP strings are byte strings and are not Unicode aware. This generally extends to string functions unless documented otherwise.

Previous Comments:
------------------------------------------------------------------------
[2012-07-02 19:22:32] ed at grooveshark dot com

Description:
------------
The PHP levenshtein function, documented here:
http://php.net/manual/en/function.levenshtein.php
does not perform as stated with Unicode characters longer than 1 byte. The code sample below prints a character difference of 3 when it should be 1. The characters below are some random Japanese characters that each take 3 bytes when encoded as UTF-8. The same behavior can be seen when comparing an ASCII single quote to a Unicode right single quote, which also takes 3 bytes versus the single byte for the ASCII character.

Test script:
---------------
<?php
printf("%d\n", levenshtein("æ¥", "èª"));
?>

Expected result:
----------------
Expected Output: 1

Actual result:
--------------
Actual Output: 3

------------------------------------------------------------------------
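
For readers who need the character count the reporter expected, a minimal sketch of a code-point-aware distance follows. It is not part of PHP and not something proposed in this report: the function name utf8_levenshtein is hypothetical, the inputs are assumed to be valid UTF-8, and the two sample characters (U+65E5 and U+8A9E, written as byte escapes) are stand-ins chosen for illustration rather than the characters from the original test script.

```php
<?php
// Hypothetical helper (not a PHP built-in): Levenshtein distance counted in
// Unicode code points instead of bytes. Assumes both inputs are valid UTF-8.
function utf8_levenshtein($a, $b)
{
    // The /u modifier makes PCRE treat the subject as UTF-8, so splitting on
    // the empty pattern yields one array element per code point.
    $s = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY);
    $t = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY);
    if ($s === false || $t === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    $m = count($s);
    $n = count($t);

    // Classic two-row dynamic programme: $prev[$j] is the distance between
    // the first $i - 1 code points of $a and the first $j code points of $b.
    $prev = range(0, $n);
    for ($i = 1; $i <= $m; $i++) {
        $curr = array($i);
        for ($j = 1; $j <= $n; $j++) {
            $cost = ($s[$i - 1] === $t[$j - 1]) ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,         // deletion
                $curr[$j - 1] + 1,     // insertion
                $prev[$j - 1] + $cost  // substitution or match
            );
        }
        $prev = $curr;
    }
    return $prev[$n];
}

// Illustrative inputs: two 3-byte UTF-8 characters that share no bytes.
$x = "\xE6\x97\xA5"; // U+65E5
$y = "\xE8\xAA\x9E"; // U+8A9E

printf("%d\n", levenshtein($x, $y));      // byte-based built-in: 3
printf("%d\n", utf8_levenshtein($x, $y)); // code-point-based sketch: 1
?>
```

The sketch compares code points, not user-perceived characters, so sequences that use combining marks would still count as more than one unit. The ASCII quote versus right single quote case from the description ("'" against "\xE2\x80\x99") likewise yields 3 with the built-in and 1 with the sketch.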
