From:
Operating system: Microsoft Windows XP SP3
PHP version:      5.2.17
Package:          ICONV related
Bug Type:         Bug
Bug description:  ICONV returns strings with excessive memory usage
Description:
------------
PHP 5.2.17 / libiconv 1.11 / Windows XP SP3

It would appear that, on my machine at least, the result returned by iconv uses the same amount of memory as the input string, even if it doesn't actually need to. This only happens when the result is smaller than the input string. When the result is bigger than the input string, e.g. when converting ISO-8859-1 characters above 0x7F to UTF-8, the resulting memory usage is as expected.

To demonstrate, the example code initializes an array of 4 UTF-8 strings, which I have named: n-tilde; multiplication; cyrillic-i; and invalid. Each 1MB string is repeatedly (for dramatic effect) transliterated to ASCII, and the resulting string is stored in a buffer array. The memory usage before and after these repeated transliterations is recorded and displayed. The difference between the two therefore closely approximates the memory usage of the buffer array.

During the transliteration the following occurs:

  n-tilde: each 2-byte UTF-8 character, U+00F1, is transliterated to the 2-byte ASCII sequence '~n', so each buffer should use 1MB.

  multiplication: each 2-byte UTF-8 character, U+00D7, is transliterated to the 1-byte ASCII sequence 'x', so each buffer should use 0.5MB.

  cyrillic-i: each 2-byte UTF-8 character, U+0438, is ignored since there is no transliteration, so iconv returns the empty string. Therefore, each buffer should use 0MB.

  invalid: 0xFF is invalid in UTF-8, so iconv stops processing the input string at the first character, generates an E_NOTICE (which I mask to make the output more readable) and returns the incomplete result, the empty string. Therefore, each buffer should use 0MB.

I am aware that storing the array takes ~68 bytes per entry, plus the size of the data. In this case, however, 16 entries plus index strings only amount to ~1KB, which is insignificant compared to the results.
Keeping this in mind though, you would expect the additional memory usage caused by the creation of the 16-entry buffer array to be:

  n-tilde:        ~16MB (16 buffers @ 1MB each)
  multiplication: ~8MB  (16 buffers @ 0.5MB each)
  cyrillic-i:     ~1KB  (16 buffers @ 0MB each)
  invalid:        ~1KB  (16 buffers @ 0MB each)

This ties in very neatly with my expected results, as shown. However, the actual results are significantly different. As you can see, the buffer array for each string uses 16MB. Note that this is 16 buffers @ 1MB (the size of the input string). Obviously, this should not be the case. An array of 16 empty strings, in the cases of the cyrillic-i and invalid tests, should not use 16MB of memory.

Although I haven't shown it here for brevity, the contents of the buffer after, for example, the invalid test are indeed 16 empty strings which behave as empty strings should. They work just fine. When you strlen them, they report being zero-length, as you would expect. But they still use 1MB of memory each.

The interesting thing is that if you concatenate all the empty strings together and save the result in a separate string, that string only uses a few bytes, as you would expect. So as soon as you perform any string operation on them, the resulting strings use the expected amount of memory.

So to get the expected results shown here, I simply cast the result of the iconv call to a string, i.e. $buffer = (string)@iconv(...);. Now, obviously, at least logically, this should make no difference. After all, I'm casting a string to a string. But since a cast in PHP is an operator, it returns a new value: in this case, a new string with the same value and corrected memory usage.

You can change the number of repetitions and/or the input string sizes. The pattern remains the same. The result strings (if smaller) always end up using the same amount of memory as the input string. Change the to- and from- charsets, and the pattern remains.
Remove the ignore and/or translit flags, and it doesn't matter. You still end up with strings that take up more space than they should.

I looked at the iconv source code and, to be honest, as I'm not a developer of PHP or PHP modules, it didn't make a whole lot of sense, and I didn't spend a whole lot of time trying to get my head around it. That's for another day/year/life :) I don't know the inner workings of PHP, how it passes data around, or how that ends up as a PHP value accessible in a PHP script. But I do understand the principles.

Anyway, my best assumption is that when PHP's iconv wrapper is called, an output buffer the size of the input buffer is created and passed to libiconv. When libiconv returns, PHP's iconv wrapper packages that buffer as a PHP string and makes it accessible to the PHP script. The results shown here would indicate that nowhere along the way is the output buffer's memory allocation shrunk to fit the size of the data actually returned. Therefore, you end up with a PHP empty string (for example) that actually uses 1MB of memory.
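For illustration, the cast workaround described above could be wrapped in a small helper. This is only a minimal sketch: the name iconv_compact is hypothetical (it is not part of PHP or the iconv extension), and unlike a bare (string) cast it passes a false return value through unchanged. The memory effect of the cast is what was observed on PHP 5.2.17 in this report; whether other PHP versions copy the string on such a cast is an assumption, not something verified here. The usage lines use UTF-8 to ISO-8859-1 simply because it is a conversion whose result is smaller than its input.

```php
<?php
// Hypothetical helper, not part of PHP or the iconv extension.
// It applies the (string) cast workaround from this report to iconv()'s result.
function iconv_compact($from, $to, $input)
{
    // '@' masks the E_NOTICE raised for invalid input, as in the test script.
    $result = @iconv($from, $to, $input);

    if ($result === false) {
        // Pass a failed conversion through unchanged so callers can detect it.
        return false;
    }

    // On PHP 5.2.17 the cast was observed to yield a new string sized to its
    // contents. Assumption: other PHP versions may treat this cast as a no-op.
    return (string) $result;
}

// Usage: the result (1 byte per character) is smaller than the input (2 bytes).
$input  = str_repeat("\xC3\xB1", 4);                    // U+00F1 x 4 in UTF-8
$output = iconv_compact('UTF-8', 'ISO-8859-1', $input); // "\xF1" x 4
echo strlen($input), ' -> ', strlen($output), PHP_EOL;  // prints "8 -> 4"
```

In the test script, the call inside the loop would then read $buffer[] = iconv_compact('UTF-8', 'ASCII//IGNORE//TRANSLIT', $value);.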
Test script:
---------------
<?php
// Four 1MB inputs; 'invalid' is deliberately not valid UTF-8.
$strings = array(
    'n-tilde'        => str_repeat("\xC3\xB1", 512 * 1024),
    'multiplication' => str_repeat("\xC3\x97", 512 * 1024),
    'cyrillic-i'     => str_repeat("\xD0\xB8", 512 * 1024),
    'invalid'        => str_repeat("\xFF", 1024 * 1024),
);

foreach ($strings as $name => $value) {
    $before = round(memory_get_usage() / (1024 * 1024), 4);

    // Keep 16 conversion results alive so their memory cost is measurable.
    $buffer = array();
    for ($i = 0; $i < 16; ++$i)
        $buffer[] = @iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $value);

    $after = round(memory_get_usage() / (1024 * 1024), 4);
    unset($buffer);

    echo "{$name}: before={$before}MB, after={$after}MB", PHP_EOL;
}

Expected result:
----------------
n-tilde: before=4.0695MB, after=20.0712MB
multiplication: before=4.0697MB, after=12.0712MB
cyrillic-i: before=4.0697MB, after=4.0712MB
invalid: before=4.0697MB, after=4.0712MB

Actual result:
--------------
n-tilde: before=4.0694MB, after=20.0715MB
multiplication: before=4.0696MB, after=20.0716MB
cyrillic-i: before=4.0696MB, after=20.0716MB
invalid: before=4.0696MB, after=20.0716MB

-- 
Edit bug report at http://bugs.php.net/bug.php?id=54053&edit=1