From:             
Operating system: Microsoft Windows XP SP3
PHP version:      5.2.17
Package:          ICONV related
Bug Type:         Bug
Bug description: ICONV returns strings with excessive memory usage

Description:
------------
PHP 5.2.17 / libiconv 1.11 / Windows XP SP3



It would appear that, on my machine at least, the result returned by iconv
uses the same amount of memory as the input string, even if it doesn't
actually need to. This only happens when the result is smaller than the
input string. When the result is bigger than the input string, i.e. going
from ISO-8859-1 characters above 0x7F, to UTF-8, the resulting memory usage
is as expected.



To demonstrate, the example code initializes an array of 4 UTF-8 strings,
which I have named: n-tilde; multiplication; cyrillic-i; and invalid. Each
1MB string is repeatedly (for dramatic effect) transliterated to ASCII, and
the resulting string is stored in a buffer array. The memory usage before
and after these repeated transliterations is recorded and displayed. The
difference between the two readings therefore closely approximates the
memory usage of the buffer array.



During the transliteration the following occurs:



n-tilde: each 2-byte UTF-8 character, U+00F1, is transliterated to the
2-byte ASCII sequence '~n', so each buffer should use 1MB.

multiplication: each 2-byte UTF-8 character, U+00D7, is transliterated to
the 1-byte ASCII sequence 'x', so each buffer should use 0.5MB.

cyrillic-i: each 2-byte UTF-8 character, U+0438, is ignored since there is
no transliteration. So iconv returns the empty string. Therefore, each
buffer should use 0MB.

invalid: 0xFF is invalid in UTF-8 so iconv stops processing the input
string at the first character, generates an E_NOTICE (which I mask to make
the output more readable) and returns the incomplete result, the empty
string. Therefore, each buffer should use 0MB.
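
The four cases above can be checked on short inputs before running the full
test. This is only a minimal sketch, assuming the iconv extension is loaded;
the helper name translit_length is hypothetical, and the exact output depends
on the underlying iconv implementation (libiconv 1.11 in this report), so no
expected output is given.

```php
<?php
// Hypothetical helper: transliterate a UTF-8 string to ASCII with the same
// flags as the test script and return the byte length of the result.
function translit_length($utf8)
{
    // @ masks the E_NOTICE raised for invalid input, as in the test script.
    $result = @iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $utf8);
    // Some builds return false instead of a partial result; count that as 0.
    return $result === false ? 0 : strlen($result);
}

// One character (or one invalid byte) per case.
$samples = array(
    'n-tilde'        => "\xC3\xB1", // U+00F1
    'multiplication' => "\xC3\x97", // U+00D7
    'cyrillic-i'     => "\xD0\xB8", // U+0438
    'invalid'        => "\xFF",     // not valid UTF-8
);

foreach ($samples as $name => $bytes) {
    echo $name, ': ', strlen($bytes), ' byte(s) in, ',
        translit_length($bytes), ' byte(s) out', PHP_EOL;
}
```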



I am aware that storing the array takes ~68 bytes per entry plus the size
of the data. In this case, however, 16 entries plus index strings only
amount to ~1KB, which is insignificant compared to the results. Keeping
this in mind, you would expect the additional memory usage caused by
creating the 16-entry buffer array to be:



~16MB for n-tilde (16 buffers @ 1MB each);

~8MB for multiplication (16 buffers @ 0.5MB each);

~1KB for cyrillic-i (16 buffers @ 0MB each);

~1KB for invalid (16 buffers @ 0MB each).
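
The arithmetic behind these estimates can be written out as a short sketch.
The function name expected_growth_mb and the ratio table are illustrative
only; the ~1KB of array overhead is left out.

```php
<?php
// Expected growth of the buffer array, using the sizes stated in the report.
$inputBytes  = 1024 * 1024; // each input string is 1MB
$repetitions = 16;          // results stored per test

// Output bytes produced per input byte, per the case descriptions.
$ratios = array(
    'n-tilde'        => 1.0, // 2 bytes in, 2 bytes out
    'multiplication' => 0.5, // 2 bytes in, 1 byte out
    'cyrillic-i'     => 0.0, // no transliteration, empty result
    'invalid'        => 0.0, // conversion stops at the first byte
);

// Hypothetical helper: string data held by the buffer array, in MB.
function expected_growth_mb($inputBytes, $ratio, $repetitions)
{
    return $inputBytes * $ratio * $repetitions / (1024 * 1024);
}

foreach ($ratios as $name => $ratio) {
    printf("%s: ~%.1fMB\n", $name,
        expected_growth_mb($inputBytes, $ratio, $repetitions));
}
```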



This ties in very neatly with my expected results, as shown. However, the
actual results are significantly different. As you can see, the buffer for
each string uses 16MB. Note that this is 16 buffers @ 1MB (the size of the
input string). Obviously, this should not be the case. An array of 16 empty
strings, in the cases of the cyrillic-i and invalid tests, should not use
16MB of memory. Although I haven't shown it here for brevity, the contents
of the buffer after, for example, the invalid test, are indeed 16 empty
strings which act like empty strings should. They work just fine. They just
use 1MB of memory each. When you strlen them, they report being zero-length
as you would expect. But they still use 1MB each. The interesting thing
about them is that if you concatenate all the empty strings together and
save the result in a separate string, that string only uses a few bytes,
as you would expect. So as soon as you perform any string operation on
them, the resulting strings use the expected amount of memory.
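
The mismatch between reported length and memory held can be observed for a
single result with a sketch along these lines. The helper name
measure_result is hypothetical, and the growth figure will differ between
PHP versions and builds, so no expected output is shown.

```php
<?php
// Hypothetical helper: return the length of an iconv() result together with
// the growth in memory_get_usage() while that result is still held.
function measure_result($input)
{
    $before = memory_get_usage();
    $result = @iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $input);
    $after  = memory_get_usage(); // $result is still alive at this point
    return array(
        'length' => $result === false ? 0 : strlen($result),
        'growth' => $after - $before,
    );
}

// The "invalid" case from the report: 1MB of 0xFF bytes.
$info = measure_result(str_repeat("\xFF", 1024 * 1024));
echo "length={$info['length']} bytes, growth={$info['growth']} bytes", PHP_EOL;
```

On an affected build, the length should be 0 while the growth stays close
to the size of the input.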



So to get the expected results shown here, I simply cast the result of the
iconv call as a string, i.e. $buffer = (string)@iconv(...);. Now,
obviously, at least logically, this should make no difference. After all,
I'm casting a string as a string. But since a cast in PHP is an operator,
it returns a new value: in this case, a new string with the same value and
the correct memory usage.
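
The workaround could be wrapped in a small function, sketched below. The
name iconv_compact is hypothetical. The cast is only known to produce a
right-sized copy on the PHP 5.2.17 setup described here; it is an
assumption that other PHP versions behave the same way, and they may treat
the cast as a no-op.

```php
<?php
// Hypothetical wrapper around iconv() that applies the cast workaround.
function iconv_compact($from, $to, $input)
{
    $result = @iconv($from, $to, $input);
    if ($result === false) {
        return false; // pass conversion failures through unchanged
    }
    // On the reporter's PHP 5.2.17 this cast yields a new string whose
    // allocation matches its length. Assumption: not guaranteed elsewhere.
    return (string) $result;
}

$buffer = iconv_compact('UTF-8', 'ASCII//IGNORE//TRANSLIT', "\xC3\x97");
```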



You can change the number of repetitions, and/or the input string sizes.
The pattern remains the same. The result strings (if smaller) always end up
using the same amount of memory as the input string. Change the to- and
from- charsets, the pattern remains. Remove the ignore and/or translit
flags, it doesn't matter. You still end up with strings that take up more
space than they should.



I looked at the iconv source code, and to be honest, as I'm not a developer
of PHP or PHP modules, it didn't make a whole lot of sense, and I didn't
spend a whole lot of time trying to get my head around it. That's for
another day/year/life :) I don't know the inner workings of PHP or how it
passes data around, or how that ends up as a PHP value accessible in PHP
script. But I do understand the principles. Anyway, my best assumption is
that when PHP's iconv wrapper is called, an output buffer the size of the
input buffer is created and passed to libiconv. When libiconv returns,
PHP's iconv wrapper then packages that buffer as a PHP string and makes it
accessible to the PHP script. The results shown here would indicate that
nowhere along the way is the output buffer's memory allocation shrunk to
fit the size of the actual data returned. Therefore, you end up with a PHP
empty string (for example) that actually uses 1MB of memory.
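
The hypothesis above can be expressed as a toy model. This is not PHP's
actual implementation, only an illustration of the assumed behaviour; the
function name modelled_allocation and its parameters are made up for this
sketch.

```php
<?php
// Toy model of the hypothesis: the wrapper allocates an output buffer at
// least as large as the input and, unless told to, never shrinks it.
function modelled_allocation($inputBytes, $outputBytes, $shrinkToFit)
{
    // If the output is larger than the input, the buffer must have grown.
    $capacity = max($inputBytes, $outputBytes);
    return $shrinkToFit ? $outputBytes : $capacity;
}

$mb = 1024 * 1024;
// Output sizes for a 1MB input, per the case descriptions in the report.
$outputs = array(
    'n-tilde'        => $mb,
    'multiplication' => $mb / 2,
    'cyrillic-i'     => 0,
    'invalid'        => 0,
);

foreach ($outputs as $name => $out) {
    $held = 16 * modelled_allocation($mb, $out, false) / $mb;
    $fit  = 16 * modelled_allocation($mb, $out, true) / $mb;
    echo "{$name}: without shrink={$held}MB, with shrink={$fit}MB", PHP_EOL;
}
```

In this model the "without shrink" column matches the actual results and
the "with shrink" column matches the expected results, which is consistent
with the hypothesis but does not prove it.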



Test script:
---------------
<?php
// Four 1MB UTF-8 inputs; the last one is deliberately invalid UTF-8.
$strings = array(
    'n-tilde' => str_repeat("\xC3\xB1", 512 * 1024),
    'multiplication' => str_repeat("\xC3\x97", 512 * 1024),
    'cyrillic-i' => str_repeat("\xD0\xB8", 512 * 1024),
    'invalid' => str_repeat("\xFF", 1024 * 1024),
);

foreach ($strings as $name => $value) {
    // Memory usage in MB before any results are stored.
    $before = round(memory_get_usage() / (1024 * 1024), 4);
    $buffer = array();
    for ($i = 0; $i < 16; ++$i) {
        // @ masks the E_NOTICE raised for the invalid input.
        $buffer[] = @iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $value);
    }
    // Memory usage in MB while all 16 results are still held.
    $after = round(memory_get_usage() / (1024 * 1024), 4);
    unset($buffer);
    echo "{$name}:  before={$before}MB, after={$after}MB", PHP_EOL;
}



Expected result:
----------------
n-tilde:  before=4.0695MB, after=20.0712MB
multiplication:  before=4.0697MB, after=12.0712MB
cyrillic-i:  before=4.0697MB, after=4.0712MB
invalid:  before=4.0697MB, after=4.0712MB



Actual result:
--------------
n-tilde:  before=4.0694MB, after=20.0715MB
multiplication:  before=4.0696MB, after=20.0716MB
cyrillic-i:  before=4.0696MB, after=20.0716MB
invalid:  before=4.0696MB, after=20.0716MB



-- 
Edit bug report at http://bugs.php.net/bug.php?id=54053&edit=1