Edit report at http://bugs.php.net/bug.php?id=54053&edit=1

 ID:                 54053
 User updated by:    r3z at pr0j3ctr3z dot com
 Reported by:        r3z at pr0j3ctr3z dot com
 Summary:            iconv returns strings with excessive memory usage
 Status:             Bogus
 Type:               Bug
 Package:            ICONV related
 Operating System:   Windows XP SP3
 PHP Version:        5.2.17
 Block user comment: N
 Private report:     N

 New Comment:

Please re-open this bug report. The issue is as described.



Checking memory usage outside the loop shows stabilized memory usage
only because, by that point, the buffer that stores the results from
iconv has been unset, which frees the memory it used. Regardless, this
is not a memory manager issue.
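
To illustrate why the sampling point matters, here is a minimal sketch
that reads memory_get_usage() before the buffer is filled, while it is
still held, and after it is unset. The helper name sample_buffer_memory
is hypothetical, and str_repeat is only a stand-in for the iconv call,
so this shows the measurement method rather than the bug itself.

```php
<?php
// Hypothetical helper: sample memory at three points in a buffer's lifetime.
function sample_buffer_memory($count, $size)
{
    $before = memory_get_usage();
    $buffer = array();
    for ($i = 0; $i < $count; ++$i) {
        // Stand-in for an iconv() result of $size bytes.
        $buffer[] = str_repeat('x', $size);
    }
    $held = memory_get_usage();     // buffer still alive
    unset($buffer);
    $released = memory_get_usage(); // buffer freed again
    return array('before' => $before, 'held' => $held, 'released' => $released);
}

$samples = sample_buffer_memory(16, 1024 * 1024);
foreach ($samples as $point => $bytes) {
    echo $point, ': ', round($bytes / (1024 * 1024), 4), 'MB', PHP_EOL;
}
```

Only the 'held' sample reflects the cost of the stored results; a
reading taken after unset says nothing about how large each result was.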



The problem is that the function php_iconv_string, in
ext/iconv/iconv.c, allocates an output buffer the same size as the
input buffer and does not shrink the allocated block to the actual size
of the result before returning. When the result is much smaller than
the input, this can waste a great deal of memory.



The attached patch for ext/iconv/iconv.c, made against the PHP 5.2.17
source code, resolves this issue by modifying php_iconv_string so that
it resizes the output buffer to the actual size of the string it
contains before returning.


Previous Comments:
------------------------------------------------------------------------
[2011-02-19 20:39:05] scott...@php.net

.

------------------------------------------------------------------------
[2011-02-19 20:38:24] scott...@php.net

Already works like you describe, only the memory required is copied from
the iconv buffer.

Add a check outside the loop and you'll see it's stabilised again back
to 4mb. This is just the way the memory manager works.

------------------------------------------------------------------------
[2011-02-19 06:43:23] r3z at pr0j3ctr3z dot com

Made minor alteration to the summary

------------------------------------------------------------------------
[2011-02-19 06:30:51] r3z at pr0j3ctr3z dot com

Description:
------------
PHP 5.2.17 / libiconv 1.11 / Windows XP SP3



It would appear that, on my machine at least, the result returned by
iconv uses the same amount of memory as the input string, even when it
does not need to. This only happens when the result is smaller than the
input string. When the result is bigger than the input string, e.g.
when converting ISO-8859-1 characters above 0x7F to UTF-8, the
resulting memory usage is as expected.



To demonstrate, the example code initializes an array of 4 UTF-8
strings, which I have named n-tilde, multiplication, cyrillic-i and
invalid. Each 1MB string is repeatedly (for dramatic effect)
transliterated to ASCII, and each resulting string is stored in a
buffer array. The memory usage before and after these repeated
transliterations is recorded and displayed. The difference between the
two therefore closely approximates the memory usage of the buffer
array.



During the transliteration the following occurs:



n-tilde: each 2-byte UTF-8 character, U+00F1, is transliterated to the
2-byte ASCII sequence '~n', so each buffer should use 1MB.

multiplication: each 2-byte UTF-8 character, U+00D7, is transliterated
to the 1-byte ASCII sequence 'x', so each buffer should use 0.5MB.

cyrillic-i: each 2-byte UTF-8 character, U+0438, is ignored since there
is no transliteration. So iconv returns the empty string. Therefore,
each buffer should use 0MB.

invalid: 0xFF is invalid in UTF-8 so iconv stops processing the input
string at the first character, generates an E_NOTICE (which I mask to
make the output more readable) and returns the incomplete result, the
empty string. Therefore, each buffer should use 0MB.



I am aware that storing the array takes ~68 bytes per entry plus the
size of the data. In this case, however, 16 entries plus index strings
only amount to ~1KB, which is insignificant compared to the results.
Keeping this in mind, you would expect the additional memory usage
caused by creating the 16-entry buffer array to be:



~16MB for n-tilde (16 buffers @ 1MB each);

~8MB for multiplication (16 buffers @ 0.5MB each);

~1KB for cyrillic-i (16 buffers @ 0MB each);

~1KB for invalid (16 buffers @ 0MB each).
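
The arithmetic behind these estimates can be sketched as follows. The
helper name expected_buffer_mb is hypothetical, and the per-entry array
overhead (~1KB in total) is deliberately left out, so the two 0MB cases
print as 0 rather than ~1KB.

```php
<?php
// Hypothetical helper: payload of the buffer array in MB, ignoring the
// ~1KB of per-entry array overhead.
function expected_buffer_mb($chars, $bytesPerOutputChar, $repetitions = 16)
{
    return $chars * $bytesPerOutputChar * $repetitions / (1024 * 1024);
}

$chars = 512 * 1024; // 2-byte UTF-8 characters in each 1MB input string

$expected = array(
    'n-tilde'        => expected_buffer_mb($chars, 2),       // 2 output bytes each
    'multiplication' => expected_buffer_mb($chars, 1),       // 1 output byte each
    'cyrillic-i'     => expected_buffer_mb($chars, 0),       // nothing emitted
    'invalid'        => expected_buffer_mb(1024 * 1024, 0),  // nothing emitted
);

foreach ($expected as $name => $mb) {
    echo "{$name}: ~{$mb}MB", PHP_EOL;
}
```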



This ties in very neatly with my expected results, as shown. However,
the actual results are significantly different. As you can see, the
buffer array uses 16MB for every string, i.e. 16 buffers @ 1MB (the
size of the input string). Obviously, this should not be the case: an
array of 16 empty strings, as in the cyrillic-i and invalid tests,
should not use 16MB of memory.

Although I haven't shown it here for brevity, the contents of the
buffer after, for example, the invalid test are indeed 16 empty strings
that behave as empty strings should. strlen reports them as
zero-length, as you would expect, but they still use 1MB each.
Interestingly, if you concatenate all the empty strings together and
save the result in a separate string, that string only uses a few
bytes. So as soon as you perform any string operation on them, the
resulting strings use the expected amount of memory.



So to get the expected results shown here, I simply cast the result of
the iconv call to a string, i.e. $buffer[] = (string)@iconv(...);.
Logically, this should make no difference; after all, I'm casting a
string to a string. But since a cast in PHP is an operator, it returns
a new value: in this case, a new string with the same contents and the
expected memory usage.
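
As a minimal sketch, the workaround could be wrapped like this. The
function name iconv_compact is hypothetical and not part of PHP, and
the claim that the cast yields a right-sized copy is only what was
observed on 5.2.17; other PHP versions may treat the cast as a no-op.

```php
<?php
// Hypothetical wrapper around iconv(); iconv_compact is not a PHP function.
function iconv_compact($from, $to, $input)
{
    $result = @iconv($from, $to, $input);
    if ($result === false) {
        // Pass iconv's failure value through instead of casting it to ''.
        return false;
    }
    // Observed on 5.2.17: the cast returns a new string sized to its contents.
    return (string) $result;
}

$input = str_repeat("\xC3\x97", 512 * 1024); // the 1MB multiplication string
$before = memory_get_usage();
$buffer = array();
for ($i = 0; $i < 16; ++$i) {
    $buffer[] = iconv_compact('UTF-8', 'ASCII//IGNORE//TRANSLIT', $input);
}
$after = memory_get_usage();
echo 'buffer array: ', round(($after - $before) / (1024 * 1024), 4), 'MB', PHP_EOL;
```

Unlike the bare cast, this sketch keeps a false return value distinct
from an empty string, which is a design choice rather than something
the workaround requires.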



You can change the number of repetitions, and/or the input string sizes.
The pattern remains the same. The result strings (if smaller) always end
up using the same amount of memory as the input string. Change the to-
and from- charsets, the pattern remains. Remove the ignore and/or
translit flags, it doesn't matter. You still end up with strings that
take up more space than they should.



I looked at the iconv source code, and to be honest, as I'm not a
developer of PHP or PHP modules, it didn't make a whole lot of sense,
and I didn't spend a whole lot of time trying to get my head around it.
That's for another day/year/life :) I don't know the inner workings of
PHP or how it passes data around, or how that ends up as a PHP value
accessible in PHP script. But I do understand the principles. Anyway, my
best assumption is that when PHP's iconv wrapper is called, an output
buffer the size of the input buffer is created and passed to libiconv.
When libiconv returns, PHP's iconv wrapper then packages that buffer as
a PHP string and makes it accessible to the PHP script. The results
shown here would indicate that nowhere along the way is the output
buffer's memory allocation shrunk to fit the size of the actual data
returned. Therefore, you end up with (for example) an empty PHP string
that actually uses 1MB of memory.



Test script:
---------------
<?php
// Four 1MB inputs; the first three repeat a single 2-byte UTF-8 character.
$strings = array(
    'n-tilde' => str_repeat("\xC3\xB1", 512 * 1024),        // U+00F1
    'multiplication' => str_repeat("\xC3\x97", 512 * 1024), // U+00D7
    'cyrillic-i' => str_repeat("\xD0\xB8", 512 * 1024),     // U+0438
    'invalid' => str_repeat("\xFF", 1024 * 1024),           // not valid UTF-8
);
foreach ($strings as $name => $value) {
    $before = round(memory_get_usage() / (1024 * 1024), 4);
    $buffer = array();
    // Keep all 16 results alive so their memory shows up in $after.
    for ($i = 0; $i < 16; ++$i) {
        $buffer[] = @iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $value);
    }
    $after = round(memory_get_usage() / (1024 * 1024), 4);
    unset($buffer);
    echo "{$name}:  before={$before}MB, after={$after}MB", PHP_EOL;
}



Expected result:
----------------
n-tilde:  before=4.0695MB, after=20.0712MB
multiplication:  before=4.0697MB, after=12.0712MB
cyrillic-i:  before=4.0697MB, after=4.0712MB
invalid:  before=4.0697MB, after=4.0712MB



Actual result:
--------------
n-tilde:  before=4.0694MB, after=20.0715MB
multiplication:  before=4.0696MB, after=20.0716MB
cyrillic-i:  before=4.0696MB, after=20.0716MB
invalid:  before=4.0696MB, after=20.0716MB




------------------------------------------------------------------------


