From:             gros at mpdl dot mpg dot de
Operating system: Mac OS-X 10.6.2
PHP version:      5.3.0
PHP Bug Type:     XML Reader
Bug description:  Text in UTF-8 encoded XML is cut off by the XML parser at German umlauts

Description:
------------
When parsing an XML file with UTF-8 encoding (like this one:
http://bit.ly/3PSi44), text containing German umlauts is cut off:

original:
<e:organization-name>Kaiser Wilhelm Institut für
Züchtungsforschung</e:organization-name>

result after parsing:
"Kaiser Wilhelm Institut f"

or parsing this
<dc:publisher>Societäts-Verlag</dc:publisher>

results in "äts-Verlag"


Reproduce code:
---------------
// Fetch the UTF-8 encoded XML document.
$snippet = file_get_contents("http://bit.ly/3PSi44");

// An empty encoding string lets the parser detect the input encoding.
if (!($xml_parser = xml_parser_create("")))
    die("Couldn't create parser.");

xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');

// The three handler functions are not included in this report.
xml_set_element_handler($xml_parser, "startElementHandler", "endElementHandler");
xml_set_character_data_handler($xml_parser, "characterDataHandler");

$retstr = "";
if (!xml_parse($xml_parser, $snippet)) {
    $retstr = sprintf("XML error: %s at line %d",
        xml_error_string(xml_get_error_code($xml_parser)),
        xml_get_current_line_number($xml_parser));
}
xml_parser_free($xml_parser);
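
The handler functions are not shown in the reproduce code, so the following is only a sketch of one thing worth checking, not a confirmed diagnosis. PHP's character data handler may be called several times for a single text node, and a handler that assigns instead of appending would keep only one piece of the text. The class TextCollector, its method names, and the inline sample document (with a made-up namespace declaration) are assumptions introduced for illustration.

```php
<?php
// Hypothetical collector; the report does not show its own handlers.
class TextCollector
{
    public $texts = array(); // finished element texts, keyed by element name
    private $buffer = '';    // text gathered for the element currently open

    public function startElement($parser, $name, $attrs)
    {
        $this->buffer = ''; // begin a fresh buffer (leaf elements only)
    }

    public function characterData($parser, $data)
    {
        // Append rather than assign: one text node may arrive in pieces.
        $this->buffer .= $data;
    }

    public function endElement($parser, $name)
    {
        $this->texts[$name] = $this->buffer;
        $this->buffer = '';
    }
}

// Inline stand-in for the remote file. "\xC3\xA4" is the UTF-8 encoding of
// the umlaut a-diaeresis, written as bytes so the source file encoding
// does not matter.
$snippet = '<?xml version="1.0" encoding="UTF-8"?>'
         . '<dc:publisher xmlns:dc="http://purl.org/dc/elements/1.1/">'
         . "Societ\xC3\xA4ts-Verlag"
         . '</dc:publisher>';

$collector  = new TextCollector();
$xml_parser = xml_parser_create('UTF-8');
xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');
xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, 0); // keep names as written
xml_set_element_handler(
    $xml_parser,
    array($collector, 'startElement'),
    array($collector, 'endElement')
);
xml_set_character_data_handler($xml_parser, array($collector, 'characterData'));

if (!xml_parse($xml_parser, $snippet, true)) {
    throw new RuntimeException(sprintf(
        "XML error: %s at line %d",
        xml_error_string(xml_get_error_code($xml_parser)),
        xml_get_current_line_number($xml_parser)
    ));
}

echo $collector->texts['dc:publisher'], "\n";
```

Because the collector appends every piece it receives, the stored text is complete regardless of how the parser splits the node. If the original handlers already append in this way, the truncation has a different cause.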




Expected result:
----------------
I expect the text to be imported in full, as outlined in the description:

parsing this:
<e:organization-name>Kaiser Wilhelm Institut für
Züchtungsforschung</e:organization-name>

should result in:
"Kaiser Wilhelm Institut für Züchtungsforschung"

or parsing this
<dc:publisher>Societäts-Verlag</dc:publisher>

should result in "Societäts-Verlag"

Actual result:
--------------
I get cut-off pieces of text when the text contains German umlauts (see
two examples in the description).

parsing this:
<e:organization-name>Kaiser Wilhelm Institut für
Züchtungsforschung</e:organization-name>

results in:
"Kaiser Wilhelm Institut f"

or parsing this
<dc:publisher>Societäts-Verlag</dc:publisher>

results in "äts-Verlag"

-- 
Edit bug report at http://bugs.php.net/?id=50139&edit=1
