Dear all,

A colleague of mine described an issue at
https://www.mediawiki.org/wiki/Extension_talk:External_Data#Problems_with_special_caracters_when_retrieving_Data_from_xml
and I was able to track down the root cause.

When retrieving XML (I did not check other formats) and a value contains 
non-ASCII characters, the XML parser does not pass the value to the character 
data handler in one go but calls the handler once for each piece (see 
https://secure.php.net/manual/en/function.xml-set-character-data-handler.php). 
So, if the value is "grün" or "journée", the data handler is called twice for 
that value (1. "gr", 2. "ün", or 1. "journ", 2. "ée", respectively).
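
To see the pieces for yourself, something like the following should do. It 
is only a sketch: logChunks() is a name I made up, it is not part of 
External Data, and how many pieces you get may depend on the PHP/libxml2 
build.

```php
<?php
// Illustrative only: logChunks() is a hypothetical helper, not extension code.
function logChunks( string $xml ): array {
    $chunks = [];
    $parser = xml_parser_create();
    xml_set_character_data_handler(
        $parser,
        function ( $parser, $content ) use ( &$chunks ) {
            // One array entry per handler call, so splits become visible.
            $chunks[] = $content;
        }
    );
    xml_parse( $parser, $xml, true );
    return $chunks;
}

// The number of chunks may differ between builds; their concatenation
// is always the full text.
var_dump( logChunks( '<color>grün</color>' ) );
```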

The data handler that get_web_data uses for XML is ED_Utils::getContent (see 
https://github.com/wikimedia/mediawiki-extensions-ExternalData/blob/master/ED_Utils.php).
The current implementation of getContent adds a new element to the 
$edgXMLValues[$edgCurrentXMLTag] array every time it is called. As a result, 
each piece of a split value ends up as a separate array element.

My understanding is that only repeated elements (multiple values) should end 
up in separate array entries:

  <colors>
    <color>blau</color>
    <color>grün</color>
    <color>rot</color>
  </colors>

should end up as

$edgXMLValues['color'][0] = 'blau'
$edgXMLValues['color'][1] = 'grün'
$edgXMLValues['color'][2] = 'rot'

and

  <greetings>
    <greeting>Bonne journée</greeting>
    <greeting>Bonne soirée</greeting>
  </greetings>

should end up as

$edgXMLValues['greeting'][0] = 'Bonne journée'
$edgXMLValues['greeting'][1] = 'Bonne soirée'

However, the current implementation returns:

$edgXMLValues['color'][0] = 'blau'
$edgXMLValues['color'][1] = 'gr'
$edgXMLValues['color'][2] = 'ün'
$edgXMLValues['color'][3] = 'rot'

$edgXMLValues['greeting'][0] = 'Bonne journ'
$edgXMLValues['greeting'][1] = 'ée'
$edgXMLValues['greeting'][2] = 'Bonne soir'
$edgXMLValues['greeting'][3] = 'ée'


IMHO, getContent should check whether it is being called again for the same 
XML element and, if so, append the content to the last value instead of 
adding a new one.
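
A minimal sketch of what I have in mind, written as a standalone function 
rather than as a patch. parseXmlValues(), $values, $currentTag and 
$isNewValue are made-up names for illustration (the extension uses globals 
such as $edgXMLValues), and I switch off case folding so that tag names keep 
their case:

```php
<?php
// Sketch only: hypothetical stand-in for the extension's handlers.
function parseXmlValues( string $xml ): array {
    $values = [];        // tag name => list of values
    $currentTag = null;  // element whose text is currently being read
    $isNewValue = false; // true until the first chunk of an element arrives

    $parser = xml_parser_create();
    xml_parser_set_option( $parser, XML_OPTION_CASE_FOLDING, 0 );

    xml_set_element_handler(
        $parser,
        function ( $parser, $name, $attrs ) use ( &$currentTag, &$isNewValue ) {
            $currentTag = $name;
            $isNewValue = true; // the next chunk starts a new array entry
        },
        function ( $parser, $name ) use ( &$currentTag ) {
            $currentTag = null; // text outside an element's start is ignored
        }
    );

    xml_set_character_data_handler(
        $parser,
        function ( $parser, $content ) use ( &$values, &$currentTag, &$isNewValue ) {
            if ( $currentTag === null ) {
                return;
            }
            if ( $isNewValue ) {
                // First chunk of this element: open a new entry.
                $values[$currentTag][] = $content;
                $isNewValue = false;
            } else {
                // Further chunk of the same element: append to the last entry.
                $last = count( $values[$currentTag] ) - 1;
                $values[$currentTag][$last] .= $content;
            }
        }
    );

    if ( xml_parse( $parser, $xml, true ) === 0 ) {
        throw new RuntimeException(
            xml_error_string( xml_get_error_code( $parser ) )
        );
    }
    return $values;
}
```

With the <colors> example from above, $values['color'] should then hold the 
three entries 'blau', 'grün' and 'rot', no matter how the parser splits the 
text. Note that in this sketch the whitespace between tags is stored under 
the parent tag ('colors'), and text that follows a child element inside its 
parent is dropped; a real patch would have to decide how to treat both.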

Best regards
  Christian