ID: 20809
User updated by: flying at dom dot natm dot ru
Reported By: flying at dom dot natm dot ru
Status: Closed
Bug Type: Feature/Change Request
Operating System: All
PHP Version: 4.3.0RC2
New Comment:
Below is a PHP example of what such code might look like. It converts
a given string from UTF-8 into the specified encoding.
Note the difference between utf8ToEntities() and
utf8ToEntitiesMultibyte(): the first function converts every non-ASCII
char in a string into a numeric entity, while the second converts only
chars with codes 0x0800 and above. This makes it possible, for example,
to get back a normal string with a single numeric entity when there is
only one unconvertible character in it.
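As a rough illustration of that difference, here is a minimal sketch
(not the code from this comment). utf8EntitiesFrom() is a hypothetical
helper that turns every character at or above a given code point into a
decimal entity. It assumes valid UTF-8 input of at most four bytes per
character and a PCRE build with UTF-8 support. With a threshold of 0x80
it behaves roughly like utf8ToEntities(), with 0x0800 roughly like
utf8ToEntitiesMultibyte().

```php
<?php
// Hypothetical helper, for illustration only.
// Replaces each character with code point >= $minCodePoint by "&#N;".
// Returns null if $str is not valid UTF-8 (preg_replace_callback fails).
function utf8EntitiesFrom($str, $minCodePoint)
{
    return preg_replace_callback(
        '/[^\x00-\x7F]/u',               // one non-ASCII character per match
        function ($m) use ($minCodePoint) {
            $bytes = array_values(unpack('C*', $m[0]));
            $n = count($bytes);          // 2, 3 or 4 bytes
            // Payload bits of the lead byte: mask 0x1F, 0x0F or 0x07.
            $cp = $bytes[0] & (0xFF >> ($n + 1));
            // Each continuation byte contributes its low 6 bits.
            for ($k = 1; $k < $n; $k++) {
                $cp = ($cp << 6) | ($bytes[$k] & 0x3F);
            }
            return ($cp >= $minCodePoint) ? '&#' . $cp . ';' : $m[0];
        },
        $str
    );
}

$s = "a\xC3\xA9\xE2\x82\xAC";            // "a", U+00E9, U+20AC
echo utf8EntitiesFrom($s, 0x80), "\n";   // a&#233;&#8364;
echo utf8EntitiesFrom($s, 0x0800), "\n"; // "a", the raw bytes of U+00E9, &#8364;
```

With the higher threshold the two-byte character is left as UTF-8, so
iconv() still has a chance to convert it into the target encoding.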
// Convert a string from UTF-8 into the specified encoding and replace
// unconvertible characters with numeric entities
// Input:
// $str - string to convert
// $encoding - name of the target encoding
function fromUTF8($str,$encoding)
{
    if ($str===null)
        return(null);
    $t = iconv('utf-8',$encoding,$str);
    if (($t=='') && ($str!=''))
    {
        // iconv() is unable to convert this string into the requested
        // encoding. First try to convert only the characters that take
        // three or more bytes. This may let us return text in the
        // requested encoding with only a few very special chars as
        // entities, instead of converting all of the text into entities.
        $str2 = utf8ToEntitiesMultibyte($str);
        $t = iconv('utf-8',$encoding,$str2);
        if ($t!='')
            return($t);
        else
            return(utf8ToEntities($str));
    }
    return($t);
}
// Convert the characters of a UTF-8 encoded string that take three or
// more bytes into numeric entities (UTF-8 as described in RFC 2044)
// Input:
// $str - UTF-8 encoded string
function utf8ToEntitiesMultibyte($str)
{
    if (!is_string($str))
        return('');
    $i = 0;
    $output = '';
    while($i<strlen($str))
    {
        $char = $str[$i];
        if ((ord($char) & 0x80)==0)
        {
            // 0000 0000-0000 007F 0xxxxxxx
            $output .= $char;
            $i++;
        }
        elseif ((ord($char)>=0xC0) && (ord($char)<=0xDF))
        {
            // 0000 0080-0000 07FF 110xxxxx 10xxxxxx
            // Two-byte characters are copied unchanged
            $output .= substr($str,$i,2);
            $i += 2;
        }
        else
        {
            $num = 0;
            if ((ord($char) & 0xFC)==0xFC)
            {
                // 0400 0000-7FFF FFFF 1111110x 10xxxxxx 10xxxxxx 10xxxxxx
                //                     10xxxxxx 10xxxxxx
                $num = (ord($str[$i+5]) & 0x3F) |
                      ((ord($str[$i+4]) & 0x3F) << 6 ) |
                      ((ord($str[$i+3]) & 0x3F) << 12) |
                      ((ord($str[$i+2]) & 0x3F) << 18) |
                      ((ord($str[$i+1]) & 0x3F) << 24) |
                      ((ord($str[$i+0]) & 0x01) << 30);
                $i += 6;
            }
            elseif ((ord($char) & 0xF8)==0xF8)
            {
                // 0020 0000-03FF FFFF 111110xx 10xxxxxx 10xxxxxx 10xxxxxx
                //                     10xxxxxx
                $num = (ord($str[$i+4]) & 0x3F) |
                      ((ord($str[$i+3]) & 0x3F) << 6 ) |
                      ((ord($str[$i+2]) & 0x3F) << 12) |
                      ((ord($str[$i+1]) & 0x3F) << 18) |
                      ((ord($str[$i+0]) & 0x03) << 24);
                $i += 5;
            }
            elseif ((ord($char) & 0xF0)==0xF0)
            {
                // 0001 0000-001F FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                $num = (ord($str[$i+3]) & 0x3F) |
                      ((ord($str[$i+2]) & 0x3F) << 6 ) |
                      ((ord($str[$i+1]) & 0x3F) << 12) |
                      ((ord($str[$i+0]) & 0x07) << 18);
                $i += 4;
            }
            elseif ((ord($char) & 0xE0)==0xE0)
            {
                // 0000 0800-0000 FFFF 1110xxxx 10xxxxxx 10xxxxxx
                $num = (ord($str[$i+2]) & 0x3F) |
                      ((ord($str[$i+1]) & 0x3F) << 6 ) |
                      ((ord($str[$i+0]) & 0x0F) << 12);
                $i += 3;
            }
            else
            {
                // We should never get here unless the passed string is
                // not UTF-8, but without this branch we risk falling
                // into an endless loop
                $num = ord($char);
                $i++;
            }
            $output .= '&#'.$num.';';
        }
    }
    return($output);
}
// Convert all non-ASCII characters of a UTF-8 encoded string into
// numeric entities (UTF-8 as described in RFC 2044)
// Input:
// $str - UTF-8 encoded string
function utf8ToEntities($str)
{
    if (!is_string($str))
        return('');
    $i = 0;
    $output = '';
    while($i<strlen($str))
    {
        $char = $str[$i];
        if ((ord($char) & 0x80)==0)
        {
            // 0000 0000-0000 007F 0xxxxxxx
            $output .= $char;
            $i++;
        }
        else
        {
            $num = 0;
            if ((ord($char) & 0xFC)==0xFC)
            {
                // 0400 0000-7FFF FFFF 1111110x 10xxxxxx 10xxxxxx 10xxxxxx
                //                     10xxxxxx 10xxxxxx
                $num = (ord($str[$i+5]) & 0x3F) |
                      ((ord($str[$i+4]) & 0x3F) << 6 ) |
                      ((ord($str[$i+3]) & 0x3F) << 12) |
                      ((ord($str[$i+2]) & 0x3F) << 18) |
                      ((ord($str[$i+1]) & 0x3F) << 24) |
                      ((ord($str[$i+0]) & 0x01) << 30);
                $i += 6;
            }
            elseif ((ord($char) & 0xF8)==0xF8)
            {
                // 0020 0000-03FF FFFF 111110xx 10xxxxxx 10xxxxxx 10xxxxxx
                //                     10xxxxxx
                $num = (ord($str[$i+4]) & 0x3F) |
                      ((ord($str[$i+3]) & 0x3F) << 6 ) |
                      ((ord($str[$i+2]) & 0x3F) << 12) |
                      ((ord($str[$i+1]) & 0x3F) << 18) |
                      ((ord($str[$i+0]) & 0x03) << 24);
                $i += 5;
            }
            elseif ((ord($char) & 0xF0)==0xF0)
            {
                // 0001 0000-001F FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                $num = (ord($str[$i+3]) & 0x3F) |
                      ((ord($str[$i+2]) & 0x3F) << 6 ) |
                      ((ord($str[$i+1]) & 0x3F) << 12) |
                      ((ord($str[$i+0]) & 0x07) << 18);
                $i += 4;
            }
            elseif ((ord($char) & 0xE0)==0xE0)
            {
                // 0000 0800-0000 FFFF 1110xxxx 10xxxxxx 10xxxxxx
                $num = (ord($str[$i+2]) & 0x3F) |
                      ((ord($str[$i+1]) & 0x3F) << 6 ) |
                      ((ord($str[$i+0]) & 0x0F) << 12);
                $i += 3;
            }
            elseif ((ord($char) & 0xC0)==0xC0)
            {
                // 0000 0080-0000 07FF 110xxxxx 10xxxxxx
                $num = (ord($str[$i+1]) & 0x3F) |
                      ((ord($str[$i+0]) & 0x1F) << 6 );
                $i += 2;
            }
            else
            {
                // We should never get here unless the passed string is
                // not UTF-8, but without this branch we risk falling
                // into an endless loop
                $num = ord($char);
                $i++;
            }
            $output .= '&#'.$num.';';
        }
    }
    return($output);
}
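The functions above retry the conversion on the whole string. A
per-character variant, closer to the "convert only what cannot be
converted" behaviour requested in this report, might look like the
sketch below. convertOrEntity() is a hypothetical name. The sketch
assumes valid UTF-8 input, a stateless ASCII-compatible target encoding,
and an iconv() that returns false for a character it cannot convert
(some implementations substitute a placeholder instead, in which case no
entity is produced).

```php
<?php
// Sketch only: convert a UTF-8 string one character at a time and
// replace each character that iconv() refuses with a numeric entity.
// convertOrEntity() is a hypothetical helper, not part of PHP.
function convertOrEntity($str, $encoding)
{
    // Split into UTF-8 characters; false means $str is not valid UTF-8.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return false;
    }
    $out = '';
    foreach ($chars as $ch) {
        // @ hides the notice iconv() raises for an unconvertible char.
        $converted = @iconv('UTF-8', $encoding, $ch);
        if ($converted !== false && $converted !== '') {
            $out .= $converted;
            continue;
        }
        // Decode the code point by hand from the UTF-8 bytes.
        $bytes = array_values(unpack('C*', $ch));
        $n = count($bytes);
        $cp = ($n === 1) ? $bytes[0] : ($bytes[0] & (0xFF >> ($n + 1)));
        for ($k = 1; $k < $n; $k++) {
            $cp = ($cp << 6) | ($bytes[$k] & 0x3F);
        }
        $out .= '&#' . $cp . ';';
    }
    return $out;
}

// U+00E9 exists in ISO-8859-1 and is converted; U+20AC does not, so it
// is expected to come back as an entity where iconv() reports failure.
$latin1 = convertOrEntity("caf\xC3\xA9 \xE2\x82\xAC", 'ISO-8859-1');
```

Calling iconv() once per character is slow on long strings, so this is
meant to show the idea rather than to be used as is.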
Previous Comments:
------------------------------------------------------------------------
[2003-07-02 13:47:40] Xuefer at 21cn dot com
It is said that libxml2 does it this way (converting into numeric
entities) using iconv, which suggests it is possible, but I'm not sure.
If it is possible, I guess it should be OK for PHP itself to implement
"//IGNORE":
simply scan for //IGNORE itself, then keep copying whenever it gets an
unconvertible-character error.
It's badly needed to avoid truncating the content when only 1 char is
unconvertible.
Many thanks
------------------------------------------------------------------------
[2002-12-04 23:43:32] [EMAIL PROTECTED]
You can achieve that by appending "//IGNORE" after the codeset name to
which the string is going to be converted.
For example:
<?php
$bar = iconv("UTF-8", "KOI8-R//IGNORE", $foo);
?>
Note that this is not portable since most of the iconv implementations
don't support it. As far as I know, only glibc's iconv can handle
this.
------------------------------------------------------------------------
[2002-12-04 08:16:39] flying at dom dot natm dot ru
It would be very useful to have support for the -c and -s options of
the iconv command-line tool as optional arguments to the iconv()
function.
It would also be especially useful for XML-related code to have an
option to convert all unconvertible characters into numeric entities.
Thank you all for your work!
------------------------------------------------------------------------