ID:               20809
 User updated by:  flying at dom dot natm dot ru
 Reported By:      flying at dom dot natm dot ru
 Status:           Closed
 Bug Type:         Feature/Change Request
 Operating System: All
 PHP Version:      4.3.0RC2
 New Comment:

Below is a PHP example of what such code might look like. It converts a given
string from UTF-8 into the specified encoding.
Note the difference between utf8ToEntities() and
utf8ToEntitiesMultibyte(): the first function converts every non-ASCII
character in a string into a numeric entity, while the second converts only
characters with codes of 0x0800 and above. This makes it possible, for example,
to get back a normal string with a single numeric entity when there
is only one unconvertible character in it.
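
To make the difference concrete, here is a compact sketch, separate from the
functions below, in which the threshold is a parameter. The helper names
utf8CodePoint() and utf8ToEntitiesFrom() are hypothetical and do not appear in
the code of this comment. The sketch assumes well-formed UTF-8 of at most four
bytes per character, a PCRE build with UTF-8 support, and PHP 5.3 or later
(for the closure).

```php
<?php
// Hypothetical helper: decode one well-formed UTF-8 character
// (1 to 4 bytes) into its code point.
function utf8CodePoint($ch)
{
    $b0 = ord($ch[0]);
    if ($b0 < 0x80) {
        return $b0;
    }
    $len = strlen($ch);
    // Keep only the payload bits of the lead byte: 5, 4 or 3 bits
    // for a 2-, 3- or 4-byte sequence respectively.
    $num = $b0 & (0xFF >> ($len + 1));
    // Each continuation byte contributes its low 6 bits.
    for ($k = 1; $k < $len; $k++) {
        $num = ($num << 6) | (ord($ch[$k]) & 0x3F);
    }
    return $num;
}

// Hypothetical helper: replace every character whose code point is
// >= $threshold with a decimal numeric entity; copy the rest unchanged.
// Returns null if $str is not valid UTF-8 (preg_replace_callback fails).
function utf8ToEntitiesFrom($str, $threshold)
{
    return preg_replace_callback(
        '/[^\x00-\x7F]/u',
        function ($m) use ($threshold) {
            $cp = utf8CodePoint($m[0]);
            return $cp >= $threshold ? '&#' . $cp . ';' : $m[0];
        },
        $str
    );
}

$s = "\xC3\xA9\xE2\x82\xAC";               // U+00E9 followed by U+20AC
echo utf8ToEntitiesFrom($s, 0x80), "\n";   // both characters become entities
echo utf8ToEntitiesFrom($s, 0x800), "\n";  // only U+20AC becomes an entity
```

A threshold of 0x80 corresponds to what utf8ToEntities() does, and 0x800 to
what utf8ToEntitiesMultibyte() does: in the second call the two-byte character
stays as UTF-8, so iconv() still has a chance to convert it.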

// Convert a string from UTF-8 into the specified encoding and substitute
// unconvertible characters with numeric entities
// On entry:
//   $str      - string to convert
//   $encoding - name of the target encoding
    function fromUTF8($str,$encoding)
    {
        if ($str===null)
            return(null);
        $t = iconv('utf-8',$encoding,$str);
        if (($t=='') && ($str!=''))
// iconv() is unable to convert this string into requested encoding.
        {
// First try to convert only the multibyte characters. This may let us
// return text in the requested encoding with only a few very special
// characters as entities, instead of converting all the text into entities.
            $str2 = utf8ToEntitiesMultibyte($str);
            $t = iconv('utf-8',$encoding,$str2);
            if ($t!='')
                return($t);
            else
                return(utf8ToEntities($str));
        };
        return($t);
    }

// Convert multibyte characters found in a UTF-8 encoded string into
// numeric entities (UTF-8 as described in RFC 2044). Two-byte characters
// are copied unchanged.
// On entry:
//   $str - UTF-8 encoded string
    function utf8ToEntitiesMultibyte($str)
    {
        if (!is_string($str))
            return('');
        $i = 0;
        $output = '';
        while($i<strlen($str))
        {
            $char = $str{$i};
            if ((ord($char) & 0x80)==0)
//   0000 0000-0000 007F   0xxxxxxx
                {
                    $output .= $char;
                    $i++;
                }
            elseif ((ord($char)>=0xC0) && (ord($char)<=0xDF))
//   0000 0080-0000 07FF   110xxxxx 10xxxxxx
                {
                    $output .= substr($str,$i,2);
                    $i += 2;
                }
            else
                {
                    $num = 0;
                    if ((ord($char) & 0xFC)==0xFC)
//   0400 0000-7FFF FFFF   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
//                         10xxxxxx
                        {
                            $num = (ord($str{$i+5}) & 0x3F) |
                                  ((ord($str{$i+4}) & 0x3F) << 6 ) |
                                  ((ord($str{$i+3}) & 0x3F) << 12) |
                                  ((ord($str{$i+2}) & 0x3F) << 18) |
                                  ((ord($str{$i+1}) & 0x3F) << 24) |
                                  ((ord($str{$i+0}) & 0x01) << 30);
                            $i += 6;
                        }
                    elseif ((ord($char) & 0xF8)==0xF8)
//   0020 0000-03FF FFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx
//                         10xxxxxx
                        {
                            $num = (ord($str{$i+4}) & 0x3F) |
                                  ((ord($str{$i+3}) & 0x3F) << 6 ) |
                                  ((ord($str{$i+2}) & 0x3F) << 12) |
                                  ((ord($str{$i+1}) & 0x3F) << 18) |
                                  ((ord($str{$i+0}) & 0x03) << 24);
                            $i += 5;
                        }
                    elseif ((ord($char) & 0xF0)==0xF0)
//   0001 0000-001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                        {
                            $num = (ord($str{$i+3}) & 0x3F) |
                                  ((ord($str{$i+2}) & 0x3F) << 6 ) |
                                  ((ord($str{$i+1}) & 0x3F) << 12) |
                                  ((ord($str{$i+0}) & 0x07) << 18);
                            $i += 4;
                        }
                    elseif ((ord($char) & 0xE0)==0xE0)
//   0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx
                        {
                            $num = (ord($str{$i+2}) & 0x3F) |
                                  ((ord($str{$i+1}) & 0x3F) << 6 ) |
                                  ((ord($str{$i+0}) & 0x0F) << 12);
                            $i += 3;
                        }
                    else
// We should never get here unless the passed string is not UTF-8,
// but without this branch we risk falling into an endless loop
                        {
                            $num = ord($char);
                            $i++;
                        };
                    $output .= '&#'.$num.';';
                };
        };
        return($output);
    }

// Convert all non-ASCII characters of a UTF-8 encoded string into
// numeric entities (UTF-8 as described in RFC 2044)
// On entry:
//   $str - UTF-8 encoded string
    function utf8ToEntities($str)
    {
        if (!is_string($str))
            return('');
        $i = 0;
        $output = '';
        while($i<strlen($str))
        {
            $char = $str{$i};
            if ((ord($char) & 0x80)==0)
//   0000 0000-0000 007F   0xxxxxxx
                {
                    $output .= $char;
                    $i++;
                }
            else
                {
                    $num = 0;
                    if ((ord($char) & 0xFC)==0xFC)
//   0400 0000-7FFF FFFF   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
//                         10xxxxxx
                        {
                            $num = (ord($str{$i+5}) & 0x3F) |
                                  ((ord($str{$i+4}) & 0x3F) << 6 ) |
                                  ((ord($str{$i+3}) & 0x3F) << 12) |
                                  ((ord($str{$i+2}) & 0x3F) << 18) |
                                  ((ord($str{$i+1}) & 0x3F) << 24) |
                                  ((ord($str{$i+0}) & 0x01) << 30);
                            $i += 6;
                        }
                    elseif ((ord($char) & 0xF8)==0xF8)
//   0020 0000-03FF FFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx
//                         10xxxxxx
                        {
                            $num = (ord($str{$i+4}) & 0x3F) |
                                  ((ord($str{$i+3}) & 0x3F) << 6 ) |
                                  ((ord($str{$i+2}) & 0x3F) << 12) |
                                  ((ord($str{$i+1}) & 0x3F) << 18) |
                                  ((ord($str{$i+0}) & 0x03) << 24);
                            $i += 5;
                        }
                    elseif ((ord($char) & 0xF0)==0xF0)
//   0001 0000-001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                        {
                            $num = (ord($str{$i+3}) & 0x3F) |
                                  ((ord($str{$i+2}) & 0x3F) << 6 ) |
                                  ((ord($str{$i+1}) & 0x3F) << 12) |
                                  ((ord($str{$i+0}) & 0x07) << 18);
                            $i += 4;
                        }
                    elseif ((ord($char) & 0xE0)==0xE0)
//   0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx
                        {
                            $num = (ord($str{$i+2}) & 0x3F) |
                                  ((ord($str{$i+1}) & 0x3F) << 6 ) |
                                  ((ord($str{$i+0}) & 0x0F) << 12);
                            $i += 3;
                        }
                    elseif ((ord($char) & 0xC0)==0xC0)
//   0000 0080-0000 07FF   110xxxxx 10xxxxxx
                        {
                            $num = (ord($str{$i+1}) & 0x3F) |
                                  ((ord($str{$i+0}) & 0x1F) << 6 );
                            $i += 2;
                        }
                    else
// We should never get here unless the passed string is not UTF-8,
// but without this branch we risk falling into an endless loop
                        {
                            $num = ord($char);
                            $i++;
                        };
                    $output .= '&#'.$num.';';
                };
        };
        return($output);
    }
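
Both decoding loops above read up to five bytes past the lead byte without
checking the string length, so a truncated or otherwise malformed sequence
yields wrong entities. One possible guard, sketched here under the assumption
of a PCRE build with UTF-8 support, is to reject such input before converting.
The name isWellFormedUtf8() is hypothetical and not part of the code above.

```php
<?php
// Hypothetical guard: returns true only for well-formed UTF-8.
// With the /u modifier PCRE validates the subject string before matching,
// so preg_match() does not return 1 when the bytes are not valid UTF-8.
function isWellFormedUtf8($str)
{
    return is_string($str) && preg_match('//u', $str) === 1;
}

var_dump(isWellFormedUtf8("\xE2\x82\xAC")); // complete three-byte sequence
var_dump(isWellFormedUtf8("\xE2\x82"));     // truncated: last byte missing
```

Note that current PCRE versions follow the later, stricter definition of
UTF-8, so the five- and six-byte forms handled by the functions above would be
rejected by this check.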


Previous Comments:
------------------------------------------------------------------------

[2003-07-02 13:47:40] Xuefer at 21cn dot com

It is said that libxml2 does it this way (conversion into numeric entities)
using iconv, which means it's possible, but I'm not sure.

If it's possible, I guess it should be OK for PHP itself to implement
"//IGNORE":
simply scan for //IGNORE itself, then do the copying whenever an
unconvertible-character error is returned.

This is badly needed to avoid truncating the content when only one character
is unconvertible.
Many thanks

------------------------------------------------------------------------

[2002-12-04 23:43:32] [EMAIL PROTECTED]

You can achieve that by appending "//IGNORE" after the codeset name to
which the string is going to be converted.

For example:
<?php
  $bar = iconv("UTF-8", "KOI8-R//IGNORE", $foo);
?>

Note that this is not portable since most of the iconv implementations
don't support it. As far as I know, only glibc's iconv can handle
this.


------------------------------------------------------------------------

[2002-12-04 08:16:39] flying at dom dot natm dot ru

 It would be very useful to have the -c and -s options of the iconv
command-line tool available as optional arguments to the iconv()
function.
 It would also be especially useful for XML-related code to have an
option to convert all unconvertible characters into numeric entities.

 Thank you all for your work!

------------------------------------------------------------------------


-- 
Edit this bug report at http://bugs.php.net/?id=20809&edit=1
