Hello,

The following is a draft idea about how to design a portable native/Unicode
string API. Before I proceed with the real idea I want to investigate two
low-level issues:

First some considerations regarding wide string literals (L"..."):

|ISO-C-89: 6.4.5:
|5 [...]; for wide string literals, the array elements have type wchar_t, and are
|initialized with the sequence of wide characters corresponding to the multibyte character
|sequence, as defined by the mbstowcs function with an implementation-defined current
|locale. [...]
|
|wchar_t is normally either unsigned short or unsigned int. A fixed unsigned
|short would work in most cases, but wchar_t, whose width varies, does not.
|
|ISO-C-89: 7.17 Common definitions <stddef.h>:
|2 [...] and wchar_t which is an integer type whose range of values can represent distinct
|codes for all members of the largest extended character set specified among the supported
|locales; the null character shall have the code value zero and each member of the basic
|character set defined in 5.2.1 shall have a code value equal to its value when used as the
|lone character in an integer character constant.

My Linux and Windows systems use this:
typedef unsigned short wchar_t;

However my X11 comes with this:
typedef unsigned long wchar_t;

Also, in the unsigned short case we do not know whether that means UCS2-LE,
UCS2-BE or UTF-16. So we should forget about L"...".
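
One way to avoid L"..." is to spell static constants out in a fixed-width
unit type. The following is only a minimal sketch and not part of the
proposal: sk_u16 stands in for a 16-bit type such as ICU's UChar, and
sk_parent and sk_u16_len are hypothetical names. It assumes an
ASCII-compatible execution character set, so that 'p' has the value 0x70.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a 16-bit code unit type such as ICU's UChar (assumption). */
typedef uint16_t sk_u16;

/* "parent" as 16-bit code units in host byte order, written without L"...".
 * Assumes an ASCII-compatible execution character set. */
static const sk_u16 sk_parent[] = { 'p', 'a', 'r', 'e', 'n', 't', 0 };

/* Count code units up to, but not including, the terminating zero. */
static size_t sk_u16_len(const sk_u16 *s)
{
    size_t n = 0;
    while (s[n] != 0) {
        n++;
    }
    return n;
}
```

Unlike sizeof(L"parent"), the size of sk_parent does not depend on how the
platform defines wchar_t.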

Now considering anonymous struct/union:

In ISO-C89 anonymous struct/union is defined, but Andi thinks it is better
not to use it because he believes there were problems some time ago.

|6.7.2.1 Structure and union specifiers
|Syntax
|1
|
|struct-or-union-specifier:
|                          struct-or-union identifier(opt) { struct-declaration-list }
|                          struct-or-union identifier
|
|struct-or-union:
|                struct
|                union
|
|[...]

Note the (opt) here. In C++ (ISO/IEC 14882:1998(E)) unnamed structs/classes
are not mentioned at all. Unnamed unions, however, are explicitly described
and used. The common understanding of the ISO standard is that this means
unnamed structs/classes are forbidden. So anonymous unions should work and
could be used to simplify the code below by dropping some of the u's.
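
To illustrate what "dropping some u's" would look like, here is a small
sketch with made-up names (sk_named, sk_anon, sk_same_pointer). It assumes a
compiler that accepts anonymous unions inside structs: they are standard in
C11 and in C++, and a common extension in older C compilers.

```c
#include <stdint.h>

typedef uint16_t sk_uchar;   /* stand-in for UChar (assumption) */

/* Named union member: every access goes through .u */
typedef struct {
    union {
        char     *s;
        sk_uchar *us;
    } u;
    int32_t len;
} sk_named;

/* Anonymous union: the members are reached directly. */
typedef struct {
    union {
        char     *s;
        sk_uchar *us;
    };
    int32_t len;
} sk_anon;

/* Store the same pointer in both forms and compare what comes back. */
static int sk_same_pointer(char *s)
{
    sk_named n;
    sk_anon  a;

    n.u.s = s;   /* named form needs the extra .u */
    n.len = 0;
    a.s   = s;   /* anonymous form does not */
    a.len = 0;
    return n.u.s == a.s;
}
```

Only the access path changes between the two forms; the members themselves
are declared identically.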

The first idea, finally, is to provide a simplified struct that holds a
string in either native (8-bit) or Unicode (2-byte/UTF-16) form plus a type
designator (IS_*). This is currently not possible, so we have to change the
data layout of zvals a bit.

First I'll provide a zstr type consisting of type, string pointer and
length. Then I will modify everything so that we can deal with the zstr type
as part of a zval as well as use it alone in a string handling API. In other
words, the zstr is the simplified string type as discussed during PDM.

typedef struct _zstr_value {
        union {
                char *val;              /* 8-bit legacy string type */
                UChar *uval;            /* Unicode string type */
        } u;
        int32_t len;
} zstr_value;

typedef union _zvalue_value {
        long lval;                                      /* long value */
        double dval;                            /* double value */
        zstr_value str;
        HashTable *ht;                          /* hash table value */
        zend_object_value obj;
} zvalue_value;


typedef struct _zval_string {
        zend_uchar type;                        /* active type */
        zstr_value value;                       /* string value */
} zstr;

typedef struct _zval_data {
        zend_uchar type;                        /* active type */
        zvalue_value value;                     /* value */
} zdata;

typedef struct _zval_struct zval;

struct _zval_struct {
        /* Variable information */
        union {
                zdata data;             /* value */
                zstr  str;
        } u;
        zend_uchar is_ref;
        zend_uint refcount;
};
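
The point of this layout is that the zstr view and the zdata view of a zval
overlay each other. Below is a self-contained sketch of that property using
stand-in types (all sk_ names are hypothetical, and the IS_* values are
placeholders). It assumes zvalue_value is a union, as in the existing engine.
Whether both payloads end up at the same offset depends on the platform's
alignment rules, so the sketch checks it instead of relying on it.

```c
#include <stddef.h>
#include <stdint.h>

typedef unsigned char sk_type;
typedef uint16_t      sk_uchar;                 /* stand-in for UChar */
enum { SK_IS_STRING = 1, SK_IS_UNICODE = 2 };   /* placeholder values */

typedef struct {
    union { char *s; sk_uchar *us; } u;
    int32_t len;
} sk_zstr_value;

typedef union {
    long          lval;
    double        dval;
    sk_zstr_value str;
} sk_zvalue_value;

typedef struct { sk_type type; sk_zstr_value   value; } sk_zstr;
typedef struct { sk_type type; sk_zvalue_value value; } sk_zdata;

typedef struct {
    union { sk_zdata data; sk_zstr str; } u;
    unsigned char is_ref;
    unsigned int  refcount;
} sk_zval;

/* 1 if both views place their payload at the same offset. This depends on
 * alignment and is not guaranteed by the C standard. */
static int sk_views_overlay(void)
{
    return offsetof(sk_zstr, value) == offsetof(sk_zdata, value);
}

/* Store a string through the zdata view and read its length back through
 * the zstr view. Returns -1 if the two views do not overlay. */
static int32_t sk_len_via_str_view(char *s, int32_t len)
{
    sk_zval z;

    if (!sk_views_overlay()) {
        return -1;
    }
    z.u.data.type = SK_IS_STRING;
    z.u.data.value.str.u.s = s;
    z.u.data.value.str.len = len;
    z.is_ref = 0;
    z.refcount = 1;
    return z.u.str.value.len;
}
```

A compile-time or startup check along the lines of sk_views_overlay() would
be a cheap safeguard if the overlay were adopted.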

Now the Unicode string API (zu_ prefix) shall use either separate type,
string and length parameters or a simplified string type. When we have to
provide static constants for comparisons and the like, right now we always
need to convert the native string, which takes unnecessary time. It is also
error prone, because in a lot of places we simply forget about Unicode.

zu_str*(zstr *s1, zstr *s2)
{
}

or separate parameters:

zu_str*(zend_uchar type1, int len1, const void* str1,
       zend_uchar type2, int len2, const void* str2)
{
}

Both variants have problems with static constants, so we would be left with
separate parameters and two pointer parameters per string. The semantics
would be that either the char* and UChar* arguments are identical and type
is one of IS_STRING and IS_UNICODE, or they differ and type is set to -1.
But that is not really compatible with our zvals.

zu_str*(zend_uchar type1, int len1, const char* str1, const UChar* ustr1,
       zend_uchar type2, int len2, const char* str2, const UChar* ustr2)
{
}

However, we could furthermore declare helper macros that circumvent the
problem as follows:

#define Z_STR_PASS(z) Z_TYPE(z), Z_STRLEN(z), Z_STRVAL(z), Z_UNIVAL(z)
#define Z_STR_PASS_P(pz) Z_STR_PASS(*(pz))
#define Z_STR_PASS_PP(ppz) Z_STR_PASS(**(ppz))
#define ZU_CONST(str, ustr) -1, sizeof(str)-1, str, ustr

Now consider an API-compliant string compare function:

int zu_strcmp(zend_uchar type1, int len1, const char* str1, const UChar* ustr1,
              zend_uchar type2, int len2, const char* str2, const UChar* ustr2)
{
...
}

And its invocation:

if (zu_strcmp(Z_STR_PASS_P(zval), ZU_CONST("parent", "p\0a\0r\0e\0n\0t"))) {
}
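
To make the idea concrete, here is one possible shape of such a compare
function, as a sketch only. All sk_ names, the tag values and the return
code for "no common form" are assumptions made for illustration; int is
used for the type so that -1 can be represented directly. The constant's
single length is correct for both forms only when the text is ASCII.

```c
#include <stdint.h>
#include <string.h>

typedef uint16_t sk_uchar;   /* stand-in for UChar (assumption) */

/* Placeholder type tags; SK_IS_BOTH plays the role of the -1 marker. */
enum { SK_IS_BOTH = -1, SK_IS_STRING = 1, SK_IS_UNICODE = 2 };

/* Returned when the operands share no form and a conversion is needed. */
#define SK_CMP_NO_COMMON_FORM 2

/* A static constant carries both forms (ASCII-only text assumed). */
#define SK_CONST(str, ustr) SK_IS_BOTH, (int)(sizeof(str) - 1), (str), (ustr)

/* Minimal stand-in for a string-carrying value plus its pass macros. */
typedef struct {
    int type;
    int len;
    const char *s;
    const sk_uchar *us;
} sk_str;
#define SK_STR_PASS(z)    (z).type, (z).len, (z).s, (z).us
#define SK_STR_PASS_P(pz) SK_STR_PASS(*(pz))

static const sk_uchar sk_parent_u[] = { 'p', 'a', 'r', 'e', 'n', 't', 0 };

static int sk_sign(long d) { return (d > 0) - (d < 0); }

/* Returns -1, 0 or 1, or SK_CMP_NO_COMMON_FORM. */
static int sk_strcmp(int type1, int len1, const char *str1, const sk_uchar *ustr1,
                     int type2, int len2, const char *str2, const sk_uchar *ustr2)
{
    int n = len1 < len2 ? len1 : len2;
    int i;

    if (type1 != SK_IS_STRING && type2 != SK_IS_STRING) {
        /* Both sides have a Unicode form: compare in code unit order. */
        for (i = 0; i < n; i++) {
            if (ustr1[i] != ustr2[i]) {
                return sk_sign((long)ustr1[i] - (long)ustr2[i]);
            }
        }
        return sk_sign((long)len1 - (long)len2);
    }
    if (type1 != SK_IS_UNICODE && type2 != SK_IS_UNICODE) {
        /* Both sides have a native form: compare bytes. */
        int d = memcmp(str1, str2, (size_t)n);
        return d != 0 ? sk_sign(d) : sk_sign((long)len1 - (long)len2);
    }
    return SK_CMP_NO_COMMON_FORM;
}

/* Usage: test a native string against the constant without converting. */
static int sk_native_is_parent(const char *s, int len)
{
    sk_str z;
    z.type = SK_IS_STRING; z.len = len; z.s = s; z.us = NULL;
    return sk_strcmp(SK_STR_PASS_P(&z), SK_CONST("parent", sk_parent_u)) == 0;
}

/* Usage: the same test for a Unicode string, again without converting. */
static int sk_unicode_is_parent(const sk_uchar *us, int len)
{
    sk_str z;
    z.type = SK_IS_UNICODE; z.len = len; z.s = NULL; z.us = us;
    return sk_strcmp(SK_STR_PASS_P(&z), SK_CONST("parent", sk_parent_u)) == 0;
}
```

Because the constant supplies both forms, neither helper has to convert
anything. Only a native string compared against a Unicode-only string ends
up in the "no common form" branch.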

The handling seems quite easy to me and has only one negative aspect: we
would pass two string pointers where in many cases only one is necessary.
On the plus side, we would not need to convert and we would get a
consistent API.

If we go with the last variant, we do not really need the zstr type and
could keep the old layout. However, the zstr type can be helpful in quite a
few places.

Best regards,
 Marcus

-- 
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
