I've had some time to think about this and Derick and I also kicked
around some ideas in a private conversation.
The situation I am talking about is really about exceptional
circumstances, such as an ISO-8859-1 string being treated as a UTF-8 one
or some other condition that results in illegal sequences. This is very
different from an unassigned character condition, which is handled by
the SUBST, SKIP, etc. callbacks. I disagree with the notion that this is
similar to the (int)"foo" example. There, we have well-defined semantics
that say "strings not starting with a number get converted to 0".
Treating ISO-8859-1 data as UTF-8 is simply invalid and bad behavior
and should not be encouraged by silently ignoring the conversion error.
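
To make the contrast concrete, here is a minimal sketch in present-day PHP. It is only an illustration and not part of the proposal: the cast has a defined result, while the ISO-8859-1 bytes are simply not legal UTF-8. The check uses preg_match() with the /u modifier as a stand-in validity test; the variable names are made up for the example.

```php
<?php
// A cast has defined semantics: a non-numeric string becomes 0, with no error.
$n = (int) "foo";

// "café" encoded as ISO-8859-1: the lone 0xE9 byte is not a legal UTF-8 sequence.
$latin1 = "caf\xE9";

// preg_match() with the /u modifier returns false when the subject is not
// valid UTF-8, so it serves here as a stand-in validity check.
$isValidUtf8 = preg_match('//u', $latin1) !== false;

var_dump($n);
var_dump($isValidUtf8);
```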
Now, I understand that there is resistance to the use of exceptions in
this case and I see the point of those who are against them. My problem
is this: if we do not throw exceptions, then all we are left with is a
warning, which is not helpful if you want to determine in a
programmatic fashion whether there was a conversion error. Sure, you
can check the return value of unicode_decode(), or maybe even fread()
and such, but it does not help with casting, concatenation, and other
similar operations. So, we do need a mechanism for this and it has to
be a fairly flexible one because libraries may want to do one thing on
failure, and the application itself another.
The best Derick and I could come up with is a user-specified conversion
error handler. It would be invoked only when the converter encounters
an illegal sequence or other serious error. The existing subst, skip,
etc. error modes would still apply. The error handler signature would be
something like:
    function my_handler($direction, $encoding, $string, $char_byte, $offset) { ... }
Where $direction is the direction of conversion (FROM_UNICODE or
TO_UNICODE), $encoding is the name of the encoding in use during the
attempted conversion, $string is the source string that converter tried
to process, $char_byte is either the failed Unicode character or the
byte sequence (depending on direction), and $offset is the offset of
that character/byte sequence in the source string. The user error
handler is then free to silence the warning, throw an exception (throw
new UnicodeConversionException($message, $direction, $char_byte,
$offset)), or do something else. I have not yet decided whether it's a
good idea to allow the user handler to continue the conversion or not.
I'd rather the conversion always stopped.
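
A rough sketch of how such a handler might look from userland follows. Everything except the handler signature is an assumption made for illustration: the constant values, the layout of UnicodeConversionException, and simulate_to_unicode(), which only stands in for the engine's converter by validating UTF-8 byte-wise and calling the handler at the first illegal byte. It also assumes the conversion always stops after the handler returns.

```php
<?php
// Hypothetical values; the proposal names these constants but does not define them.
const TO_UNICODE = 1;
const FROM_UNICODE = 2;

// Hypothetical exception class carrying the details passed to the handler.
class UnicodeConversionException extends Exception
{
    public $direction;
    public $charByte;
    public $offset;

    public function __construct($message, $direction, $charByte, $offset)
    {
        parent::__construct($message);
        $this->direction = $direction;
        $this->charByte = $charByte;
        $this->offset = $offset;
    }
}

// User handler with the proposed signature; this one chooses to throw.
function my_handler($direction, $encoding, $string, $char_byte, $offset)
{
    throw new UnicodeConversionException(
        sprintf('Illegal %s sequence 0x%s at offset %d',
                $encoding, bin2hex($char_byte), $offset),
        $direction, $char_byte, $offset
    );
}

// Stand-in for the converter (an assumption, not a real API): matches the
// longest well-formed UTF-8 prefix and reports the first byte after it.
function simulate_to_unicode($bytes, callable $handler)
{
    $valid = '/^(?:[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2})*/';
    preg_match($valid, $bytes, $m);
    $offset = strlen($m[0]);
    if ($offset < strlen($bytes)) {
        $handler(TO_UNICODE, 'UTF-8', $bytes, $bytes[$offset], $offset);
        return false; // assumed: the conversion stops even if the handler returns
    }
    return $bytes;
}

// ISO-8859-1 "café" fed to a UTF-8 conversion triggers the handler.
$caught = null;
try {
    simulate_to_unicode("caf\xE9", 'my_handler');
} catch (UnicodeConversionException $e) {
    $caught = $e;
    echo $e->getMessage(), "\n";
}
```

A library could install a handler that throws, as above, while an application might prefer one that logs and returns, in which case the caller would just see the failed conversion.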
-Andrei
On Apr 13, 2006, at 4:02 PM, Andi Gutmans wrote:
Yeah but we can't only tailor to the default. If you cast "abc" to an
integer today PHP will do the conversion (e.g. 0). I think we should
stick to that paradigm and provide users with validation methods if
they want to strictly validate...
--
PHP Internals - PHP Runtime Development Mailing List