Re: [PATCH] utf8 flag support on perl lib

Peter J. Holzer Thu, 10 Jan 2008 03:49:54 -0800

On 2008-01-10 13:09:07 +0300, Tomash Brechko wrote:
> On Thu, Jan 10, 2008 at 18:21:44 +0900, [EMAIL PROTECTED] wrote:
> > Currently, perl library of memcached (Cache::Memcached) will lost
> > utf8 flag when treat scalar values. However for reference values,
> > the flag will be kept transparently by Storable.
> > 
> > I wrote a patch and test to keep utf8 flag of scalar values.
> > Please consider to merge the patch?
> 
> Cache::Memcached::Fast doesn't preserve UTF-8 and tainted flags
> either.


Losing the utf8 flag changes the value; losing the taint flag doesn't.
The latter has quite different implications, which should maybe be
discussed in a different thread.
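
To illustrate the first point, here is a minimal sketch. It assumes the
flag is lost in the simplest possible way: the UTF-8 octets of the
string come back as a byte string. The variable names are made up for
the example.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# A character string holding one character, U+263A. Written as an
# escape so the example does not depend on the source file encoding.
my $chars = "\x{263a}";

# The same data with the flag gone: the UTF-8 octets as a byte string.
my $bytes = encode_utf8($chars);

printf "chars: length %d, flag %s\n",
    length($chars), utf8::is_utf8($chars) ? "on" : "off";
printf "bytes: length %d, flag %s\n",
    length($bytes), utf8::is_utf8($bytes) ? "on" : "off";
print "equal: ", ($chars eq $bytes ? "yes" : "no"), "\n";
```

The first string has length 1, the second has length 3, and they do not
compare equal, so the application gets back a different value from the
one it stored.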


> Some time ago I implemented the fix for C::M very close to your patch,
> but decided not to send it out.  The problem with the proposed
> solution is that (man Encode):
> 
>   Messing with Perl's Internals
> 
>        The following API uses parts of Perl's internals in the current imple-
>        mentation.  As such, they are efficient but may change.
> 
>        is_utf8(STRING [, CHECK])
>          [INTERNAL] Tests whether the UTF-8 flag is turned on in the STRING.
>          If CHECK is true, also checks the data in STRING for being well-
>          formed UTF-8.  Returns true if successful, false otherwise.
> 
>          As of perl 5.8.1, utf8 also has utf8::is_utf8().
> 
>        _utf8_on(STRING)
>          ...
> 
> In other words, the functions you have to use are internal, subject to
> change without notice, and appear only in the recent versions of Perl.

I would like these functions to change since they are horribly named.
However, I don't expect that to happen. Given perl's attitude towards
backward compatibility, they'll probably stay in 5.x forever, even if
the underlying implementation changes from UTF-8 to UCS-4 (or whatever).

So I wouldn't worry about that. 

(Doing UTF-8 conversions only if $^V >= 5.8.0 is quite easy, although it
adds a bit of overhead, so that could be fixed, but C::M already seems to
require 5.8.x anyway.)
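
Such a version guard could look roughly like the sketch below. HAS_UTF8
and to_octets are hypothetical names, not anything in C::M, and the
check uses the numeric $] and the 5.8.1 threshold because the man page
quoted above says utf8::is_utf8() exists as of perl 5.8.1.

```perl
use strict;
use warnings;

# Hypothetical guard: true only where utf8::is_utf8() is available.
use constant HAS_UTF8 => ($] >= 5.008001) ? 1 : 0;

# Returns (octets, was_character_string) for a scalar value.
sub to_octets {
    my ($value) = @_;
    if (HAS_UTF8 && utf8::is_utf8($value)) {
        utf8::encode($value);    # in place, but only on our own copy
        return ($value, 1);
    }
    return ($value, 0);          # byte string (or old perl): unchanged
}
```

Since HAS_UTF8 is a compile-time constant, the overhead on an old perl
is just the one branch that never calls the utf8:: functions.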

Not being able to store and retrieve arbitrary string values with C::M
is IMHO a serious bug which should be fixed. Alternative solutions would
be:

* Always do an encode_utf8 in set and a decode_utf8 in get. This has two
  drawbacks:

  1) It breaks compatibility for binary strings (and since few people
     seem to have noticed this bug yet, I expect that most people store
     binary strings).
  2) It adds some overhead for binary strings (I don't know how large
     that is - benchmarks would be needed).

* Detect character strings in set and freeze them - but that needs some
  additional magic as well and looks rather messy. 

* Add a flag to set (and possibly get) which causes the value to be
  UTF-8 encoded before storage. Presumably the application knows whether
  a given string is a character or byte string so that's a rather clean
  way. However it relies on the programmer to explicitly say something
  which the module can detect itself, which is strange given that C::M
  already handles scalars and references by itself.
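
The compatibility problem with the first alternative can be shown with
a small sketch. It assumes a binary value that was stored unchanged by
an older client (or a client in another language) and is then read by a
client that always decodes. The variable names are only for the example.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

my $binary = "\xff\xfe\x00\x01";    # arbitrary octets, not valid UTF-8

# New client on both sides: encode in set, decode in get.
my $stored_new = encode_utf8(my $copy1 = $binary);
my $round_trip = decode_utf8($stored_new);

# Old client stored the octets as they were; new client decodes them.
my $read_back = decode_utf8(my $copy2 = $binary);

print "new/new round trip intact: ",
    ($round_trip eq $binary ? "yes" : "no"), "\n";
print "old/new read intact:       ",
    ($read_back eq $binary ? "yes" : "no"), "\n";
```

The round trip through two new clients preserves the value, but the
octets written by the old client are not valid UTF-8, so decoding them
yields something else.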

So in conclusion I think that sugi's approach is better than the
alternatives.
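
A rough sketch of the general idea, as I understand it, not the patch
itself: detect the utf8 flag in set, record it in the per-item flags
that memcached stores alongside the value, and decode in get only when
that bit is set. F_UTF8, its value, cache_set, cache_get and the
in-memory %store standing in for the server are all assumptions made
for the example.

```perl
use strict;
use warnings;

use constant F_UTF8 => 4;    # hypothetical flag bit, not from the patch

my %store;    # stand-in for the server: key => [flags, octets]

sub cache_set {
    my ($key, $value) = @_;
    my $flags = 0;
    if (utf8::is_utf8($value)) {
        utf8::encode($value);    # character string -> UTF-8 octets
        $flags |= F_UTF8;
    }
    $store{$key} = [$flags, $value];
    return 1;
}

sub cache_get {
    my ($key) = @_;
    my $item = $store{$key} or return;
    my ($flags, $value) = @$item;
    utf8::decode($value) if $flags & F_UTF8;    # octets -> characters
    return $value;
}

cache_set(chars  => "\x{263a}");
cache_set(binary => "\xff\xfe");
```

Byte strings are stored and returned untouched, so existing binary data
keeps working, and character strings come back as character strings.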


> Besides, more often than not you use memcached client together with
> some other means to get the data if it's missing from the cache.
> While you may fix C::M(::F), not every other backend preserves these
> flags automatically.

These are the primary means for storing your data. If they can't handle
your data you've got a problem :-). When you design your application you
know (or should know) what data you want to store and choose your data
model and storage system accordingly. If MySQL can't do it, use Oracle
(or vice versa); if a varchar column can't do it, use a blob; etc.

But Cache::Memcached is supposed to be a general purpose cache: If you
store a value and successfully read it back you expect to get the same
value and not some other value.

        hp

-- 
   _  | Peter J. Holzer    | It took a genius to create [TeX],
|_|_) | Sysadmin WSR       | and it takes a genius to maintain it.
| |   | [EMAIL PROTECTED]         | That's not engineering, that's art.
__/   | http://www.hjp.at/ |    -- David Kastrup in comp.text.tex
