Re: [PATCH] utf8 flag support on perl lib

Peter J. Holzer Sat, 12 Jan 2008 04:38:58 -0800

On 2008-01-11 11:03:02 +0300, Tomash Brechko wrote:
> On Thu, Jan 10, 2008 at 19:55:50 +0100, Peter J. Holzer wrote:
> > I really think that:
> > 
> >  my $var = "some arbitrary string";
> >  $memcached->set("key", $var);
> >  my $var2 = $memcached->get("key");
> >  is($var, $var2);
> > 
> > should always succeed.
> 
> That would be nice if that would work automagically, however things
> are more complex than that.  Suppose you have two strings, one $text
> is a true UTF-8 string (it has the UTF-8 flag set and there are also
> bytes with high bit set).  Another $binary is the same sequence of
> bytes, but UTF-8 is not set for it, it's a binary data.  Now what if
> you store both strings into the same (plain) file?  When you will be
> reading them back, you either have to open the file in a :raw mode, or
> in a :utf8 mode, and _both_ strings will have UTF-8 flag cleared or
> set (or you have to read them separately in different modes).


But they will still compare equal to the original strings, if you did it
correctly.
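
A minimal sketch of what "correctly" means here, using in-memory files
instead of real ones. The helper round_trip() and the sample strings are
made up for illustration: each string is written and read back through
the layer that matches its type, and each compares equal to its original.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# A character string and a byte string holding the same UTF-8 byte sequence.
my $text   = "caf\x{e9} \x{263a}";
my $binary = encode_utf8($text);

# Write $data through the given layer to an in-memory file, read it back
# through the same layer, and return what was read.
sub round_trip {
    my ($data, $layer) = @_;
    my $buf = '';
    open(my $out, ">$layer", \$buf) or die "open for write: $!";
    print {$out} $data;
    close($out) or die "close: $!";
    open(my $in, "<$layer", \$buf) or die "open for read: $!";
    my $got = do { local $/; <$in> };
    close($in);
    return $got;
}

my $text_back   = round_trip($text,   ':encoding(UTF-8)');
my $binary_back = round_trip($binary, ':raw');

print "text ok\n"   if $text_back   eq $text;
print "binary ok\n" if $binary_back eq $binary;
```

The two strings do not compare equal to each other, before or after the
round trip, even though the bytes in the file are identical.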


> This is natural, because this flag is not stored along with the data,
> instead, when you read the data back the decision about setting UTF-8
> flag is made anew each time.

Please don't think about it as a "flag". That's an implementation
detail. Think about "byte strings" and "character strings" as different
types: A perl scalar can be:

* A byte string

* A character string

* A signed integer

* An unsigned integer

* A floating point value

* A reference

(actually, it's a bit more complicated because it can be several of
these types at once, but I think we can ignore that for the moment).

Normally, as a Perl programmer, you don't care which type a scalar is,
because perl will convert it behind the scenes if you use the scalar in
a context where a different type is needed. But you do care about the
value: If you store a scalar somewhere and you read it back, you expect
to get the same value, but not necessarily the same type. Now let's see
how Cache::Memcached handles these types:

* A byte string: That's the simple case which "just works".

* A character string: set performs an implicit encode_utf8() on the data
  (not by design - it's a side effect of the internal representation of
  perl character strings and the implementation of send()). get doesn't
  do the reverse operation, so the value changes.

* A signed integer: Set implicitly converts the number to a decimal
  string. Get doesn't convert the string back to a number, but the first
  time the string will be used in a numeric context perl will do it, so
  it doesn't matter.

* An unsigned integer: Same as a signed integer.

* A floating point value: Same as the integers, except that on at least
  some perl implementations the FP -> string -> FP conversion may lose
  some precision (but that's a bug in perl).

* A reference: Set uses Storable to serialize the data referenced and
  stores the fact that it did so in a flag. Get then sees the flag and
  unserializes the data, restoring all values and types, and
  references within the serialized set (but breaking references from
  "outside").

So Cache::Memcached preserves the value for all scalar types except
character strings. This is a surprising inconsistency which should be
fixed. And it can be fixed as the OP showed.
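
The character string case can be reproduced without a server. This is
only a sketch of the effect described above, not of the library's code:
the assignment to $fetched stands in for a set followed by a get, on the
assumption that the octets come back unchanged.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

my $text = "na\x{ef}ve \x{263a}";     # character string, 7 characters

# What effectively goes over the wire on set: the UTF-8 encoded octets.
my $stored = encode_utf8($text);

# What get hands back: those octets, unchanged and not decoded.
my $fetched = $stored;

printf "sent %d characters, got back %d\n", length($text), length($fetched);
print "value changed\n"   if $fetched ne $text;
print "decode restores\n" if decode_utf8($fetched) eq $text;
```

The reverse operation exists and is cheap; it just isn't applied on get.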


> The same happens if you store 32-bit integer and 32-bit float into the
> file: it's up to the reading side to decide what type will have each
> 4-byte sequence, or whether they will be processed as the untyped byte
> sequences.

In Perl, if I store an integer or float into a file and read it back I
will get a string with the same numeric value, which I can use just like
a number.
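
A small sketch of that, with the string interpolation standing in for
the write and the read. The float part shows the precision caveat from
the list above; what it prints depends on how perl was built, so I am
assuming a typical build with 64-bit doubles.

```perl
use strict;
use warnings;

# Integers survive the trip through their decimal string form.
my $int    = -12345;
my $as_str = "$int";                 # what ends up in the file or cache
print "integer ok\n" if $as_str == $int;

# Floating point: the default stringification may round.
my $float = 0.1 + 0.2;
my $f_str = "$float";
print "stored as '$f_str'\n";
print(($f_str == $float) ? "float ok\n" : "float lost precision\n");
```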

In C, variables are statically typed, so I will get the same value iff I
read those four bytes into a variable of the same type (on the same
platform - this isn't portable between different C implementations).


> So, if it doesn't work automagically for files, why to expect it to
> work so for memcached?

Because it can :-).

More importantly, the interface for files is different. There are all
kinds of I/O-Layers which we don't have (and probably don't need) in
Cache::Memcached. OTOH, you can't just store a reference into a file
and expect to read anything useful back - but Cache::Memcached does
something useful with a reference. So why shouldn't it do something
useful with a character string, which is much simpler (and a much more
portable concept)?


> Likewise, from the description of how DBD::Oracle works it follows
> that it also doesn't save UTF-8 flag of Perl scalars somewhere (not a
> surprise).  Instead, the decision is again made anew each time the
> data is retrieved: if it comes from the binary column, the UTF-8 flag
> will be cleared (even if you saved $text there).  If the data comes
> from the text column, _and_ some setting (like environment variable or
> a DBD constructor parameter) is set, then UTF-8 flag will be set (even
> if originally you stored $binary there).

Forget about the flag - I'm interested in the value, not the type. If I
wanted static types, I'd be programming in C, not Perl. When I store
something in a column of a suitable type, I get the same value back. If
I choose the wrong type I will get garbage back (or more likely I can't
even store it because the RDBMS will catch the violation). But I know
that columns in relational databases have types, and I can choose a
suitable type for my data. memcached has just one type (an octet
string plus flags). Cache::Memcached doesn't allow the application to
access the flags but uses them itself to store some kind of type
information.


> Of course, if the application is designed in a way that it always
> saves $text into the text columns, $binary into the binary columns,
> and the relevant setting is enabled, then you have the impression that
> UTF-8 flag is saved and restored.

Yes. If I do things correctly it will work correctly. Quelle surprise!

> But this is because most databases have the notion of data type, while
> plain files _and_ memcached lack it.

Files are (on most platforms) unstructured octet streams. Memcached also
provides storage for a simple octet stream plus some flags, but
Cache::Memcached doesn't! I would have no problem if Cache::Memcached
simply exposed the interface of memcached to the application: Then I
would *know* that I can store only octet strings and I would have to
think about serializing anything else myself, just as for files.
But it doesn't do that. It looks like it could handle all perl data types
transparently, but fails to do so for one rather basic type. If I can
store arbitrarily complex nested data structures, why can't I store a
simple string? That's just inconsistent, surprising, and totally
undocumented.


> Now, one could say, why bother?  Even if memcached is untyped by
> itself, let's emulate text-or-binary type with the flag which we will
> call F_UTF8 or something.  And like the patch in question does, we'd
> save the value of scalar's UTF-8 flag on data store, and restore it on
> data fetch (perhaps only when some external setting permits us to do
> so, as for DBD::Oracle).
> 
> There will be a problem with this.  Suppose I have the following
> setup: there's a number of populate scripts that load the data from
> say, files, into the memcached.
[...]
> But what's important is that _populate_ scripts do not care about
> characters, they just copy bytes around (it's like when you copy the
> file with 32-bit int and 32-bit float you don't care about these
> types, just move bytes around).  So naturally populate scripts do not
> have 'use utf8;' in them, nor do they open the input files in :utf8
> mode.  What's also important is that populate scripts don't have to be
> written in Perl.

You have the same problem with F_STORABLE. Actually it is even worse
there: While the concept of "a text string represented in UTF-8
encoding" is independent of the programming language, the Storable
format is perl-specific, documented only by source-code, and might even
change (in fact, the POD says that it did change between 2.02 and 2.03).


> Now, if we imagine that we employed F_UTF8, the following would
> happen: I still may control if the data retrieved from memcached would
> have UTF-8 set or not with the environment setting or some other knob.
> But all scripts that do stores into memcached (like my populate
> scripts) _have_ to work in utf8 mode, otherwise the internal F_UTF8
> won't be set.  This is very unnatural and would cause all sorts of
> confusion,

No, I think it is perfectly natural that all programs accessing the same
data need to agree how it is stored. This may be enforced by the use of
a common library, or you may have to write this into the specification,
if you need to access the data from different languages or if your
library cannot enforce some aspect.


> because basically I have to know the data type just to copy
> data around.  Note how this is different from typed databases: there
> you can't avoid making the decision when you do the store: you either
> save your data into the text _or_ the binary column.  But memcached is
> like untyped file: no decision is _required_ when you do the store,
> you'll decide everything on fetch.

But the flags are part of the data stored by memcached. You can copy
around the data without knowing what it means if you also copy the
flags. You can't just leave off the flags any more than you can just
leave off the first 4 bytes of an arbitrary file.


> In other words, it is hard to maintain the F_UTF8 setting, because you
> are forced to carry the knowledge about data type across all
> store/fetch operations, just to preserve the flag to the time when you
> decide to do real character processing.

It is no harder than F_STORABLE or F_COMPRESSED. I don't see you arguing
that these should be removed.

> It would be much safer (and also more natural) if we won't try to
> pretend that memcached has the notion of type when it clearly doesn't,
> and do all typing explicitly, like we do it with plain files.

memcached doesn't have the notion of types, but Cache::Memcached does:
It distinguishes between unstructured byte strings and frozen
structures. Adding a third type (utf-8 encoded text strings) wouldn't
change that conceptually. It would even be universally useful as it is
portable to other programming languages (unlike Storable).
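
A rough sketch of what such a third type could look like. Everything
here is hypothetical: the hash %cache stands in for the server, the
functions cache_set() and cache_get() are made up, and the value of
F_UTF8 is just an unused bit picked for the example. It is not the code
from the patch.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical flag bit for this sketch; the value is an assumption.
use constant F_UTF8 => 4;

my %cache;   # stand-in for the memcached server: key => [flags, octets]

sub cache_set {
    my ($key, $value) = @_;
    my $flags = 0;
    if (utf8::is_utf8($value)) {
        # Character string: encode explicitly and record that we did so.
        $value  = encode_utf8($value);
        $flags |= F_UTF8;
    }
    $cache{$key} = [$flags, $value];
    return 1;
}

sub cache_get {
    my ($key) = @_;
    my $entry = $cache{$key} or return;
    my ($flags, $value) = @$entry;
    # Reverse the encoding only if the flag says it was applied.
    $value = decode_utf8($value) if $flags & F_UTF8;
    return $value;
}

my $text   = "gr\x{fc}n \x{263a}";
my $binary = encode_utf8($text);

cache_set(text => $text);
cache_set(bin  => $binary);

print "text round-trips\n"   if cache_get('text') eq $text;
print "binary round-trips\n" if cache_get('bin')  eq $binary;
```

What is stored is always an octet string plus a flag, so a client in
another language only needs to know that the flag means "these octets
are UTF-8 encoded text".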

        hp

-- 
   _  | Peter J. Holzer    | It took a genius to create [TeX],
|_|_) | Sysadmin WSR       | and it takes a genius to maintain it.
| |   | [EMAIL PROTECTED]         | That's not engineering, that's art.
__/   | http://www.hjp.at/ |    -- David Kastrup in comp.text.tex
