[Developers] CALL ENDS: HACK: New attributes on StorageManager, related to encoding problems.

Pierre van Rooden Wed, 02 Feb 2005 23:43:49 -0800

CALL FOR:
New attributes on StorageManager.

Called by: Michiel Meeuwissen
Total tally on this call : +6

START OF VOTING:   2005-01-31 21:30
END OF CALL:       2005-02-03 21:30

YEA (6) : Kees Jongenburger, Pierre van Rooden, Rico Jansen, Rob Vermeulen, Daniel Ockeloen, Marcel Maatkamp

ABSTAIN (1) : Rob van Maris

NAY (0) :

VETO (0) :

No votes, assumed abstained (8): Eduard Witteveen, Jaco de Groot, Andre van Toly, Johannes Verelst, Nico Klasens, Gerard van Enk, Mark Huijser, Ernst Bunders

Result: call succeeded. The changes can be added.

Michiel Meeuwissen wrote:

CALL FOR:   New attributes on StorageManager.

A long story follows; I'll add an abstract near the end.

I want to add a few new attributes to StorageManager and one to
DatabaseStorageManager.

I'll first explain why we need it, then what I propose.

It has to do with bug  #6569 :  Taglibs Using wrong character encoding
for post data, and the situation at VPRO.

The VPRO used to use editors which are encoded according to
ISO-8859-1. Their database (Informix) is also ISO-8859-1. A bit
unexpectedly though, the characters specific to CP1252 did work
too. This was possible because the Java strings were wrong after a post
on an ISO-8859-1 page to which CP1252 was posted (most noticeably LEFT
and RIGHT SINGLE QUOTATION MARK, which are often produced by
copy-pasting from MS Word). So, after such a post, the bytes were
wrongly assumed to be ISO-8859-1, while they are actually
CP1252. Fixing that is a bugfix, because taglib should of course always
do the correct thing, even if the result does not seem entirely logical.

After that fix, though, the Strings become 'correct', and you will see
that things actually appear to work worse, because now you will not be
able to store the CP1252 characters any more in the ISO-8859-1 database
(which makes sense). Furthermore, such correct strings display worse on
an ISO-8859-1 page, because Java will not produce CP1252 if you request
ISO-8859-1 (which makes sense too). If you request ISO-8859-1 for a
'wrong' string, Java produces CP1252, which can be considered a lucky
coincidence, and which is then displayed as intended by most browsers
(which is sensible, because the CP1252-specific characters only occupy
control-character positions, 0x80-0x9F, in ISO-8859-1).
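
The difference between a 'wrong' and a 'correct' string can be
illustrated with plain JDK calls. This is only a sketch: the class and
method names are made up for the illustration and are not part of
MMBase or the taglib.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class WrongStringDemo {
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Decodes posted bytes as ISO-8859-1, which yields the 'wrong' string. */
    static String decodeAsLatin1(byte[] posted) {
        return new String(posted, StandardCharsets.ISO_8859_1);
    }

    /** Decodes the same bytes as CP1252, which yields the 'correct' string. */
    static String decodeAsCp1252(byte[] posted) {
        return new String(posted, CP1252);
    }

    public static void main(String[] args) {
        // RIGHT SINGLE QUOTATION MARK is the single byte 0x92 in CP1252.
        byte[] posted = {'i', 't', (byte) 0x92, 's'};
        String wrong = decodeAsLatin1(posted);
        String correct = decodeAsCp1252(posted);
        System.out.printf("wrong:   U+%04X%n", (int) wrong.charAt(2));
        System.out.printf("correct: U+%04X%n", (int) correct.charAt(2));

        // Requesting ISO-8859-1 output: the 'wrong' string gives the CP1252
        // byte back, the 'correct' string gives a question mark.
        int w = wrong.getBytes(StandardCharsets.ISO_8859_1)[2] & 0xFF;
        int c = correct.getBytes(StandardCharsets.ISO_8859_1)[2] & 0xFF;
        System.out.printf("wrong -> 0x%02X, correct -> 0x%02X%n", w, c);
    }
}
```

The 'wrong' string carries the control character U+0092, which happens
to encode back to the byte a browser shows as a curly quote.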

That the situation with 'wrong' strings is not desirable is shown by
the fact that you cannot actually use a UTF-8 page then: the 'wrong'
characters will only produce garbage, and you are limited to the actual
ISO-8859-1 characters.


I hope the above represents the problem more or less clearly. I
produced much longer mails and discussions at the VPRO about the issue,
which I think mainly proves that it simply _is_ difficult...

Anyhow, my proposal is a bit more straightforward:

First of all, the VPRO has CP1252 in their database, and the front-end
is mostly ISO-8859-1, but of course they don't want question marks on
their pages for all the CP1252 data (quotation marks, euro signs), of
which there is plenty.

So, I'll make sure that CP1252 is interpreted as such if the database
is ISO-8859-1. That can never do harm, because there are no ISO-8859-1
characters which are not at the same position in CP1252. Now you fetch
'correct' strings in any case.
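
As a rough check of that claim, the following sketch compares how single
bytes decode under both charsets. The helper names are hypothetical and
this is not the actual DatabaseStorageManager code. It only checks the
ASCII range and the range 0xA0-0xFF; the range 0x80-0x9F is exactly
where the two encodings differ (control characters in ISO-8859-1,
printable characters in CP1252).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ReadAsCp1252Demo {
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Hypothetical helper: charset to decode with, given the configured one. */
    static Charset effectiveReadCharset(Charset configured) {
        return StandardCharsets.ISO_8859_1.equals(configured) ? CP1252 : configured;
    }

    /** True if the byte decodes to the same character in both charsets. */
    static boolean sameInBoth(int b) {
        byte[] one = {(byte) b};
        String latin1 = new String(one, StandardCharsets.ISO_8859_1);
        return latin1.equals(new String(one, CP1252));
    }

    public static void main(String[] args) {
        boolean allSame = true;
        for (int b = 0x00; b <= 0x7F; b++) allSame &= sameInBoth(b);
        for (int b = 0xA0; b <= 0xFF; b++) allSame &= sameInBoth(b);
        System.out.println("identical outside 0x80-0x9F: " + allSame);

        // A euro sign stored as byte 0x80 is read back as a real euro sign.
        byte[] stored = {(byte) 0x80};
        Charset cs = effectiveReadCharset(StandardCharsets.ISO_8859_1);
        System.out.printf("0x80 reads as U+%04X%n", (int) new String(stored, cs).charAt(0));
    }
}
```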

Then, I propose the possibility to provide 'surrogators' on database
level. A surrogator is something which translates 'impossible
characters' to something which comes close enough but is not the real
thing. E.g. it can replace the euro sign with the word 'EURO'.

Concretely, that is an option 'GET_SURROGATOR' which should point to an
org.mmbase.util.transformers.CharTransformer implementation.

StorageManagerFactory gets a new method 'getGetSurrogator()', for use in
the StorageManager implementations.

This method is (or can be) then used after acquiring the String from
the database in DatabaseStorageManager. If I then set this to
'org.mmbase.util.transformers.CP1252Surrogator', I ensure that the
Strings are really ISO-8859-1 compatible, and can be displayed
acceptably on an ISO-8859-1 page without displaying 'unknown
characters' for the MS Word quotes (those become simple ASCII quotes)
etc.
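
What such a surrogator does to a string can be sketched as follows. This
is an illustration only: it does not implement the MMBase
CharTransformer interface, it is not the code of CP1252Surrogator, and
the chosen replacements are assumptions.

```java
final class SurrogateSketch {
    /** Replaces a few CP1252-specific characters by ISO-8859-1-safe text. */
    static String surrogate(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\u2018': // LEFT SINGLE QUOTATION MARK
                case '\u2019': // RIGHT SINGLE QUOTATION MARK
                    out.append('\'');
                    break;
                case '\u201C': // LEFT DOUBLE QUOTATION MARK
                case '\u201D': // RIGHT DOUBLE QUOTATION MARK
                    out.append('"');
                    break;
                case '\u20AC': // EURO SIGN
                    out.append("EURO");
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(surrogate("\u2018quoted\u2019 costs \u20AC5"));
    }
}
```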

So, this fixes the 'display' path (from database to front-end), and
as long as there are ISO-8859-1 pages there, this option can be on, and
everything works. If the option is off, you can use Unicode pages, and
the existing CP1252 characters really work (real euro signs will appear).


There remains the issue of the 'storage' path (from editor front-end to
database). The editors must be in Unicode (as are all editors in
MMBase, like JSP editors and editwizards). So, when strings arrive at
the database layer there is a problem, because the database actually
stores only ISO-8859-1, so now the euro symbol and the like cannot be
stored anymore. This would seem undesirable.

I devised two solutions for this:

 - You can also configure a 'SET_SURROGATOR', which ensures that
   surrogates are stored, which is at least better than question marks
   or something like that. No new CP1252 gets into the db this
   way, which makes sense because the database is ISO-8859-1.

 - In DatabaseStorageManager I created an option LIE_CP1252. This will
   make the Strings 'wrong' before feeding them to JDBC, so that they
   are the same as before, which proved to 'work'.
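
The second option can be pictured as the reverse of the read-side fix.
The sketch below is an assumption about what 'making the String wrong'
amounts to; the class and method names are invented and it is not the
actual DatabaseStorageManager code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class LieCp1252Sketch {
    static final Charset CP1252 = Charset.forName("windows-1252");

    /**
     * Encodes the correct string as CP1252 and pretends the resulting
     * bytes are ISO-8859-1, so a driver that writes ISO-8859-1 ends up
     * writing the CP1252 bytes.
     */
    static String lie(String correct) {
        return new String(correct.getBytes(CP1252), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String lied = lie("5 \u20AC");
        byte[] written = lied.getBytes(StandardCharsets.ISO_8859_1);
        // The euro sign ends up as the CP1252 byte 0x80 instead of '?'.
        System.out.printf("last byte written: 0x%02X%n", written[written.length - 1] & 0xFF);
    }
}
```

Characters outside CP1252 (Chinese, for instance) still become question
marks in this sketch, because they cannot be encoded as CP1252 either.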


There is one drawback to using Unicode editors when your backend cannot
handle this: editors don't get immediate feedback that their Chinese
characters were not actually stored well, because the Java string is
still in the node cache. So if you look at the object immediately after
editing it, it looks as if this Chinese was simply stored correctly.
Only some time later, or even only after a server restart, will it
become clear that the characters actually were replaced by question
marks, which may baffle the editor a bit.

So I propose that setStringValue of DatabaseStorageManager (a mere
protected method) returns the value it predicts that
'getStringValue' will produce. In setValue this can be used as follows:

 node.storeValue(fieldName, setStringValue(statement, index, value,
 field, node));

instead of only setStringValue(statement, index, value, field, node);

So the original value in the node is replaced by the value which would
actually be produced if it had come from the database. This will
give the editors the feedback which they expect.
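
One simple way to predict the value that will come back is to round-trip
the string through the database charset. The sketch below assumes that
the database replaces unencodable characters by question marks, as the
JDK encoder does; a real driver or database may behave differently, and
the names are invented for the illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class PredictStoredSketch {
    /** Encodes and decodes again, so unencodable characters turn into '?'. */
    static String predictStored(String value, Charset dbCharset) {
        return new String(value.getBytes(dbCharset), dbCharset);
    }

    public static void main(String[] args) {
        String submitted = "\u4E2D\u6587 ok";
        // This is the value that would be put back into the node.
        System.out.println(predictStored(submitted, StandardCharsets.ISO_8859_1));
    }
}
```

With the predicted value in the node, the editor sees the question marks
right away instead of after the cache entry is dropped.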

This feature could also be used to solve the issue of the 'maximal
length' of string fields. I think this occurs in MySQL: if you have a
field with a maximal length of 40 chars, and you submit a string of 60
chars, the extra 20 chars will only disappear after the node is
refetched from the database. I have not yet fixed/tested this, but I
think this bug can now easily be solved, perhaps by an extra database
option stating that it should perform this truncating.


ABSTRACT

- A new attribute SET_SURROGATOR and corresponding method on
  StorageManagerFactory.

- A new attribute GET_SURROGATOR and corresponding method on
  StorageManagerFactory.

- A new attribute LIE_CP1252 on DatabaseStorageManager. No method, it is
  only checked in setStringValue if the database encoding is
  ISO-8859-1. It will ensure that you can store CP1252 in an ISO-8859-1
  database. More or less the same behaviour as browsers, which submit
  CP1252 for ISO-8859-1 anyway.

- All of the above attributes will be off by default, so nothing will
  change for most people. VPRO will use GET_SURROGATOR (except
  for editor installations), and LIE_CP1252 in their informix.xml.
  Since I will fix the erroneous form handling anyway, you may need
  LIE_CP1252 if you depended on the old behaviour (as the VPRO did). The
  SET/GET SURROGATORS can also be used for other reasons, with other
  'chartransformers' (for example you could surrogate Unicode with
  'escaped' characters, and unescape on get, which enables you to store
  Unicode on completely ASCII-only targets).

  There will be no consequences whatsoever if you are using unicode
  everywhere, as you should :-)

- setStringValue() of DatabaseStorageManager will return the 'expected
  value for getStringValue', to update the node itself, which may be in
  the node-cache.

This will fix: #6632 Open Bug Medium 1.7.2 Misc Cache entries are not garanteed to accord with database content.
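
The 'escaped characters' idea from the abstract (escape on set, unescape
on get) might look like the sketch below. The escape format and the
class are assumptions made for the illustration; they are not existing
MMBase transformers.

```java
final class AsciiEscapeSketch {
    /** Writes every non-ASCII char, and the backslash itself, as backslash-u plus 4 hex digits. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0x7F || c == '\\') {
                out.append(String.format("\\u%04X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Reverses escape(); assumes the input was produced by escape(). */
    static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            if (s.startsWith("\\u", i) && i + 6 <= s.length()) {
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(s.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String stored = escape("5 \u20AC");
        System.out.println(stored);                              // pure ASCII
        System.out.println(unescape(stored).equals("5 \u20AC")); // round trip
    }
}
```

Escaping the backslash as well keeps the round trip unambiguous for
text that already contains backslashes.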

This change is for MMBase 1.7 and MMBase 1.8. I tested a previous
version of this code at the VPRO, and will test this exact same code
asap (once I have backported the changes to 1.7), but I don't expect
any trouble. At least it _will_ be thoroughly tested.

Diff attached for 1.8.

--
Pierre van Rooden
Mediapark, C 107 tel. +31 (0)35 6772815
"Anything worth doing is worth overdoing."
