CALL FOR: New attributes on StorageManager.
Called by: Michiel Meeuwissen Total tally on this call : +6
START OF VOTING: 2005-01-31 21:30 END OF CALL: 2005-02-03 21:30
YEA (6) : Kees Jongenburger, Pierre van Rooden, Rico Jansen, Rob Vermeulen, Daniel Ockeloen, Marcel Maatkamp
ABSTAIN (1) : Rob van Maris
NAY (0) :
VETO (0) :
No votes, assumed abstained (8): Eduard Witteveen, Jaco de Groot, Andre van Toly, Johannes Verelst, Nico Klasens, Gerard van Enk, Mark Huijser, Ernst Bunders
Result: call succeeded. The changes can be added.
Michiel Meeuwissen wrote:
CALL FOR: New attributes on StorageManager.
A long story follows; I'll add an abstract near the end.
I want to add a few new attributes to StorageManager and one to DatabaseStorageManager.
I'll first explain why we need it, then what I propose.
It has to do with bug #6569 : Taglibs Using wrong character encoding for post data, and the situation at VPRO.
The VPRO used to use editors that are encoded in ISO-8859-1. Their database (Informix) is also ISO-8859-1. A bit unexpectedly though, the characters specific to CP1252 did work too. This was possible because the Java strings were wrong after a post on an ISO-8859-1 page on which CP1252 was posted (most noticeably LEFT and RIGHT SINGLE QUOTATION MARK, which are often produced by copy-pasting from MS-Word). So, after such a post, the bytes were wrongly supposed to be ISO-8859-1, while they are actually CP1252. Fixing that is a bugfix, because taglib should of course always do the correct thing, even if that does not seem entirely logical.
After that fix, though, the Strings become 'correct', and you will see that it appears to actually work worse, because now you will not be able to store the CP1252 characters in the ISO-8859-1 database any more (which makes sense). Furthermore, such correct strings display worse on an ISO-8859-1 page, because Java will not produce CP1252 if you request ISO-8859-1 (which makes sense too). If you request ISO-8859-1 for a 'wrong' string, Java produces CP1252 bytes, which can be considered a lucky coincidence, and these are then displayed as intended by most browsers (which is sensible, because the CP1252-specific characters only occupy positions that ISO-8859-1 reserves for control characters).
That the situation with 'wrong' strings is not desirable is shown by the fact that you cannot actually use a UTF-8 page then, because then of course the 'wrong' characters will only produce garbage, and you are limited to the actual ISO-8859-1 characters.
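To make the difference between 'wrong' and 'correct' strings concrete, here is a minimal sketch in plain Java. It does not use any MMBase code; the class and method names are made up for illustration, and it assumes the JVM provides the "windows-1252" charset (which is Java's name for CP1252).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class WrongStringDemo {
    // Assumption: the JVM ships the windows-1252 (CP1252) charset.
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Old behaviour: posted bytes are assumed to be ISO-8859-1, giving a 'wrong' string. */
    static String decodeWrong(byte[] posted) {
        return new String(posted, StandardCharsets.ISO_8859_1);
    }

    /** Fixed behaviour: posted bytes are decoded as what the browser really sent. */
    static String decodeCorrect(byte[] posted) {
        return new String(posted, CP1252);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] posted = { (byte) 0x92 }; // RIGHT SINGLE QUOTATION MARK in CP1252

        String wrong = decodeWrong(posted);     // U+0092, a control character
        String correct = decodeCorrect(posted); // U+2019, the real quotation mark

        // ISO-8859-1 output: the 'wrong' string yields the CP1252 byte again,
        // the 'correct' string cannot be encoded and becomes a question mark.
        System.out.println(hex(wrong.getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println(hex(correct.getBytes(StandardCharsets.ISO_8859_1)));

        // UTF-8 output: only the 'correct' string yields the quotation mark.
        System.out.println(hex(wrong.getBytes(StandardCharsets.UTF_8)));
        System.out.println(hex(correct.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The 'wrong' string only works as long as every step of the chain keeps pretending the bytes are ISO-8859-1; as soon as a UTF-8 page is requested, the control character is what gets sent.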
I hope the above more or less clearly represents the problem. I produced much longer mails and discussions at the VPRO about the issue, which I think mainly proves that it simply _is_ difficult...
Anyhow, my proposal is a bit more straightforward:
First of all, the VPRO has CP1252 in their database, and the front-end is mostly ISO-8859-1, but of course they don't want question marks on their pages for all the CP1252 data (quotation marks, euro signs), of which there is plenty.
So, I'll fix that CP1252 is interpreted as such if the database is ISO-8859-1. That can never do harm, because there are no ISO-8859-1 characters which are not in the same place in CP1252. Now you fetch 'correct' strings in any case.
Then, I propose the possibility to provide 'surrogators' at database level. A surrogator is something which translates 'impossible characters' to something which comes close enough but is not the real thing. E.g. it can replace the euro sign with the word 'EURO'.
Concretely, that is an option 'GET_SURROGATOR', which should point to an org.mmbase.util.transformers.CharTransformer implementation.
StorageManagerFactory gets a new method 'getGetSurrogator()', for use in the StorageManager implementations.
This method can then be used after acquiring the String from the database in DatabaseStorageManager. If I then set this to 'org.mmbase.util.transformers.CP1252Surrogator', I ensure that the Strings are really ISO-8859-1 compatible and can be displayed acceptably on an ISO-8859-1 page without displaying 'unknown characters' for the MS-Word quotes (those become simple ASCII quotes) etc.
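As a rough illustration of what a surrogator does, here is a self-contained sketch. It is not the CP1252Surrogator from the diff and does not implement the MMBase CharTransformer interface; java.util.function.UnaryOperator is used as a stand-in, and the class name and the exact set of replacements are assumptions chosen for the example.

```java
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a surrogator; not the MMBase implementation.
final class Cp1252SurrogatorSketch implements UnaryOperator<String> {

    @Override
    public String apply(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            switch (c) {
                case '\u2018': // LEFT SINGLE QUOTATION MARK
                case '\u2019': // RIGHT SINGLE QUOTATION MARK
                    out.append('\'');
                    break;
                case '\u201C': // LEFT DOUBLE QUOTATION MARK
                case '\u201D': // RIGHT DOUBLE QUOTATION MARK
                    out.append('"');
                    break;
                case '\u20AC': // EURO SIGN
                    out.append("EURO");
                    break;
                default:
                    // Everything else is passed through unchanged.
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String fromDatabase = "\u2018quoted\u2019 costs \u20AC 5";
        System.out.println(new Cp1252SurrogatorSketch().apply(fromDatabase));
    }
}
```

The result contains only characters that exist in ISO-8859-1, so it can be written to an ISO-8859-1 page without question marks, at the price of not showing the real characters.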
So, this fixes the 'display' path (from database to front-end), and as long as there are ISO-8859-1 pages there, this option can be on, and everything works. If the option is off, you can use Unicode pages, and the existing CP1252 characters really work (real euro signs will appear).
What remains is the issue of the 'storage' path (from editor front-end to database). The editors must be in Unicode (as are all editors in MMBase, like the JSP editors and editwizards). So, when strings arrive at the database layer there is a problem: the database actually stores only ISO-8859-1, so the euro symbol and the like can no longer be stored. This would seem undesirable.
I devised two solutions for this:
- You can also configure a 'SET_SURROGATOR', which ensures that surrogates are stored, which is at least better than question marks or something like that. No new CP1252 gets into the db this way, which makes sense because the database is ISO-8859-1.
- In DatabaseStorageManager I created an option LIE_CP1252. This will make the Strings 'wrong' before feeding them to JDBC, so that they are the same as before, which proved to 'work'.
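The 'lie' of the second option can be sketched in a few lines of plain Java. This is only an illustration of the idea, not the code from the diff; the class and method names are hypothetical, and it assumes the JVM provides the "windows-1252" charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the LIE_CP1252 idea; not the MMBase code.
final class LieCp1252Sketch {
    // Assumption: the JVM ships the windows-1252 (CP1252) charset.
    static final Charset CP1252 = Charset.forName("windows-1252");

    /**
     * Makes a correct string 'wrong': its CP1252 bytes are re-read as
     * ISO-8859-1, so a driver that encodes the result as ISO-8859-1
     * ends up writing the CP1252 bytes.
     */
    static String lie(String correct) {
        return new String(correct.getBytes(CP1252), StandardCharsets.ISO_8859_1);
    }

    /** The reading side: bytes from an ISO-8859-1 database are interpreted as CP1252. */
    static String readAsCp1252(byte[] fromDatabase) {
        return new String(fromDatabase, CP1252);
    }

    public static void main(String[] args) {
        String euro = "\u20AC";
        // What an ISO-8859-1 database connection would write for the 'wrong' string.
        byte[] stored = lie(euro).getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(readAsCp1252(stored).equals(euro));
    }
}
```

Characters that do not exist in CP1252 either (Chinese, for example) are still lost at the getBytes step, so this option only rescues the CP1252-specific characters.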
There is one drawback to using Unicode editors when your backend cannot handle this: editors don't get immediate feedback that their Chinese characters were not actually stored well, because the Java string is still in the Node cache. So if you look at the object immediately after editing it, it looks as if the Chinese was simply stored correctly. Only some time later, or even only after a server restart, will it become clear that the characters were actually replaced by question marks, which may baffle the editor a bit.
So I propose that setStringValue of DatabaseStorageManager (a mere protected method) returns the value it predicts 'getStringValue' will produce. In setValue this can be used as follows:
node.storeValue(fieldName, setStringValue(statement, index, value, field, node));
instead of only setStringValue(statement, index, value, field, node);
So the original value in the node is updated with the value that would actually be produced if it had come from the database. This will give the editors the feedback they expect.
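The effect on the cached node can be shown with a small stand-alone sketch. Nothing here is MMBase code: FakeNode is a hypothetical stand-in for a cached node, the setStringValue shown takes only the value (the real method also takes the statement, index, field and node), and the prediction assumes a database that stores plain ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of 'store what the database will give back'.
final class PredictedValueSketch {

    /** Stand-in for a node in the node cache: field name to value. */
    static final class FakeNode {
        final Map<String, String> values = new HashMap<>();

        void storeValue(String fieldName, String value) {
            values.put(fieldName, value);
        }
    }

    /**
     * Pretends to bind the value to a statement and returns the value a
     * later read would produce. Assumption: the database is ISO-8859-1,
     * so characters outside that charset come back as question marks.
     */
    static String setStringValue(String value) {
        byte[] stored = value.getBytes(StandardCharsets.ISO_8859_1);
        return new String(stored, StandardCharsets.ISO_8859_1);
    }

    /** Updates the cached node with the predicted value, not the submitted one. */
    static void setValue(FakeNode node, String fieldName, String value) {
        node.storeValue(fieldName, setStringValue(value));
    }

    public static void main(String[] args) {
        FakeNode node = new FakeNode();
        setValue(node, "title", "caf\u00E9 \u4E2D\u6587");
        // The editor immediately sees the question marks instead of the Chinese.
        System.out.println(node.values.get("title"));
    }
}
```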
This feature could also be used to solve the issue of the 'maximal length' of string fields. I think this occurs in MySQL: if you have a field with a maximal length of 40 chars and you submit a string of 60 chars, the extra 20 chars will only disappear after the node is refetched from the database. I have not yet fixed/tested this, but I think this bug can now easily be solved, perhaps by an extra database option stating that it should perform this truncation.
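A possible shape for such a truncating prediction is sketched below. This is purely hypothetical, since the proposal itself says the feature is not yet written; the class name, method name and the maxLength parameter are assumptions, and the sketch simply cuts at Java char positions.

```java
// Hypothetical sketch; the proposal does not include this code.
final class TruncationSketch {

    /**
     * Returns the value a database with a fixed maximal field length would
     * hand back. Simplification: counts Java chars, so a character outside
     * the Basic Multilingual Plane could be cut in half.
     */
    static String predictStored(String value, int maxLength) {
        if (value.length() <= maxLength) {
            return value;
        }
        return value.substring(0, maxLength);
    }

    public static void main(String[] args) {
        StringBuilder sixty = new StringBuilder();
        for (int i = 0; i < 60; i++) {
            sixty.append('x');
        }
        // A 60-char value submitted to a 40-char field.
        System.out.println(predictStored(sixty.toString(), 40).length());
    }
}
```

Storing the returned value in the node would make the truncation visible to the editor right away, in the same way as the encoding prediction.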
ABSTRACT
- A new attribute SET_SURROGATOR and corresponding method on StorageManagerFactory.
- A new attribute GET_SURROGATOR and corresponding method on StorageManagerFactory.
- A new attribute LIE_CP1252 on DatabaseStorageManager. No method; it is only checked in setStringValue if the database encoding is ISO-8859-1. It will ensure that you can store CP1252 in an ISO-8859-1 database. More or less the same behaviour as browsers, which submit CP1252 for ISO-8859-1 anyway.
- All of the above attributes will be off by default, so nothing will change for most people. VPRO will use GET_SURROGATOR (except for editor installations), and LIE_CP1252 in their informix.xml. Since I will fix the erroneous form handling anyway, it may be that you need this option if you depended on it (as the VPRO did). The SET/GET surrogators can also be used for other reasons, with other 'chartransformers' (for example you could surrogate Unicode with 'escaped' characters, and unescape on get, which enables you to store Unicode on completely ASCII-only targets).
There will be no consequences whatsoever if you are using unicode everywhere, as you should :-)
- setStringValue() of DatabaseStorageManager will return the 'expected value for getStringValue', to update the node itself, which may be in the node-cache.
This will fix:
#6632 Open Bug Medium 1.7.2 Misc Cache entries are not garanteed to accord with database content.
This change is for MMBase 1.7 and MMBase 1.8. I tested a previous version of this code @vpro, and will test this exact same code asap (once I have backported the changes to 1.7), but I don't expect any trouble. At least it _will_ be thoroughly tested.
Diff attached for 1.8.
-- Pierre van Rooden Mediapark, C 107 tel. +31 (0)35 6772815 "Anything worth doing is worth overdoing."
