> I'd be much more impressed by seeing a road map for how we get to a
> useful amount of added functionality --- which, to my mind, would be
> the ability to support N different encodings in one database, for N>2.
> But even if you think N=2 is sufficient, we haven't got a road map, and
> commandeering spec-mandated syntax for an inadequate feature doesn't seem
> like a good first step. It'll just make our backwards-compatibility
> problems even worse when somebody does come up with a real solution.
I have been thinking about this for years, and I think the key idea is to implement a "universal encoding". To support N>2 encodings in a database, the universal encoding should have the following characteristics:

1) round-trip conversion to and from existing encodings is lossless

2) no mapping table is necessary to convert from/to existing encodings

Once we implement the universal encoding, other problems, such as the "pg_database with multiple encodings" problem, can be solved easily. Currently no such universal encoding exists anywhere, so I think the only way is to invent it ourselves.

At this point the design I have in mind has two forms (a rough C sketch of both appears below, after the signature):

1) A 1-byte encoding identifier plus a 7-byte body (totalling 8 bytes). The identifier's value is between 0x80 and 0xff, and each value is assigned to an existing encoding such as UTF-8, ASCII, EUC-JP and so on. The encodings should be limited to "database safe" encodings. The body holds the raw character bytes as represented in the existing encoding. This form is called a "word".

2) A "multibyte" representation of the universal encoding. The first byte gives the length of the multibyte character (similar to the first byte of UTF-8). The second byte is the encoding identifier explained above. The rest of the character is the same as above.

#1 and #2 are logically the same and can be converted to each other, so we can use either form whenever we like. Form #1 is easy to handle because each word has a fixed length (8 bytes), so it would probably be used for temporary data in memory. Form #2 saves space and would be used for the stored data itself.

If we want a table encoded in an encoding different from the database encoding, the table is encoded in the universal encoding. pg_class should record that fact to avoid confusion about which encoding a table is using. I think the majority of tables in a database will use the same encoding as the database encoding; only a few tables will want a different one. The design pushes the penalty onto that minority.

If we need to join two tables that have different encodings, we need to convert them to the same encoding (this should succeed if the encodings are "compatible"). If the conversion fails, the join fails too.

We could extend the technique above to a design that allows each column to have a different encoding.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
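Below is a minimal, self-contained sketch in C of the two forms and the round trip between them. It is only an illustration of the idea above, not code from the proposal: the UENC_* identifier values are invented, the length byte is assumed to count the whole sequence including itself, and the word body is assumed to be zero-padded on the right (which relies on no database-safe encoding using a 0x00 byte inside a character).

#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Hypothetical encoding identifiers (0x80-0xff); real assignments TBD. */
#define UENC_ASCII  0x80
#define UENC_UTF8   0x81
#define UENC_EUC_JP 0x82

/* Form #1: fixed-length 8-byte "word", convenient for in-memory data. */
typedef struct
{
    uint8_t     id;         /* encoding identifier, 0x80-0xff */
    uint8_t     body[7];    /* raw character bytes, zero-padded */
} UniWord;

/* Pack one character's raw bytes into a word; -1 if it doesn't fit. */
static int
uni_word_pack(UniWord *w, uint8_t id, const uint8_t *raw, int rawlen)
{
    if (rawlen < 1 || rawlen > 7)
        return -1;
    w->id = id;
    memset(w->body, 0, sizeof(w->body));
    memcpy(w->body, raw, rawlen);
    return 0;
}

/*
 * Form #2: variable-length on-disk representation:
 * [total length] [encoding id] [raw bytes...]
 * Returns the number of bytes written, or -1 on error.
 */
static int
uni_word_to_multibyte(const UniWord *w, uint8_t *out, int outlen)
{
    int         bodylen = 7;

    while (bodylen > 1 && w->body[bodylen - 1] == 0)
        bodylen--;              /* strip the zero padding */
    if (outlen < bodylen + 2)
        return -1;
    out[0] = (uint8_t) (bodylen + 2);   /* total length, incl. this byte */
    out[1] = w->id;
    memcpy(out + 2, w->body, bodylen);
    return bodylen + 2;
}

/* The reverse conversion: multibyte form back to a fixed-length word. */
static int
uni_multibyte_to_word(const uint8_t *in, UniWord *w)
{
    int         total = in[0];

    if (total < 3 || total > 9)
        return -1;
    return uni_word_pack(w, in[1], in + 2, total - 2);
}

int
main(void)
{
    /* hiragana "a" in EUC-JP is the two bytes 0xa4 0xa2 */
    uint8_t     raw[] = {0xa4, 0xa2};
    uint8_t     buf[9];
    UniWord     w, w2;
    int         n;

    uni_word_pack(&w, UENC_EUC_JP, raw, sizeof(raw));
    n = uni_word_to_multibyte(&w, buf, sizeof(buf));
    uni_multibyte_to_word(buf, &w2);
    printf("multibyte length: %d, round trip ok: %d\n",
           n, memcmp(&w, &w2, sizeof(w)) == 0);
    return 0;
}

Note that the round trip never consults a mapping table: the raw bytes of the existing encoding are carried through unchanged, which is exactly what characteristics 1) and 2) above require.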