Re: [HACKERS] Extensions, patch v20 (bitrot fixes)
On Mon, Dec 20, 2010 at 09:03:56AM +0900, Itagaki Takahiro wrote: On Mon, Dec 20, 2010 at 01:34, Tom Lane t...@sss.pgh.pa.us wrote: I agree that the default encoding is UTF-8, but it should be configurable by the 'encoding' parameter in control files. Why is it necessary to have such a parameter at all? UTF-8 is not a superset of all encodings. I think you mean Unicode is not a superset of all character sets. I've heard this before but never found what's missing. [citation needed]? Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ Patriotism is when love of your own people comes first; nationalism, when hate for people other than your own comes first. - Charles de Gaulle signature.asc Description: Digital signature
Re: [HACKERS] Extensions, patch v20 (bitrot fixes)
On Mon, Dec 20, 2010 at 08:01:42PM +0100, Martijn van Oosterhout wrote: On Mon, Dec 20, 2010 at 09:03:56AM +0900, Itagaki Takahiro wrote: On Mon, Dec 20, 2010 at 01:34, Tom Lane t...@sss.pgh.pa.us wrote: I agree that the default encoding is UTF-8, but it should be configurable by the 'encoding' parameter in control files. Why is it necessary to have such a parameter at all? UTF-8 is not a superset of all encodings. I think you mean Unicode is not a superset of all character sets. I've heard this before but never found what's missing. [citation needed]? Windows-1252, ISO-2022-JP-2 and EUC-TW are such encodings. Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extensions, patch v20 (bitrot fixes)
David Fetter da...@fetter.org writes: On Mon, Dec 20, 2010 at 08:01:42PM +0100, Martijn van Oosterhout wrote: I think you mean Unicode is not a superset of all character sets. I've heard this before but never found what's missing. [citation needed]? Windows-1252, ISO-2022-JP-2 and EUC-TW are such encodings. [citation needed]? Exactly what characters are missing, and why would the Unicode people have chosen to leave them out? It's not like they've not heard of those encodings, I'm sure. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extensions, patch v20 (bitrot fixes)
On Mon, Dec 20, 2010 at 02:10:39PM -0500, Tom Lane wrote: David Fetter da...@fetter.org writes: On Mon, Dec 20, 2010 at 08:01:42PM +0100, Martijn van Oosterhout wrote: I think you mean Unicode is not a superset of all character sets. I've heard this before but never found what's missing. [citation needed]? Windows-1252, ISO-2022-JP-2 and EUC-TW are such encodings. [citation needed]? Exactly what characters are missing, and why would the Unicode people have chosen to leave them out? It's not like they've not heard of those encodings, I'm sure. regards, tom lane Here is an interesting description of some of the gotchas: http://en.wikipedia.org/wiki/Windows-1252 Regards, Ken -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extensions, patch v20 (bitrot fixes)
On Dec 20, 2010, at 11:53 AM, Kenneth Marshall wrote: Here is an interesting description of some of the gotchas: http://en.wikipedia.org/wiki/Windows-1252 FWIW, those are gotchas translating between Windows 1252 and Latin-1. Windows 1252's nerbles translate to UTF-8 just fine. David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extensions, patch v20 (bitrot fixes)
Kenneth Marshall k...@rice.edu writes: On Mon, Dec 20, 2010 at 02:10:39PM -0500, Tom Lane wrote: [citation needed]? Exactly what characters are missing, and why would the Unicode people have chosen to leave them out? It's not like they've not heard of those encodings, I'm sure. Here is an interesting description of some of the gotchas: http://en.wikipedia.org/wiki/Windows-1252 Well, it's interesting, but I see no glyphs on that page that lack Unicode assignments. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extensions, patch v20 (bitrot fixes)
On Mon, Dec 20, 2010 at 03:08:48PM -0500, Tom Lane wrote: Kenneth Marshall k...@rice.edu writes: On Mon, Dec 20, 2010 at 02:10:39PM -0500, Tom Lane wrote: [citation needed]? Exactly what characters are missing, and why would the Unicode people have chosen to leave them out? It's not like they've not heard of those encodings, I'm sure. Here is an interesting description of some of the gotchas: http://en.wikipedia.org/wiki/Windows-1252 Well, it's interesting, but I see no glyphs on that page that lack Unicode assignments. regards, tom lane You are correct. I mis-read the text. Regards, Ken -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extensions, patch v20 (bitrot fixes)
2010/12/20 Martijn van Oosterhout klep...@svana.org: On Mon, Dec 20, 2010 at 09:03:56AM +0900, Itagaki Takahiro wrote: UTF-8 is not a superset of all encodings. I think you mean Unicode is not a superset of all character sets. I've heard this before but never found what's missing. [citation needed]? From URL:http://en.wikipedia.org/wiki/Japanese_language_and_computers#Character_encodings: Unicode is supposed to solve all encoding problems in all languages of the world. [..] There are still controversies. For Japanese, the kanji characters have been unified with Chinese, that is a character considered to be the same in both Japanese and Chinese have been given one and the same code number in Unicode, even if they look a little different. This process, called Han unification, has caused controversy. For examples (my browser doesn't show any differences though, probably because I don't have the corresponding fonts): URL:http://en.wikipedia.org/wiki/Han_unification#Examples_of_language_dependent_characters Nicolas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extensions, patch v20 (bitrot fixes)
On Mon, Dec 20, 2010 at 10:15:56PM +0100, Nicolas Barbier wrote: From URL:http://en.wikipedia.org/wiki/Japanese_language_and_computers#Character_encodings: Unicode is supposed to solve all encoding problems in all languages of the world. [..] There are still controversies. For Japanese, the kanji characters have been unified with Chinese, that is a character considered to be the same in both Japanese and Chinese have been given one and the same code number in Unicode, even if they look a little different. This process, called Han unification, has caused controversy. From http://en.wikipedia.org/wiki/CJK_Unified_Ideographs: However, the source separation rule states that characters encoded separately in an earlier character set would remain separate in the new Unicode encoding. From all the references I've seen this has been applied everywhere and any failures to roundtrip conversions are considered bugs and I can't believe that at this point they havn't all been fixed. This is kind of underscored by the fact that references always point to theoretical problems rather than actual lists of characters that can't be converted. ISTM that since all the mapping tables are public it should be a SMOP to *prove* roundtrip conversions are safe, or identify the problems. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ Patriotism is when love of your own people comes first; nationalism, when hate for people other than your own comes first. - Charles de Gaulle signature.asc Description: Digital signature
Re: [HACKERS] Extensions, patch v20 (bitrot fixes)
On Tue, Dec 21, 2010 at 08:04, Martijn van Oosterhout klep...@svana.org wrote: On Mon, Dec 20, 2010 at 10:15:56PM +0100, Nicolas Barbier wrote: From URL:http://en.wikipedia.org/wiki/Japanese_language_and_computers#Character_encodings: ISTM that since all the mapping tables are public it should be a SMOP to *prove* roundtrip conversions are safe, or identify the problems. Another issue in Japanese users is EUDC (End User Defined Character). Unfortunately for both postgres developers and application developers in Japan, many machine dependence characters are still used in popular mobile phones in Japan. Their native encoding is SHIFT_JIS, and we have an EUDC mapping for SHIFT_JIS to/from EUC_JP. But we don't have for UTF-8 to/from other encodings. That is one of the reasons why we cannot move to the UTF-8 world completely. Imagine that a module that manipulate EUDC text. It will be written in EUC_JP because SHIFT_JIS is not supported in postgres. Also, it cannot be rewritten in UTF-8 because there are no mapping for the characters used in it. -- Itagaki Takahiro -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extensions, patch v20 (bitrot fixes)
Hi, Thanks for your review and your time. Trying to answer some of your points there: Robert Haas robertmh...@gmail.com writes: I spent a little time looking at this tonight. I'm going to give you the same general advice that I've given other people who have submitted very large patches of this type: it'll be a lot easier to get this committed if we can break it into smaller pieces. A pretty easy thing to do would be to separate the changes that turn the existing contrib modules into extensions into its own patch. Those are pretty mechanical, so applying them after would be quite easy. That said I don't know exactly how much it would ease the review, but if you say it does, I'll separate the patches. A possibly more controversial idea is to actually remove the notion of a relocatable extension from this patch, and all the supporting machinery, and make that a follow-on patch. I think it was arranged like that before and somehow got collapsed, but maybe the notion is worth revisiting, because this is a lot of code: 224 files changed, 4713 insertions(+), 293 deletions(-) It was like that before indeed. The problem is how to reliably implement the CREATE EXTENSION WITH SCHEMA, which is like the second best thing about all this infrastructure work just behind the pg_dump support. I'm not clear that this option should come on its own in a later patch or that doing so simplifies anything, but not opinionated about that: let's organise ourselves as best as we are able to. Again, will split the code in a third patch if that's what's necessary. That issue aside, I think there is still a fair amount of cleanup and code review that needs to happen here before this can be considered for commit. Here are a few things I noticed on a first read-through: - pg_execute_sql_string() and pg_execute_sql_file() are documented, but are not actually part of the patch. Ouch, git ain't that magic after all. Will clean-up next version(s). - The description of the NO USER DATA option is almost content-free. It basically says this option is here because it wouldn't work if we didn't have this option. + (see xref linkend=functions-extension), this option allow + extension authors to prepare installation scripts that will avoid + creating user data when restoring. I'd need some specifics here to be able to do much better. My guess is that an example is what needed, and it would fit in the main section of the manual about it, 35.16. Putting it all together: create extension. - listDependentObjects() appears to be largely cut-and-pasted from some other function. It calls AcquireDeletionLock() and the comments mention constructing a list of objects to delete. It sure doesn't seem right to be acquiring AccessExclusiveLocks on lots of things in a function that's there to support the pg_extension_objects SRF. Ah well, true. AccessShareLock seems the minimum we can take, as we surely want to prevent the extension and its objects disappearing under us, right? - This bit certainly violates our message style guidelines: + elog(NOTICE, +Installing extension '%s' from '%s', in schema %s, %s user data, +stmt-extname, +get_extension_absolute_path(control-script), +schema ? schema : public, +create_extension_with_user_data ? with : without); User-facing messages need to be ereport, not elog, so they can be translated, and mustn't be assembled from pieces, as you've done with %s user data. This message is pretty useful for debug and review of the patch, but I'm not sure we want to keep after that... - Has the issue of changing custom_variable_classes from PGC_SIGHUP to PGC_SUSET been discussed? I am not sure whether that's an OK thing to do. If it is OK, then the documentation also needs updating. I've been talking about it since the very firsts versions of the patch on this list but didn't receive much answer. You have a detailed mail about that later in the thread, will start a new thread about that from there. - This comment looks like leftovers: + /* FIXME: add PGC_EXTENSION so that we don't abuse PGC_SIGHUP here? */ + SetConfigOption(custom_variable_classes, + newval, PGC_SIGHUP, PGC_S_EXTENSION); Apologies if I missed the previous discussion of this, but why are we adding a new GUC context? Well we're using PGC_SIGHUP and PGC_S_EXTENSION, should we use PGC_EXTENSION too is what the comment asks, so it's not very clear but not a leftover. We're adding a new GUC source here, not a new context. Let's include that in the new thread about cvc, though. - Did we decide to ditch the encoding parameter for extension scripts and mandate UTF-8? No we didn't, we decided that the default encoding is UTF-8 and that any contrib script defaults to UTF-8, so that it's not necessary to care about the 'encoding' parameter
Re: [HACKERS] Extensions, patch v20 (bitrot fixes)
On Sun, Dec 19, 2010 at 5:30 AM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote: Robert Haas robertmh...@gmail.com writes: I spent a little time looking at this tonight. I'm going to give you the same general advice that I've given other people who have submitted very large patches of this type: it'll be a lot easier to get this committed if we can break it into smaller pieces. A pretty easy thing to do would be to separate the changes that turn the existing contrib modules into extensions into its own patch. Those are pretty mechanical, so applying them after would be quite easy. That said I don't know exactly how much it would ease the review, but if you say it does, I'll separate the patches. Well, I don't know who is ultimately going to end up applying this. I can only say what helps me. If somebody else wants to jump in here and say they're good to review and commit the whole thing in one go, so be it... A possibly more controversial idea is to actually remove the notion of a relocatable extension from this patch, and all the supporting machinery, and make that a follow-on patch. I think it was arranged like that before and somehow got collapsed, but maybe the notion is worth revisiting, because this is a lot of code: 224 files changed, 4713 insertions(+), 293 deletions(-) It was like that before indeed. The problem is how to reliably implement the CREATE EXTENSION WITH SCHEMA, which is like the second best thing about all this infrastructure work just behind the pg_dump support. I'm not clear that this option should come on its own in a later patch or that doing so simplifies anything, but not opinionated about that: let's organise ourselves as best as we are able to. I'm not totally certain of it either, but I think it might. - The description of the NO USER DATA option is almost content-free. It basically says this option is here because it wouldn't work if we didn't have this option. + (see xref linkend=functions-extension), this option allow + extension authors to prepare installation scripts that will avoid + creating user data when restoring. I'd need some specifics here to be able to do much better. My guess is that an example is what needed, and it would fit in the main section of the manual about it, 35.16. Putting it all together: create extension. No, I think you need to describe what the option actually does, as opposed to what problem it solves. - listDependentObjects() appears to be largely cut-and-pasted from some other function. It calls AcquireDeletionLock() and the comments mention constructing a list of objects to delete. It sure doesn't seem right to be acquiring AccessExclusiveLocks on lots of things in a function that's there to support the pg_extension_objects SRF. Ah well, true. AccessShareLock seems the minimum we can take, as we surely want to prevent the extension and its objects disappearing under us, right? On two minutes thought, I'm not entirely sure that we need to take locks at all here, but if we do, AccessShareLock is certainly the right one. One idea I have is that we might want to move AcquireDeletionLock to objectaddress.c or maybe even somewhere in storage/lmgr, rename it to something like LockObjectByAddress, and give it a second argument for the lock strength. - This bit certainly violates our message style guidelines: + elog(NOTICE, + Installing extension '%s' from '%s', in schema %s, %s user data, + stmt-extname, + get_extension_absolute_path(control-script), + schema ? schema : public, + create_extension_with_user_data ? with : without); User-facing messages need to be ereport, not elog, so they can be translated, and mustn't be assembled from pieces, as you've done with %s user data. This message is pretty useful for debug and review of the patch, but I'm not sure we want to keep after that... Either fix it or take it out. If you're proposing this for commit, it shouldn't contain anything you don't want committed. - Has the issue of changing custom_variable_classes from PGC_SIGHUP to PGC_SUSET been discussed? I am not sure whether that's an OK thing to do. If it is OK, then the documentation also needs updating. I've been talking about it since the very firsts versions of the patch on this list but didn't receive much answer. You have a detailed mail about that later in the thread, will start a new thread about that from there. Sorry, I missed that discussion. As you know I try to follow -hackers pretty closely, but apparently not closely enough in this case. Can you provide links to the archives? - This comment looks like leftovers: + /* FIXME: add PGC_EXTENSION so that we don't abuse PGC_SIGHUP here? */ + SetConfigOption(custom_variable_classes, + newval, PGC_SIGHUP, PGC_S_EXTENSION); Apologies if
Re: [HACKERS] Extensions, patch v20 (bitrot fixes)
- Did we decide to ditch the encoding parameter for extension scripts and mandate UTF-8? No we didn't, we decided that the default encoding is UTF-8 and that any contrib script defaults to UTF-8, so that it's not necessary to care about the 'encoding' parameter in the control file there. Itagaki required that the extension code is encoding aware, and it wasn't clear that the control file 'encoding' parameter would be enough in his side of the world, so the encoding support is currently offered to authors in the control file and users still can override it in the CREATE EXTENSION command. I think 'encoding' parameter is enough. Of course embedding encoding specifiers in SQL files are better, but there would be technical problems. (Just for reference: http://www.python.org/dev/peps/pep-0263/ ) I'm about sure that we don't want to remove that support, and I think that the command part of it ain't required, but that's for people with complex encoding problems to tell us IMO. Oh, I wasn't aware that Itagaki-san had objected to Tom's proposal. I agree that the default encoding is UTF-8, but it should be configurable by the 'encoding' parameter in control files. If we use UTF-8 as the the default encoding, we need to treat 3 encodings at once (server, client, and UTF-8) anyway. So, I think no additional complexity will be added even if we support a configurable encoding as the third encoding. -- Itagaki Takahiro -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extensions, patch v20 (bitrot fixes)
Itagaki Takahiro itagaki.takah...@gmail.com writes: Oh, I wasn't aware that Itagaki-san had objected to Tom's proposal. I agree that the default encoding is UTF-8, but it should be configurable by the 'encoding' parameter in control files. Why is it necessary to have such a parameter at all? AFAICS it just adds complexity for little if any gain. Most extension files will probably be pure ASCII anyway. Dictionary files are *far* more likely to contain non-ASCII characters. If we've gotten along fine with requiring dictionary files to be UTF8, I can't see any reason why we can't or shouldn't take the same approach to extension files. So, I think no additional complexity will be added even if we support a configurable encoding as the third encoding. This is nonsense. The mere existence of the parameter requires code to support it and user documentation to explain it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extensions, patch v20 (bitrot fixes)
Tom Lane t...@sss.pgh.pa.us writes: Why is it necessary to have such a parameter at all? AFAICS it just adds complexity for little if any gain. Most extension files will probably be pure ASCII anyway. Dictionary files are *far* more likely to contain non-ASCII characters. If we've gotten along fine with requiring dictionary files to be UTF8, I can't see any reason why we can't or shouldn't take the same approach to extension files. Don't forget that extensions are not just contrib or third party Open Source software, but a lot of in-house code, mostly functions written in SQL and PLpgSQL. In non-English speaking countries, the parameter names and comments are typically not written in English. When we're talking Japan they have some quite specifics needs and I came to understand that the database encoding and the script encoding are not to be the same, usually. So I still vote for handling this parameter. So, I think no additional complexity will be added even if we support a configurable encoding as the third encoding. This is nonsense. The mere existence of the parameter requires code to support it and user documentation to explain it. The whole documentation needs to be: varlistentry termreplaceable class=parameterencoding/replaceable/term listitem para The literalencoding/literal in which the script file is read. /para /listitem /varlistentry The code to support that is on the order of 25 lines of code, once we get rid of the SQL command level support for it. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extensions, patch v20 (bitrot fixes)
On Mon, Dec 20, 2010 at 01:34, Tom Lane t...@sss.pgh.pa.us wrote: I agree that the default encoding is UTF-8, but it should be configurable by the 'encoding' parameter in control files. Why is it necessary to have such a parameter at all? UTF-8 is not a superset of all encodings. -- Itagaki Takahiro -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers