Re: [HACKERS] Multithreaded SIGPIPE race in libpq on Solaris
On 29 August 2014 01:04, Thomas Munro mu...@ip9.org wrote:

On 28 August 2014 23:45, Tom Lane t...@sss.pgh.pa.us wrote: I don't claim to be an expert on this stuff, but I had the idea that multithreaded environments were supposed to track signal state per-thread, not just per-process, precisely because of issues like this.

After some googling, I found reply #3 in https://community.oracle.com/thread/1950900?start=0&tstart=0 and various other sources which say that in Solaris versions >= 10 they changed SIGPIPE delivery from per-process (as specified by UNIX98) to per-thread (as specified by POSIX:2001). But we are on version 11, so my theory doesn't look great. (Though 9 is probably still in use out there somewhere...)

I discovered that the machine we saw the problem on was running Solaris 9 at the time, but has been upgraded since. So I think my sigwait race theory may have been correct, but we can just put this down to a historical quirk and forget about it, because Solaris 9 is basically deceased (extended support ends in October 2014). Sorry for the noise.

Best regards,
Thomas Munro

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why data of timestamptz does not store value of timezone passed to it?
On 08/29/2014 05:28 AM, Tom Lane wrote:

k...@rice.edu writes: On Thu, Aug 28, 2014 at 03:33:56PM -0400, Bruce Momjian wrote: So the standard requires storing of the original timezone in the data type? I was not aware of that.

I do not have a copy of the SQL 92 spec, but several references to the spec mention that it defined the time zone as a format SHH:MM, where S represents the sign (+ or -), which seems to be what PostgreSQL uses.

Yeah, the spec envisions timezone as being a separate numeric field (ie, a numeric GMT offset) within a timestamp with time zone. One of the ways in which the spec's design is rather broken is that there's no concept of real-world time zones with varying DST rules. Anyway, I agree with the upthread comments that it'd have been better if we'd used some other name for this datatype, and also that it's at least ten years too late to revisit the choice :-(.

regards, tom lane

What about an alias for timestamptz? The current name is really confusing. As for a timestamp + time-zone (not just the offset) data type, it would be very useful. For example, in Java they have 5 time types: LocalDate for representing dates (date in Postgres), LocalTime for representing times (time in Postgres), LocalDateTime to represent a date with a time (timestamp in Postgres), Instant to represent a point on the time-line (timestamptz in Postgres), and ZonedDateTime that models a point on the time-line with a time-zone. Having a type for a time-zone itself would be useful as well.
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
On Fri, 2014-08-22 at 12:34 -0400, Robert Haas wrote: Still doesn't look good here. On the same PPC64 machine I've been using for testing:

I have a new approach to the patch, which is to call a callback at each block allocation; child contexts inherit the callback from their parents. The callback could be defined to simply dereference and increment its argument, which would mean block allocations anywhere in the hierarchy are incrementing the same variable, avoiding the need to traverse the hierarchy.

I've made some progress, I think (thanks to both Robert and Tomas), but I have a bit more micro-optimization and testing to do. I'll mark it returned with feedback for now, though if I find the time I'll do more testing to see if the performance concerns are fully addressed.

Regards,
Jeff Davis
Re: [HACKERS] implement subject alternative names support for SSL connections
On 08/28/2014 07:28 PM, Alexey Klyukin wrote:

On Mon, Aug 25, 2014 at 12:02 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 08/24/2014 03:11 PM, Alexey Klyukin wrote: On Wed, Aug 20, 2014 at 11:53 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: The patch doesn't seem to support wildcards in alternative names. Is that on purpose?

Not really, we just did not have any use case for them, but it seems that RFC 5280 does not disallow them: "Finally, the semantics of subject alternative names that include wildcard characters (e.g., as a placeholder for a set of names) are not addressed by this specification. Applications with specific requirements MAY use such names, but they must define the semantics." I've added support for them in the next iteration of the patch attached to this email.

Hmm. So wildcards MAY be supported, but should we? I think we should follow the example of common browsers here, or OpenSSL or other SSL libraries; what do they do?

RFC 6125, section 6.4.4, "Checking of Common Names", says: "As noted, a client MUST NOT seek a match for a reference identifier of CN-ID if the presented identifiers include a DNS-ID, SRV-ID, URI-ID, or any application-specific identifier types supported by the client." So, to conform to that, we shouldn't check the Common Name at all if an alternative subject field is present. (Relying on OpenSSL's hostname-checking function is starting to feel more and more appetizing. At least it won't be our problem then.)

It would be good to add a little helper function that does the NULL-check, straight comparison, and wildcard check, for a single name. And then use that for the Common Name and all the Alternatives. That'll ensure that all the same rules apply whether the name is the Common Name or an Alternative (assuming that the rules are supposed to be the same; I don't know if that's true).

Thanks, common code has been moved into a separate new function.
Another question is how should we treat certificates with no CN and a non-empty SAN? Current code just bails out right after finding no CN section present, but the RFC (5280) says: "Further, if the only subject identity included in the certificate is an alternative name form (e.g., an electronic mail address), then the subject distinguished name MUST be empty (an empty sequence), and the subjectAltName extension MUST be present." which to me sounds like the possibility of coming across such certificates in the wild, although I personally see little use in them.

Yeah, I think a certificate without CN should be supported. See also RFC 6125, section 4.1, "Rules [for issuers of certificates]":

"5. Even though many deployed clients still check for the CN-ID within the certificate subject field, certification authorities are encouraged to migrate away from issuing certificates that represent the server's fully qualified DNS domain name in a CN-ID. Therefore, the certificate SHOULD NOT include a CN-ID unless the certification authority issues the certificate in accordance with a specification that reuses this one and that explicitly encourages continued support for the CN-ID identifier type in the context of a given application technology."

Certificates without a CN-ID are probably rare today, but they might start to appear in the future.

BTW, should we also support alternative subject names in the server, in client certificates? And if there are multiple names, which one takes effect? Perhaps better to leave that for a separate patch.

- Heikki
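The "little helper function" Heikki asks for can be sketched roughly like this. This is a hypothetical illustration of the rules under discussion (embedded-NUL check, straight case-insensitive comparison, single left-most wildcard label), not PostgreSQL's actual implementation:

```c
#include <stdbool.h>
#include <string.h>
#include <strings.h>

/*
 * Hypothetical sketch: validate one certificate name (Common Name or a SAN
 * entry) against the host we connected to.  certname_len is the length the
 * certificate claims for the string (e.g. from the ASN.1 structure), which
 * lets us reject names with embedded NUL bytes.
 */
static bool
cert_name_matches_host(const char *certname, size_t certname_len,
					   const char *host)
{
	/* NULL-check: embedded NUL bytes are a classic spoofing trick */
	if (strlen(certname) != certname_len)
		return false;

	/* wildcard check: a single left-most "*." matching exactly one label */
	if (strncmp(certname, "*.", 2) == 0)
	{
		const char *host_rest = strchr(host, '.');

		/* the wildcard must cover one non-empty label */
		if (host_rest == NULL || host_rest == host)
			return false;
		return strcasecmp(certname + 1, host_rest) == 0;
	}

	/* straight comparison; DNS names are case-insensitive */
	return strcasecmp(certname, host) == 0;
}
```

Using one such function for both the CN and every alternative name is what guarantees the same rules apply to both, as the message argues.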
Re: [HACKERS] psql \watch versus \timing
On 08/28/2014 02:46 PM, Fujii Masao wrote:

On Tue, Aug 26, 2014 at 4:55 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 08/25/2014 10:48 PM, Heikki Linnakangas wrote: Actually, perhaps it would be better to just copy-paste PSQLexec, and modify the copy to suit \watch's needs. (PSQLexecWatch? SendWatchQuery?). PSQLexec doesn't do much, and there isn't very much overlap between what \watch wants and what other PSQLexec callers want. \watch wants timing output, others don't. \watch doesn't want transaction handling.

Agreed. Attached is the revised version of the patch. I implemented PSQLexecWatch(), which sends the query, prints the results, and outputs the query execution time (if \timing is enabled). This patch was marked as ready for committer, but since I revised the code very much, I marked this as needs review again.

This comment:

   ... We use PSQLexecWatch,
!  * which is kind of cheating, but SendQuery doesn't let us suppress
!  * autocommit behavior.

is a bit strange now. PSQLexecWatch isn't cheating like reusing PSQLexec was; its whole purpose is to run \watch queries.

/*
 * Set up cancellation of 'watch' via SIGINT. We redo this each time
 * through the loop since it's conceivable something inside PSQLexec
 * could change sigint_interrupt_jmp.
 */

This should now say PSQLexecWatch. Other than that, looks good to me.

- Heikki
Re: [HACKERS] v4 protocol TODO item - Lazy fetch/stream of TOASTed values?
On 08/28/2014 02:14 PM, Craig Ringer wrote:

Hi all

Per the protocol todo: https://wiki.postgresql.org/wiki/Todo#Wire_Protocol_Changes do you think it's reasonable to allow for delayed sending of big varlena values and arrays in the protocol?

Yes.

- Heikki
Re: [HACKERS] pgbench throttling latency limit
On 08/27/2014 08:05 PM, Fabien COELHO wrote:

[...] Yeah, something like that. I don't think it would be necessary to set statement_timeout, you can inject that in your script or postgresql.conf if you want. I don't think aborting a transaction that's already started is necessary either. You could count it as LATE, but let it finish first.

I've implemented something along these simplified lines. The latency is not limited as such, but slow (over the limit) queries are counted and reported.

Ok, thanks. This now begs the question: in --rate mode, shouldn't the reported transaction latency also be calculated from the *scheduled* start time, not the time the transaction actually started? Otherwise we're using two different definitions of latency, one for the purpose of the limit, and another for reporting.

- Heikki
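The two competing latency definitions can be stated precisely. A small sketch (hypothetical types and names, not pgbench's actual code) of the relationship between a throttled transaction's scheduled start, actual start, and end time:

```c
#include <stdint.h>

/* Hypothetical per-transaction timestamps, all in microseconds. */
typedef struct TxnTimes
{
	int64_t		scheduled_start;	/* when --rate said it should begin */
	int64_t		actual_start;		/* when it really began */
	int64_t		end_time;			/* when it finished */
} TxnTimes;

/* schedule lag: how late the transaction started */
static int64_t
txn_lag(const TxnTimes *t)
{
	return t->actual_start - t->scheduled_start;
}

/* latency measured from the actual start (what was being reported) */
static int64_t
txn_exec_latency(const TxnTimes *t)
{
	return t->end_time - t->actual_start;
}

/* latency measured from the schedule (what the limit checks against) */
static int64_t
txn_sched_latency(const TxnTimes *t)
{
	return t->end_time - t->scheduled_start;
}
```

By construction the schedule-relative latency is always the lag plus the execution latency, which is the identity Fabien notes downthread when arguing that either definition carries the same information.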
Re: [HACKERS] LIMIT for UPDATE and DELETE
(2014/08/25 15:48), Etsuro Fujita wrote:

(2014/08/15 6:18), Rukh Meski wrote: Based on the feedback on my previous patch, I've separated only the LIMIT part into its own feature. This version plays nicely with inheritance. The intended use is splitting up big UPDATEs and DELETEs into batches more easily and efficiently.

Before looking into the patch, I'd like to know the use cases in more details.

Thanks for the input, Amit, Kevin and Jeff! I understand that the patch is useful.

I've looked at the patch a bit closely. Here is my initial thought about the patch. The patch places limit-counting inside ModifyTable, and works well for inheritance trees, but I'm not sure that that is the right way to go. I think that this feature should be implemented in a way that we can naturally extend to the ORDER-BY-LIMIT case in future. But honestly, the patch doesn't seem to take that into account; I might be missing something, though. What plan do you have for future extensibility? I think that the approach discussed in [1] would be promising, so ISTM that it would be better to implement this feature by allowing for plans in the form of, e.g., ModifyTable+Limit+Append.

Thanks,

[1] http://www.postgresql.org/message-id/26819.1291133...@sss.pgh.pa.us

Best regards, Etsuro Fujita
Re: [HACKERS] pgbench throttling latency limit
Hello Heikki,

This now begs the question: In --rate mode, shouldn't the reported transaction latency also be calculated from the *scheduled* start time, not the time the transaction actually started? Otherwise we're using two different definitions of latency, one for the purpose of the limit, and another for reporting.

It could. Find a small patch **on top of v5** which does that. I've tried to update the documentation accordingly as well.

Note that the information is already there as the average lag time is reported, ISTM that:

  avg latency2 ~ avg lag + avg latency1

so it is just a matter of choice, both are ok somehow. I would be fine with both.

--
Fabien.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 96e5fb9..40427a3 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -1125,12 +1125,24 @@ top:
 		commands[st->state + 1] == NULL)
 	{
 		instr_time	diff;
-		int64		latency, now;
+		int64		latency;
 
 		INSTR_TIME_SET_CURRENT(diff);
-		now = INSTR_TIME_GET_MICROSEC(diff);
-		INSTR_TIME_SUBTRACT(diff, st->txn_begin);
-		latency = INSTR_TIME_GET_MICROSEC(diff);
+
+		if (throttle_delay)
+		{
+			/* Under throttling, compute latency wrt to the initial schedule,
+			 * not the actual transaction start.
+			 */
+			int64		now = INSTR_TIME_GET_MICROSEC(diff);
+
+			latency = now - thread->throttle_trigger;
+		}
+		else
+		{
+			INSTR_TIME_SUBTRACT(diff, st->txn_begin);
+			latency = INSTR_TIME_GET_MICROSEC(diff);
+		}
+
 		st->txn_latencies += latency;
 
 		/*
@@ -1144,16 +1156,8 @@ top:
 		/* record over the limit transactions if needed.
 		 */
-		if (latency_limit)
-		{
-			/* Under throttling, late means wrt to the initial schedule,
-			 * not the actual transaction start
-			 */
-			if (throttle_delay)
-				latency = now - thread->throttle_trigger;
-			if (latency > latency_limit)
-				thread->latency_late++;
-		}
+		if (latency_limit && latency > latency_limit)
+			thread->latency_late++;
 	}
 
 	/*
diff --git a/doc/src/sgml/pgbench.sgml b/doc/src/sgml/pgbench.sgml
index f80116f..453ae4b 100644
--- a/doc/src/sgml/pgbench.sgml
+++ b/doc/src/sgml/pgbench.sgml
@@ -420,8 +420,10 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
         Show progress report every <literal>sec</> seconds. The report
         includes the time since the beginning of the run, the tps since the
         last report, and the transaction latency average and standard
-        deviation since the last report. Under throttling (<option>-R</>), it
-        also includes the average schedule lag time since the last report.
+        deviation since the last report. Under throttling (<option>-R</>),
+        the latency is computed with respect to the transaction scheduled
+        start time, not the actual transaction beginning time, and it also
+        includes the average schedule lag time since the last report.
        </para>
      </listitem>
    </varlistentry>
Re: [HACKERS] LIMIT for UPDATE and DELETE
On 8/29/14 12:20 PM, Etsuro Fujita wrote:

The patch places limit-counting inside ModifyTable, and works well for inheritance trees, but I'm not sure that that is the right way to go. I think that this feature should be implemented in the way that we can naturally extend it to the ORDER-BY-LIMIT case in future. But honestly the patch doesn't seem to take into account that, I might be missing something, though.

The LIMIT part *has* to happen after the rows have been locked or it will work very surprisingly under concurrency (sort of like how FOR SHARE / FOR UPDATE worked before 9.0). So either it has to be inside ModifyTable, or ModifyTable has to somehow pass something to a Limit node on top of it which would then realize that the tuples from ModifyTable aren't supposed to be sent to the client (unless there's a RETURNING clause). I think it's a lot nicer to do the LIMITing inside ModifyTable, even though that duplicates a small portion of code that already exists in the Limit node.

What plan do you have for the future extensibility? I think that the approach discussed in [1] would be promising, so ISTM that it would be better to implement this feature by allowing for plans in the form of eg, ModifyTable+Limit+Append.

I don't see an approach discussed there, just a listing of problems with no solutions.

This is just my personal opinion, but what I think should happen is:

1) We put the LIMIT inside ModifyTable like this patch does. This doesn't prevent us from doing ORDER BY in the future, but helps numerous people who today have to

2) We allow ORDER BY on tables with no inheritance children using something similar to Rukh's previous patch.

3) Someone rewrites how UPDATE works based on Tom's suggestion here: http://www.postgresql.org/message-id/1598.1399826...@sss.pgh.pa.us, which allows us to support ORDER BY on all tables (or perhaps maybe not FDWs, I don't know how those work). The LIMIT functionality in this patch is unaffected.

Now, I know some people disagree with me on whether step #2 is worth taking or not, but that's a separate discussion. My point w.r.t. this patch still stands: I don't see any forwards-compatibility problems with this approach, nor do I really see any viable alternatives either.

.marko
Re: [HACKERS] LIMIT for UPDATE and DELETE
On 8/29/14 1:53 PM, I wrote:

This is just my personal opinion, but what I think should happen is: 1) We put the LIMIT inside ModifyTable like this patch does. This doesn't prevent us from doing ORDER BY in the future, but helps numerous people who today have to

Oops, looks like I didn't finish my thought here. ".. but helps numerous people who today have to achieve the same thing via tedious, slow and problematic subqueries (or a choose-any-two combination of these)."

.marko
Re: [HACKERS] Question about coding of free space map
On 08/26/2014 05:13 AM, Tatsuo Ishii wrote:

While looking into backend/storage/freespace/freespace.c, I noticed that struct FSMAddress is passed to functions by value, rather than by reference. I thought our code practice is defining a pointer to a struct and using the pointer for parameter passing etc.:

typedef struct RelationData *Relation;

IMO freespace.c is better to follow the practice.

There isn't really any strict coding rule on that. We pass RelFileNodes by value in many functions, for example. IMHO it's clearer to pass small structs like this by value. For example, it irritates me that we tend to pass ItemPointers by reference, when it's a small struct that represents a single value. I think of it as an atomic value, like an integer, so it feels wrong to pass it by reference.

- Heikki
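Heikki's point can be illustrated with a toy example (hypothetical type and field names, not the real FSMAddress definition): a small, value-like struct behaves just like an integer when passed and returned by value, and the callee cannot accidentally modify the caller's copy:

```c
/*
 * Hypothetical small address struct, in the spirit of FSMAddress: two ints
 * identifying a level in a tree and a page within that level.
 */
typedef struct ExampleAddress
{
	int			level;			/* level in the tree */
	int			logpageno;		/* page number within the level */
} ExampleAddress;

/*
 * Pass by value, return by value: the callee works on its own copy, so
 * there is no aliasing and the call reads like arithmetic on an integer.
 */
static ExampleAddress
example_parent(ExampleAddress child, int slots_per_page)
{
	ExampleAddress parent;

	parent.level = child.level + 1;
	parent.logpageno = child.logpageno / slots_per_page;
	return parent;
}
```

For a two-int struct, the value-passing cost is comparable to passing a pointer, and the code gains the "atomic value" reading Heikki describes.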
Re: [HACKERS] [REVIEW] Re: Fix xpath() to return namespace definitions
Greetings,

Because of the memory bug in xmlCopyNode, this is a new patch with a different method. In this patch, instead of using xmlCopyNode to bring the namespace back, we add the required namespaces to the node before dumping the node to string, and clean them up afterwards (because the same node could be dumped again in the next xpath result).

Thanks,
Ali Akbar

2014-07-11 15:36 GMT+07:00 Ali Akbar the.ap...@gmail.com:

Greetings,

Attached modified patch to handle xmlCopyNode returning NULL. The patch is larger because xmlerrcxt must be passed to xml_xmlnodetoxmltype (xmlerrcxt is needed for calling xml_ereport). From the libxml2 source, it turns out that the other cases in which xmlCopyNode will return NULL will not happen. So in this patch I assume that the only case is memory exhaustion.

But I found some bug in libxml2's code: because we call xmlCopyNode with 1 as the second parameter, internally xmlCopyNode will call xmlStaticCopyNode recursively via xmlStaticCopyNodeList. And xmlStaticCopyNodeList doesn't check whether the return of xmlStaticCopyNode is NULL. So if someday the recursion returns NULL (maybe because of memory exhaustion), it will SEGFAULT. Knowing this bug in libxml2's code, is this patch still acceptable?

Thanks,
Ali Akbar

--
Ali Akbar

diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index 422be69..f34c87e 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -141,9 +141,11 @@
 static bool print_xml_decl(StringInfo buf, const xmlChar *version,
 			   pg_enc encoding, int standalone);
 static xmlDocPtr xml_parse(text *data, XmlOptionType xmloption_arg,
 		  bool preserve_whitespace, int encoding);
-static text *xml_xmlnodetoxmltype(xmlNodePtr cur);
+static text *xml_xmlnodetoxmltype(xmlNodePtr cur,
+					 PgXmlErrorContext *xmlerrcxt);
 static int	xml_xpathobjtoxmlarray(xmlXPathObjectPtr xpathobj,
-					   ArrayBuildState **astate);
+					   ArrayBuildState **astate,
+					   PgXmlErrorContext *xmlerrcxt);
 #endif   /* USE_LIBXML */
 
 static StringInfo query_to_xml_internal(const char *query, char *tablename,
@@ -3590,27 +3592,109 @@ SPI_sql_row_to_xmlelement(int rownum, StringInfo result, char *tablename,
 
 #ifdef USE_LIBXML
 
+/* check ns definition of node and its childrens. If any one of ns is
+ * not defined in node and it's children, but in the node's parent,
+ * copy the definition to node.
+ */
+static void
+xml_checkandcopyns(xmlNodePtr root,
+				   PgXmlErrorContext *xmlerrcxt,
+				   xmlNodePtr node,
+				   xmlNsPtr lastns_before)
+{
+	xmlNsPtr	ns = NULL;
+	xmlNsPtr	cur_ns;
+	xmlNodePtr	cur;
+
+	if (node->ns != NULL)
+	{
+		/* check in new nses */
+		cur_ns = lastns_before == NULL ?
+			node->nsDef : lastns_before->next;
+		while (cur_ns != NULL)
+		{
+			if (cur_ns->href != NULL)
+			{
+				if (((cur_ns->prefix == NULL) && (node->ns->prefix == NULL)) ||
+					((cur_ns->prefix != NULL) && (node->ns->prefix != NULL) &&
+					 xmlStrEqual(cur_ns->prefix, node->ns->prefix)))
+				{
+					ns = cur_ns;
+					break;
+				}
+			}
+			cur_ns = cur_ns->next;
+		}
+		if (ns == NULL)			/* not in new nses */
+		{
+			ns = xmlSearchNs(NULL, node->parent, node->ns->prefix);
+
+			if (ns != NULL) {
+				ns = xmlNewNs(root, ns->href, ns->prefix);
+
+				if (ns == NULL && xmlerrcxt->err_occurred) {
+					xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
+								"could not allocate xmlNs");
+				}
+			}
+		}
+	}
+	/* check and copy ns for children recursively */
+	cur = node->children;
+	while (cur != NULL) {
+		xml_checkandcopyns(root, xmlerrcxt, cur, lastns_before);
+		cur = cur->next;
+	}
+}
+
 /*
  * Convert XML node to text (dump subtree in case of element,
  * return value otherwise)
  */
 static text *
-xml_xmlnodetoxmltype(xmlNodePtr cur)
+xml_xmlnodetoxmltype(xmlNodePtr cur, PgXmlErrorContext *xmlerrcxt)
 {
 	xmltype    *result;
 
 	if (cur->type == XML_ELEMENT_NODE)
 	{
 		xmlBufferPtr buf;
+		xmlNsPtr	lastns_before;
+		xmlNsPtr	ns;
+		xmlNsPtr	next;
 
 		buf = xmlBufferCreate();
+
 		PG_TRY();
 		{
+			lastns_before = cur->nsDef;
+			if (lastns_before != NULL) {
+				while (lastns_before->next != NULL) {
+					lastns_before = lastns_before->next;
+				}
+			}
+			xml_checkandcopyns(cur, xmlerrcxt, cur, lastns_before);
+
 			xmlNodeDump(buf, NULL, cur, 0, 1);
 			result = xmlBuffer_to_xmltype(buf);
+
+			/* delete and free new nses */
+			ns = lastns_before == NULL ?
+				cur->nsDef : lastns_before->next;
+			while (ns != NULL) {
+				next = ns->next;
+				xmlFree(ns);
+				ns = next;
+			}
+			if (lastns_before == NULL) {
+				cur->nsDef = NULL;
+			} else {
+				lastns_before->next = NULL;
+			}
 		}
 		PG_CATCH();
 		{
+			/* new namespaces will be freed while free-ing the node, so we
+			 * won't free it here
+			 */
 			xmlBufferFree(buf);
 			PG_RE_THROW();
 		}
@@ -3656,7 +3740,8 @@ xml_xmlnodetoxmltype(xmlNodePtr cur)
  */
 static int
 xml_xpathobjtoxmlarray(xmlXPathObjectPtr xpathobj,
-					   ArrayBuildState **astate)
+					   ArrayBuildState
[HACKERS] Misleading error message in logical decoding for binary plugins
Hi all,

Using a plugin producing binary output, I came across this error:

=# select data from pg_logical_slot_peek_changes('foo', NULL, NULL);
ERROR:  0A000: output plugin cannot produce binary output
LOCATION:  pg_logical_slot_get_changes_guts, logicalfuncs.c:404

Shouldn't the error message here be something like "binary output plugin cannot produce textual output"? The plugin used in my case produces binary output, but what is requested from it with pg_logical_slot_peek_changes is textual output. A patch is attached (with s/pluggin/plugin in bonus). Comments welcome.

Regards,
-- Michael

diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index a3a58e6..310b1e5 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -394,14 +394,14 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 	MemoryContextSwitchTo(oldcontext);
 
 	/*
-	 * Check whether the output pluggin writes textual output if that's
+	 * Check whether the output plugin writes textual output if that's
 	 * what we need.
 	 */
 	if (!binary &&
 		ctx->options.output_type != OUTPUT_PLUGIN_TEXTUAL_OUTPUT)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-				 errmsg("output plugin cannot produce binary output")));
+				 errmsg("binary output plugin cannot produce textual output")));
 
 	ctx->output_writer_private = p;
 
diff --git a/src/include/catalog/objectaccess.h b/src/include/catalog/objectaccess.h
index 4fdd056..1e1cb13 100644
--- a/src/include/catalog/objectaccess.h
+++ b/src/include/catalog/objectaccess.h
@@ -13,7 +13,7 @@
 /*
  * Object access hooks are intended to be called just before or just after
  * performing certain actions on a SQL object.  This is intended as
- * infrastructure for security or logging pluggins.
+ * infrastructure for security or logging plugins.
  *
  * OAT_POST_CREATE should be invoked just after the object is created.
  * Typically, this is done after inserting the primary catalog records and
Re: [HACKERS] Per table autovacuum vacuum cost limit behaviour strange
Mark Kirkwood wrote:

On 29/08/14 08:56, Alvaro Herrera wrote: Robert Haas wrote: I agree that you might not like that. But you might not like having the table vacuumed slower than the configured rate, either. My impression is that the time between vacuums isn't really all that negotiable for some people. I had one customer who had horrible bloat issues on a table that was vacuumed every minute; when we changed the configuration so that it was vacuumed every 15 seconds, those problems went away.

Wow, that's extreme. For that case you can set autovacuum_vacuum_cost_limit to 0, which disables the whole thing and lets vacuum run at full speed -- no throttling at all. Would that satisfy the concern?

Well no - you might have a whole lot of big tables that you want vacuum to not get too aggressive on, but a few small tables that are highly volatile. So you want *them* vacuumed really fast to prevent them becoming huge tables with only a few rows therein, but your system might not be able to handle *all* your tables being vacuumed at full speed.

I meant setting the cost limit to 0 *for those tables* only, not for all of them. Anyway, it seems to me maybe there is room for a new table storage parameter, say autovacuum_do_balance, which means to participate in the balancing program or not.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Misleading error message in logical decoding for binary plugins
On 2014-08-29 22:42:46 +0900, Michael Paquier wrote:

Hi all, Using a plugin producing binary output, I came across this error: =# select data from pg_logical_slot_peek_changes('foo', NULL, NULL); ERROR: 0A000: output plugin cannot produce binary output LOCATION: pg_logical_slot_get_changes_guts, logicalfuncs.c:404 Shouldn't the error message here be something like "binary output plugin cannot produce textual output"? The plugin used in my case produces binary output, but what is requested from it with pg_logical_slot_peek_changes is textual output.

I don't like the new message much. It's imo even more misleading than the old message. How about "output plugin produces binary output but the sink only accepts textual data"?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Misleading error message in logical decoding for binary plugins
On Fri, Aug 29, 2014 at 10:48 PM, Andres Freund and...@2ndquadrant.com wrote:

On 2014-08-29 22:42:46 +0900, Michael Paquier wrote: Hi all, Using a plugin producing binary output, I came across this error: =# select data from pg_logical_slot_peek_changes('foo', NULL, NULL); ERROR: 0A000: output plugin cannot produce binary output LOCATION: pg_logical_slot_get_changes_guts, logicalfuncs.c:404 Shouldn't the error message here be something like "binary output plugin cannot produce textual output"? The plugin used in my case produces binary output, but what is requested from it with pg_logical_slot_peek_changes is textual output.

I don't like the new message much. It's imo even more misleading than the old message. How about "output plugin produces binary output but the sink only accepts textual data"?

Sounds fine to me. I am sure that native English speakers will come up with additional suggestions as well.

Btw, I am having a hard time understanding in which case OUTPUT_PLUGIN_BINARY_OUTPUT is actually useful. Looking at the code, it is only a subset of what OUTPUT_PLUGIN_TEXTUAL_OUTPUT can do, as textual can generate both binary and textual output, while binary can only generate binary output. The only code path where a flag OUTPUT_PLUGIN_* is used is in this code path, for this very specific error message. On top of that, a logical receiver receives data in textual form even if the decoder is marked as binary when getting a message with PQgetCopyData. Perhaps I am missing something... But in this case I think that the documentation should have a more detailed explanation about the output format options. Currently only the option value is mentioned, and there is nothing about what they actually do. This is error-prone for the user.

Regards,
-- Michael
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
Jeff Davis pg...@j-davis.com writes:

I have a new approach to the patch which is to call a callback at each block allocation and child contexts inherit the callback from their parents. The callback could be defined to simply dereference and increment its argument, which would mean block allocations anywhere in the hierarchy are incrementing the same variable, avoiding the need to traverse the hierarchy.

Cute, but it's impossible to believe that a function call isn't going to be *even more* overhead than what you've got now. This could only measure out as a win when the feature is going unused.

What about removing the callback per se and just keeping the argument, as it were? That is, a MemoryContext has a field of type size_t * that is normally NULL, but if set to non-NULL, then we increment the pointed-to value for pallocs and decrement for pfrees.

One thought in either case is that we don't have to touch the API for MemoryContextCreate: rather, the feature can be enabled in a separate call after making the context.

regards, tom lane
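Tom's pointer-field variant can be sketched in a few lines (hypothetical names, not the real MemoryContext API): each context carries an optional size_t pointer; when it is set, every block allocation adds to the pointed-to total and every free subtracts, so all contexts sharing one pointer feed a single accumulator with no hierarchy traversal and no function-call indirection:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a MemoryContext carrying the accounting field. */
typedef struct SketchContext
{
	size_t	   *total_allocated;	/* NULL means accounting is disabled */
} SketchContext;

/*
 * Enabled after creation, per Tom's suggestion, so MemoryContextCreate's
 * API would not need to change.  A child context "inherits" the accounting
 * simply by being handed the same counter pointer.
 */
static void
sketch_enable_accounting(SketchContext *cxt, size_t *counter)
{
	cxt->total_allocated = counter;
}

static void *
sketch_alloc_block(SketchContext *cxt, size_t size)
{
	void	   *block = malloc(size);

	/* one test-and-add when enabled; a single test when unused */
	if (block != NULL && cxt->total_allocated != NULL)
		*cxt->total_allocated += size;
	return block;
}

static void
sketch_free_block(SketchContext *cxt, void *block, size_t size)
{
	if (block != NULL && cxt->total_allocated != NULL)
		*cxt->total_allocated -= size;
	free(block);
}
```

The attraction over the callback version is visible in sketch_alloc_block: the disabled case costs one pointer test, and the enabled case costs one addition, with no call through a function pointer on every block.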
Re: [HACKERS] Removing dependency to wsock32.lib when compiling code on WIndows
On Thu, Aug 21, 2014 at 11:29:31PM -0400, Noah Misch wrote: On Thu, Aug 21, 2014 at 09:18:22AM -0400, Andrew Dunstan wrote: What's happening about this? Buildfarm animal jacana is consistently red because of this. If nobody plans to do the aforementioned analysis in the next 4-7 days, I suggest we adopt one of Michael's suggestions: force configure to reach its old conclusion about getaddrinfo() on Windows. Then the analysis can proceed on an unhurried schedule. Done. Incidentally, jacana takes an exorbitant amount of time, most recently 87 minutes, to complete the ecpg-check step. Frogmouth takes 4 minutes, and none of the other steps have such a large multiplier between the two animals. That pattern isn't new and appears on multiple branches. I wonder if ecpg tickles a performance bug in 64-bit MinGW-w64. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Misleading error message in logical decoding for binary plugins
On 2014-08-29 23:07:44 +0900, Michael Paquier wrote: On Fri, Aug 29, 2014 at 10:48 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-08-29 22:42:46 +0900, Michael Paquier wrote: Hi all, Using a plugin producing binary output, I came across this error:

=# select data from pg_logical_slot_peek_changes('foo', NULL, NULL);
ERROR: 0A000: output plugin cannot produce binary output
LOCATION: pg_logical_slot_get_changes_guts, logicalfuncs.c:404

Shouldn't the error message here be something like "binary output plugin cannot produce textual output"? The plugin used in my case produces binary output, but what is requested from it with pg_logical_slot_peek_changes is textual output. I don't like the new message much. It's imo even more misleading than the old message. How about "output plugin produces binary output but the sink only accepts textual data"? Sounds fine for me. I am sure that native English speakers will come up as well with additional suggestions. Btw, I am having a hard time understanding in which case OUTPUT_PLUGIN_BINARY_OUTPUT is actually useful. Looking at the code it is only a subset of what OUTPUT_PLUGIN_TEXTUAL_OUTPUT can do as textual can generate both binary and textual output, while binary can only generate binary output.

No, a textual output plugin is *NOT* allowed to produce binary output. That'd violate e.g. pg_logical_slot_peek_changes's return type because it's only declared to return text. If you compile with assertions enabled, not adhering to that will actually result in an error:

/*
 * Assert ctx->out is in database encoding when we're writing textual
 * output.
 */
if (!p->binary_output)
    Assert(pg_verify_mbstr(GetDatabaseEncoding(), ctx->out->data,
                           ctx->out->len, false));

The only code path where a flag OUTPUT_PLUGIN_* is used is in this code path for this very specific error message. Well, LogicalDecodingContext->binary_output is checked too. 
On top of that, a logical receiver receives data in textual form even if the decoder is marked as binary when getting a message with PQgetCopyData. I have no idea what you mean by that. The data is sent in the format the output plugin writes the data in. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Misleading error message in logical decoding for binary plugins
On Fri, Aug 29, 2014 at 11:15 PM, Andres Freund and...@2ndquadrant.com wrote: No, a textual output plugin is *NOT* allowed to produce binary output. That'd violate e.g. pg_logical_slot_peek_changes's return type because it's only declared to return text.

A textual output plugin can call pg_logical_slot_peek_binary_changes and pg_logical_slot_peek_changes as well, and a binary output plugin can only call pg_logical_slot_peek_binary_changes, and will error out with pg_logical_slot_peek_changes:

=# select pg_create_logical_replication_slot('foo', 'test_decoding');
 pg_create_logical_replication_slot
------------------------------------
 (foo,0/16C6880)
(1 row)

=# create table aa as select 1;
SELECT 1

=# select substring(encode(data, 'escape'), 1, 20), substring(data, 1, 20)
     FROM pg_logical_slot_peek_binary_changes('foo', NULL, NULL);
      substring       |                 substring
----------------------+--------------------------------------------
 BEGIN 1000           | \x424547494e2031303030
 table public.aa: INS | \x7461626c65207075626c69632e61613a20494e53
 COMMIT 1000          | \x434f4d4d49542031303030
(3 rows)

=# select pg_logical_slot_peek_changes('foo', NULL, NULL, 'force-binary', 'true');
ERROR: 0A000: output plugin cannot produce binary output
LOCATION: pg_logical_slot_get_changes_guts, logicalfuncs.c:404

=# select substring(data, 1, 20) from pg_logical_slot_peek_binary_changes('foo', NULL, NULL, 'force-binary', 'true');
         substring
--------------------------------------------
 \x424547494e2031303030
 \x7461626c65207075626c69632e61613a20494e53
 \x434f4d4d49542031303030
(3 rows)

Is that expected? -- Michael
Re: [HACKERS] LIMIT for UPDATE and DELETE
Marko Tiikkaja ma...@joh.to writes: The LIMIT part *has* to happen after the rows have been locked or it will work very surprisingly under concurrency (sort of like how FOR SHARE / FOR UPDATE worked before 9.0). Good point. So either it has to be inside ModifyTable or the ModifyTable has to somehow pass something to a Limit node on top of it ... or we add a LockRows node below the Limit node. Yeah, that would make UPDATE/LIMIT a tad slower, but I think that might be preferable to what you're proposing anyway. Raw speed of what is fundamentally a fringe feature ought not trump every other concern. This is just my personal opinion, but what I think should happen is: 1) We put the LIMIT inside ModifyTable like this patch does. This doesn't prevent us from doing ORDER BY in the future, but helps numerous people who today have to 2) We allow ORDER BY on tables with no inheritance children using something similar to Rukh's previous patch. 3) Someone rewrites how UPDATE works based on Tom's suggestion here: http://www.postgresql.org/message-id/1598.1399826...@sss.pgh.pa.us, which allows us to support ORDER BY on all tables (or perhaps maybe not FDWs, I don't know how those work). The LIMIT functionality in this patch is unaffected. I still think we should skip #2 and go directly to work on #3. Getting rid of the unholy mess that is inheritance_planner would be a very nice thing. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Question about coding of free space map
Heikki Linnakangas hlinnakan...@vmware.com writes: On 08/26/2014 05:13 AM, Tatsuo Ishii wrote: While looking into backend/storage/freespace/freespace.c, I noticed that struct FSMAddress is passed to functions by value, rather than reference. There isn't really any strict coding rule on that. We pass RelFileNode's by value in many functions, for example. The only reason RelFileNodes work like that is that Robert blithely ignored the coding rule when he was revising things to pass those around. I've never been terribly happy about it, but it wasn't important enough to complain about. The cases where it *would* be important enough to complain about would be performance-critical code paths, which RelFileNode usages typically don't appear in (if you're messing with one you're most likely going to do a filesystem access). I'd be unhappy though if someone wanted to start passing ItemPointers by value. I doubt we can rely on C compilers to pass those as efficiently as they pass pointers. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
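For illustration, here is roughly what is at stake with a 6-byte ItemPointer-like struct. The layout and names below are simplified and hypothetical (the real ItemPointerData packs the block number into two uint16 halves much like this, but this is not PostgreSQL code): passing the struct by value copies all three fields at every call boundary, while the conventional by-reference style moves only a pointer:

```c
#include <stdint.h>
#include <string.h>

/* Rough, simplified shape of an ItemPointer: a block number split
 * into two uint16 halves plus an offset number, 6 bytes total. */
typedef struct ToyItemPointer
{
    uint16_t bi_hi;
    uint16_t bi_lo;
    uint16_t offset;
} ToyItemPointer;

/* By-value: the whole 6-byte struct is copied on each call; whether
 * that is as cheap as a pointer depends entirely on the ABI. */
int
tid_equal_byval(ToyItemPointer a, ToyItemPointer b)
{
    return a.bi_hi == b.bi_hi && a.bi_lo == b.bi_lo && a.offset == b.offset;
}

/* By-reference: only a pointer crosses the call boundary, which is
 * what the existing coding convention relies on for hot paths. */
int
tid_equal_byref(const ToyItemPointer *a, const ToyItemPointer *b)
{
    return memcmp(a, b, sizeof(ToyItemPointer)) == 0;
}
```

The concern in the message above is exactly that compilers cannot be relied on to pass the first form as efficiently as the second in performance-critical paths.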
Re: [HACKERS] Misleading error message in logical decoding for binary plugins
On 2014-08-29 23:31:49 +0900, Michael Paquier wrote: On Fri, Aug 29, 2014 at 11:15 PM, Andres Freund and...@2ndquadrant.com wrote: No, a textual output plugin is *NOT* allowed to produce binary output. That'd violate e.g. pg_logical_slot_peek_changes's return type because it's only declared to return text. A textual output plugin can call pg_logical_slot_peek_binary_changes and pg_logical_slot_peek_changes as well, Well, for one a output plugin doesn't call pg_logical_slot_peek_binary_changes, it's the other way round. And sure: Every text is also binary. So that's just fine. and a binary output plugin can only call pg_logical_slot_peek_binary_changes, and will error out with pg_logical_slot_peek_changes: =# select pg_create_logical_replication_slot('foo', 'test_decoding'); pg_create_logical_replication_slot (foo,0/16C6880) (1 row) =# create table aa as select 1; SELECT 1 =# select substring(encode(data, 'escape'), 1, 20), substring(data, 1, 20) FROM pg_logical_slot_peek_binary_changes('foo', NULL, NULL); substring | substring --+ BEGIN 1000 | \x424547494e2031303030 table public.aa: INS | \x7461626c65207075626c69632e61613a20494e53 COMMIT 1000 | \x434f4d4d49542031303030 (3 rows) =# select pg_logical_slot_peek_changes('foo', NULL, NULL, 'force-binary', 'true'); ERROR: 0A000: output plugin cannot produce binary output LOCATION: pg_logical_slot_get_changes_guts, logicalfuncs.c:404 =# select substring(data, 1, 20) from pg_logical_slot_peek_binary_changes('foo', NULL, NULL, 'force-binary', 'true'); substring \x424547494e2031303030 \x7461626c65207075626c69632e61613a20494e53 \x434f4d4d49542031303030 (3 rows) Is that expected? Yes. The output plugin declares whether it requires the *output method* to support binary data. pg_logical_slot_peek_changes *can not* support binary data because it outputs data as text. 
pg_logical_slot_peek_binary_changes *can* support binary data because it returns bytea (and thus it also can output text, because that's essentially a subset of binary data). Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Removing dependency to wsock32.lib when compiling code on WIndows
On 08/29/2014 10:15 AM, Noah Misch wrote: On Thu, Aug 21, 2014 at 11:29:31PM -0400, Noah Misch wrote: On Thu, Aug 21, 2014 at 09:18:22AM -0400, Andrew Dunstan wrote: What's happening about this? Buildfarm animal jacana is consistently red because of this. If nobody plans to do the aforementioned analysis in the next 4-7 days, I suggest we adopt one of Michael's suggestions: force configure to reach its old conclusion about getaddrinfo() on Windows. Then the analysis can proceed on an unhurried schedule. Done. Incidentally, jacana takes an exorbitant amount of time, most recently 87 minutes, to complete the ecpg-check step. Frogmouth takes 4 minutes, and none of the other steps have such a large multiplier between the two animals. That pattern isn't new and appears on multiple branches. I wonder if ecpg tickles a performance bug in 64-bit MinGW-w64. Good pickup. I wonder what it could be. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why data of timestamptz does not store value of timezone passed to it?
Tom Lane t...@sss.pgh.pa.us wrote: Kevin Grittner kgri...@ymail.com writes: But the standard doesn't say anything about storing a time zone *name* or *abbreviation* -- it requires that it be stored as UTC with the *offset* (in hours and minutes). That makes it pretty close to what we have -- it's all about a difference in presentation. And as far as I can see it completely dodges the kinds of problems you're talking about. However, the added field creates its own semantic problems. As an example, is 2014-08-28 18:00:00-04 the same as or different from 2014-08-28 17:00:00-05? If they're different, which one is less? If they're the same, what's the point of storing the extra field? And do you really like equal values that behave differently, not only for I/O but for operations such as EXTRACT()? (I imagine the SQL spec gives a ruling on this issue, which I'm too lazy to look up; my point is that whatever they did, it will be the wrong thing for some use-cases.) I think (based on your earlier post) that we agree that would have been better to implement the type named in the standard according to the definition given in the standard (and to use a new type name for the more generally useful behavior PostgreSQL currently uses for timestamptz), but that it's too late to go there now. QUEL's relational calculus is superior in just about every way to SQL, but if we're going to go with the standard because it *is* a standard, then let's freaking *do* it and extend where beneficial. Otherwise, why switch from QUEL in the first place? It was actually rather disappointing to hear that we had a conforming implementation and changed away from it circa the 7.2 release; and even more disturbing to hear that decision is still being defended on the grounds that there's no point providing standard conforming behavior if we can think of different behavior that we feel is more useful. We should have both. 
-- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
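A tiny model of the spec's design, using a hypothetical struct for illustration only: storing the UTC instant together with the SHH:MM offset makes Tom's 18:00-04 vs 17:00-05 question concrete. Comparing only the instant makes the two values equal, yet the stored offsets differ, so "equal" values could still behave differently for I/O or EXTRACT:

```c
#include <stdint.h>

/* Spec-style "timestamp with time zone": a UTC instant plus a stored
 * numeric offset, as the discussion describes.  Hypothetical layout. */
typedef struct SpecTimestampTz
{
    int64_t utc_seconds;     /* point on the time line, in UTC */
    int16_t offset_minutes;  /* stored display offset, e.g. -240 for -04 */
} SpecTimestampTz;

/* Build from local wall-clock seconds and an offset: the UTC instant
 * is the local time minus the offset. */
SpecTimestampTz
spec_tstz(int64_t local_seconds, int16_t offset_minutes)
{
    SpecTimestampTz ts;

    ts.utc_seconds = local_seconds - (int64_t) offset_minutes * 60;
    ts.offset_minutes = offset_minutes;
    return ts;
}

/* Equality on the instant alone: this is the choice that makes the
 * extra stored field semantically awkward. */
int
spec_tstz_same_instant(SpecTimestampTz a, SpecTimestampTz b)
{
    return a.utc_seconds == b.utc_seconds;
}
```

Note also that a numeric offset cannot represent a real-world zone with DST rules, which is the other gap in the spec's design mentioned upthread.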
[HACKERS] Re: Why data of timestamptz does not store value of timezone passed to it?
On Fri, Aug 29, 2014 at 4:03 PM, Kevin Grittner kgri...@ymail.com wrote: It was actually rather disappointing to hear that we had a conforming implementation and changed away from it circa the 7.2 release; and even more disturbing to hear that decision is still being defended on the grounds that there's no point providing standard conforming behavior if we can think of different behavior that we feel is more useful. We should have both. I don't think the behaviour was standards-compliant in 7.2 either. For that matter, I can't think of any circumstance where the standard behaviour is useful. There's absolutely no way to write correct code using it. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Misleading error message in logical decoding for binary plugins
On Fri, Aug 29, 2014 at 11:39 PM, Andres Freund and...@2ndquadrant.com wrote: Yes. The output plugin declares whether it requires the *output method* to support binary data. pg_logical_slot_peek_changes *can not* support binary data because it outputs data as text. pg_logical_slot_peek_binary_changes *can* support binary data because it returns bytea (and thus it also can output text, because that's essentially a subset of binary data). Thanks for the explanations. This would merit more detail in the docs, like what those two modes actually do and what differences, advantages and disadvantages the user can expect when choosing one or the other before starting decoding... -- Michael
Re: [HACKERS] Why data of timestamptz does not store value of timezone passed to it?
Kevin Grittner kgri...@ymail.com writes: It was actually rather disappointing to hear that we had a conforming implementation and changed away from it circa the 7.2 release; That is not the case. The existing implementation is work that Tom Lockhart did around 6.3 or so. It was called timestamp at the time, and was renamed to timestamp with time zone in 7.2, in order to make room for timestamp without time zone (which I think *is* spec compliant or close enough). That was probably an unfortunate choice; but at no time was there code in PG that did what the spec says timestamp with time zone should do. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: Why data of timestamptz does not store value of timezone passed to it?
On Fri, Aug 29, 2014 at 11:12 AM, Greg Stark wrote: On Fri, Aug 29, 2014 at 4:03 PM, Kevin Grittner wrote: It was actually rather disappointing to hear that we had a conforming implementation and changed away from it circa the 7.2 release; and even more disturbing to hear that decision is still being defended on the grounds that there's no point providing standard conforming behavior if we can think of different behavior that we feel is more useful. We should have both. I don't think the behaviour was standards-compliant in 7.2 either. For that matter, I can't think of any circumstance where the standard behaviour is useful. There's absolutely no way to write correct code using it. And forcing people to change their data types to migrate to PostgreSQL is undesirable IF our type is usefully equivalent to others in the majority of situations - though I don't know if that is actually the case. David J.
[HACKERS] Re: Why data of timestamptz does not store value of timezone passed to it?
On Fri, Aug 29, 2014 at 4:19 PM, David G Johnston david.g.johns...@gmail.com wrote: And forcing people to change their data types to migrate to PostgreSQL is undesirable IF our type is usefully equivalent to others in the majority of situations - though I don't know if that is actually the case. You know... I wonder if we have enough leverage in the standards committee these days that we could usefully push that direction instead of being pushed around. The standard timestamp with time zone is not very useful and I'm sure the standards committee wouldn't mind having a useful point-in-time data type. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] On partitioning
Prompted by a comment in the UPDATE/LIMIT thread, I saw Marko Tiikkaja reference Tom's post http://www.postgresql.org/message-id/1598.1399826...@sss.pgh.pa.us which mentions the possibility of a different partitioning implementation than what we have so far. As it turns out, I've been thinking about partitioning recently, so I thought I would share what I'm thinking so that others can poke holes. My intention is to try to implement this as soon as possible.

Declarative partitioning
------------------------

In this design, partitions are first-class objects, not normal tables in inheritance hierarchies. There are no pg_inherits entries involved at all. Partitions are a physical implementation detail. Therefore we do not allow the owner to be changed, or permissions to be granted directly to partitions; all these operations happen to the parent relation instead.

System Catalogs
---------------

In pg_class we have two additional relkind values:

* relkind RELKIND_PARTITIONED_REL 'P' indicates a partitioned relation. It is used to indicate a parent table, i.e. one the user can directly address in DML queries. Such relations DO NOT have their own storage. These use the same rules as regular tables for access privileges, ownership and so on.

* relkind RELKIND_PARTITION 'p' indicates a partition within a partitioned relation (its parent). These cannot be addressed directly in DML queries and only limited DDL support is provided. They don't have their own pg_attribute entries either and therefore they are always identical in column definitions to the parent relation. Since they are not accessible directly, there is no need for ACL considerations; the parent relation's owner is the owner, and grants are applied to the parent relation only.

XXX --- is there a need for a partition having different column default values than its parent relation?

Partitions are numbered sequentially, normally from 1 onwards; but it is valid to have negative partition numbers and 0. Partitions don't have names (except automatically generated ones for pg_class.relname, but they are unusable in DDL).

Each partition is assigned an Expression that receives a tuple and returns boolean. This expression returns true if a given tuple belongs in it, false otherwise. If a tuple for a partitioned relation is run through the expressions of all partitions, exactly one should return true. If none returns true, it might be because the partition has not been created yet. A user-facing error is raised in this case. (Rationale: if the user creates a partitioned rel and there is no partition that accepts some given tuple, it's the user's fault.)

Additionally, each partitioned relation may have a master expression. This receives a tuple and returns an integer, which corresponds to the number of the partition it belongs in.

There are two new system catalogs:

pg_partitioned_rel -- (prelrelid, prelexpr)
pg_partition -- (partrelid, partseq, partexpr, partoverflow)

For partitioned rels that have prelexpr, we run that expression and obtain the partition number; as a crosscheck we run partexpr and ensure it returns true. For partitioned rels that don't have prelexpr, we run partexpr for each partition in turn until one returns true. This means that for a properly set up partitioned table, we need to run a single expression on a tuple to find out which partition the tuple belongs in.

Per-partition expressions are formed as each partition is created, and are based on the user-supplied partitioning criterion. Master expressions are formed at relation creation time. (XXX Can we change the master expression later, as a result of some ALTER command? Presumably this would mean that all partitions might need to be rewritten.)

Triggers
--------

(These are user-defined triggers, not partitioning triggers. In fact there are no partitioning triggers at all.)

Triggers are attached to the parent relation, not to the specific partition. When a trigger function runs on a tuple inserted, updated or modified on a partition, the data received by the trigger function makes it appear that the tuple belongs to the parent relation. There is no need to let the trigger know which partition the tuple went in or came from. XXX is there a need to give it the partition number that the tuple went in?

Syntax
------

CREATE TABLE xyz ( ... ) PARTITION BY RANGE ( a_expr )

This creates the main table only: no partitions are created automatically. We do not support other types of partitioning at this stage. We will implement these later. We do not currently support ALTER TABLE/PARTITION BY (i.e. partition a table after the fact). We leave this as a future improvement.

Allowed actions on RELKIND_PARTITIONED_REL:

* ALTER TABLE xyz CREATE PARTITION n
  This creates a new partition

* ALTER TABLE xyz CREATE PARTITION FOR value
  Same as above; the partition number is determined automatically.

Allowed actions on a RELKIND_PARTITION:
Re: [HACKERS] On partitioning
Alvaro Herrera alvhe...@2ndquadrant.com writes: [ partition sketch ] In this design, partitions are first-class objects, not normal tables in inheritance hierarchies. There are no pg_inherits entries involved at all. Hm, actually I'd say they are *not* first class objects; the problem with the existing design is exactly that child tables *are* first class objects. This is merely a terminology quibble though. * relkind RELKIND_PARTITION 'p' indicates a partition within a partitioned relation (its parent). These cannot be addressed directly in DML queries and only limited DDL support is provided. They don't have their own pg_attribute entries either and therefore they are always identical in column definitions to the parent relation. Not sure that not storing the pg_attribute rows is a good thing; but that's something that won't be clear till you try to code it. Each partition is assigned an Expression that receives a tuple and returns boolean. This expression returns true if a given tuple belongs into it, false otherwise. -1, in fact minus a lot. One of the core problems of the current approach is that the system, particularly the planner, hasn't got a lot of insight into exactly what the partitioning scheme is in a partitioned table built on inheritance. If you allow the partitioning rule to be a black box then that doesn't get any better. I want to see a design wherein the system understands *exactly* what the partitioning behavior is. I'd start with supporting range-based partitioning explicitly, and maybe we could add other behaviors such as hashing later. In particular, there should never be any question at all that there is exactly one partition that a given row belongs to, not more, not less. You can't achieve that with a set of independent filter expressions; a meta-rule that says exactly one of them should return true is an untrustworthy band-aid. 
(This does not preclude us from mapping the tuple through the partitioning rule and finding that the corresponding partition doesn't currently exist. I think we could view the partitioning rule as a function from tuples to partition numbers, and then we look in pg_class to see if such a partition exists.) Additionally, each partitioned relation may have a master expression. This receives a tuple and returns an integer, which corresponds to the number of the partition it belongs into. I guess this might be the same thing I'm arguing for, except that I say it is not optional but is *the* way you define the partitioning. And I don't really want black-box expressions even in this formulation. If you're looking for arbitrary partitioning rules, you can keep on using inheritance. The point of inventing partitioning, IMHO, is for the system to have a lot more understanding of the behavior than is possible now. As an example of the point I'm trying to make, the planner should be able to discard range-based partitions that are eliminated by a WHERE clause with something a great deal cheaper than the theorem prover it currently has to use for the purpose. Black-box partitioning rules not only don't improve that situation, they actually make it worse. Other than that, this sketch seems reasonable ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Fri, Aug 29, 2014 at 4:56 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: For scan plans, we need to prepare Append lists which are used to scan for tuples in a partitioned relation. We can setup fake constraint expressions based on the partitioning expressions, which let the planner discard unnecessary partitions by way of constraint exclusion. (In the future we might be interested in creating specialized plan and execution nodes that know more about partitioned relations, to avoid creating useless Append trees only to prune them later.) This seems like a big part of the point of doing first class partitions. If we have an equivalence class that specifies a constant for all the variables in the master expression then we should be able to look up the corresponding partition as a O(1) operation (or O(log(n) if it involves searching a list) rather than iterating over all the partitions and trying to prove lots of exclusions. We might even need a btree index to store the partitions so that we can handle scaling up and still find the corresponding partitions quickly. And I think there are still unanswered questions about indexes. You seem to be implying that users would be free to create any index they want on any partition. It's probably going to be necessary to support creating an index on the partitioned table which would create an index on each of the partitions and, crucially, automatically create corresponding indexes whenever new partitions are added. That said, everything that's here sounds pretty spot-on to me. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
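The O(log n) lookup Greg describes could look something like the following for range partitioning on an integer key. This is a hypothetical sketch, not PostgreSQL code: with the partitioning scheme known to the system, routing a tuple is a binary search over the sorted lower bounds instead of evaluating every partition's filter expression in turn:

```c
#include <stddef.h>

/* Route a key to a range partition in O(log n).  bounds[] holds the
 * sorted inclusive lower bound of each of nparts partitions; returns
 * the 0-based partition number, or -1 if no partition accepts the key
 * (i.e. that partition has not been created yet). */
int
range_partition_for(const int *bounds, int nparts, int key)
{
    int lo = 0;
    int hi = nparts;

    if (nparts == 0 || key < bounds[0])
        return -1;
    /* invariant: bounds[lo] <= key, and key < bounds[hi] (or hi == nparts) */
    while (hi - lo > 1)
    {
        int mid = lo + (hi - lo) / 2;

        if (bounds[mid] <= key)
            lo = mid;
        else
            hi = mid;
    }
    return lo;
}
```

The same structure also supports the planner side cheaply: a WHERE clause with a constant key maps to exactly one partition by the same search, with no theorem prover involved.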
Re: [HACKERS] On partitioning
2014-08-29 18:35 GMT+02:00 Tom Lane t...@sss.pgh.pa.us: Alvaro Herrera alvhe...@2ndquadrant.com writes: [ partition sketch ] In this design, partitions are first-class objects, not normal tables in inheritance hierarchies. There are no pg_inherits entries involved at all. Hm, actually I'd say they are *not* first class objects; the problem with the existing design is exactly that child tables *are* first class objects. This is merely a terminology quibble though. +1 ... even a few partitions slow down planning significantly, from 1ms to 20ms, which is an issue for very simple queries over a PK. * relkind RELKIND_PARTITION 'p' indicates a partition within a partitioned relation (its parent). These cannot be addressed directly in DML queries and only limited DDL support is provided. They don't have their own pg_attribute entries either and therefore they are always identical in column definitions to the parent relation. Not sure that not storing the pg_attribute rows is a good thing; but that's something that won't be clear till you try to code it. Each partition is assigned an Expression that receives a tuple and returns boolean. This expression returns true if a given tuple belongs into it, false otherwise. -1, in fact minus a lot. One of the core problems of the current approach is that the system, particularly the planner, hasn't got a lot of insight into exactly what the partitioning scheme is in a partitioned table built on inheritance. If you allow the partitioning rule to be a black box then that doesn't get any better. I want to see a design wherein the system understands *exactly* what the partitioning behavior is. I'd start with supporting range-based partitioning explicitly, and maybe we could add other behaviors such as hashing later. In particular, there should never be any question at all that there is exactly one partition that a given row belongs to, not more, not less. 
You can't achieve that with a set of independent filter expressions; a meta-rule that says exactly one of them should return true is an untrustworthy band-aid. (This does not preclude us from mapping the tuple through the partitioning rule and finding that the corresponding partition doesn't currently exist. I think we could view the partitioning rule as a function from tuples to partition numbers, and then we look in pg_class to see if such a partition exists.) Additionally, each partitioned relation may have a master expression. This receives a tuple and returns an integer, which corresponds to the number of the partition it belongs into. I guess this might be the same thing I'm arguing for, except that I say it is not optional but is *the* way you define the partitioning. And I don't really want black-box expressions even in this formulation. If you're looking for arbitrary partitioning rules, you can keep on using inheritance. The point of inventing partitioning, IMHO, is for the system to have a lot more understanding of the behavior than is possible now. As an example of the point I'm trying to make, the planner should be able to discard range-based partitions that are eliminated by a WHERE clause with something a great deal cheaper than the theorem prover it currently has to use for the purpose. Black-box partitioning rules not only don't improve that situation, they actually make it worse. Other than that, this sketch seems reasonable ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
Greg Stark st...@mit.edu writes: And I think there are still unanswered questions about indexes. One other interesting thought that occurs to me: are we going to support UPDATEs that cause a row to belong to a different partition? If so, how are we going to handle the update chain links? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
Tom Lane wrote: Greg Stark st...@mit.edu writes: And I think there are still unanswered questions about indexes. One other interesting thought that occurs to me: are we going to support UPDATEs that cause a row to belong to a different partition? If so, how are we going to handle the update chain links? Bah, I didn't mention it? My current thinking is that it would be disallowed; if you have chosen your partitioning key well enough it shouldn't be necessary. As a workaround you can always DELETE/INSERT. Maybe we can allow it later, but for a first cut this seems more than good enough. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
Alvaro Herrera alvhe...@2ndquadrant.com writes: Tom Lane wrote: One other interesting thought that occurs to me: are we going to support UPDATEs that cause a row to belong to a different partition? If so, how are we going to handle the update chain links? Bah, I didn't mention it? My current thinking is that it would be disallowed; if you have chosen your partitioning key well enough it shouldn't be necessary. As a workaround you can always DELETE/INSERT. Maybe we can allow it later, but for a first cut this seems more than good enough. Hm. I certainly agree that it's a case that could be disallowed for a first cut, but it'd be nice to have some clue about how we might allow it eventually. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 2014-08-29 13:15:16 -0400, Tom Lane wrote: Alvaro Herrera alvhe...@2ndquadrant.com writes: Tom Lane wrote: One other interesting thought that occurs to me: are we going to support UPDATEs that cause a row to belong to a different partition? If so, how are we going to handle the update chain links? Bah, I didn't mention it? My current thinking is that it would be disallowed; if you have chosen your partitioning key well enough it shouldn't be necessary. As a workaround you can always DELETE/INSERT. Maybe we can allow it later, but for a first cut this seems more than good enough. Hm. I certainly agree that it's a case that could be disallowed for a first cut, but it'd be nice to have some clue about how we might allow it eventually. Not pretty, but we could set t_ctid to some 'magic' value when switching partitions. Everything chasing ctid chains could then error out when hitting an invisible row with such a t_ctid. The use cases for doing such updates really are more maintenance-style commands, so it's possibly not too bad from a usability POV :( Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Inverse of pg_get_serial_sequence?
Hi, We have pg_get_serial_sequence() mapping (relation, column) to the sequence. What I'm missing right now is the inverse, i.e. given a sequence, tell me the owner. describe.c has a query for that, and it's not too hard to write, but it still seems 'unfriendly' not to provide it. Does anybody dislike adding a function for that? I can't really think of a good name (not that pg_get_serial_sequence is well named). pg_get_serial_sequence_owner(serial regclass, OUT rel regclass, OUT colname name) maybe? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
Andres Freund and...@2ndquadrant.com writes: On 2014-08-29 13:15:16 -0400, Tom Lane wrote: Hm. I certainly agree that it's a case that could be disallowed for a first cut, but it'd be nice to have some clue about how we might allow it eventually. Not pretty, but we could set t_ctid to some 'magic' value when switching partitions. Everything chasing ctid chains could then error out when hitting a invisible row with such a t_ctid. An actual fix would presumably involve adding a partition number to the ctid chain field in tuples in partitioned tables. The reason I bring it up now is that we'd have to commit to doing that (or at least leaving room for it) in the first implementation, if we don't want to have an on-disk compatibility break. There is certainly room to argue that the value of this capability isn't worth the disk space this solution would eat. But we should have that argument while the option is still feasible ... The usecases for doing such updates really are more maintenance style commands, so it's possibly not too bad from a usability POV :( I'm afraid that might just be wishful thinking. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [v9.5] Custom Plan API
On Wed, Aug 27, 2014 at 6:51 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: I'd like to follow this direction, and start stripping the DDL support. ...please make it so. The attached patch eliminates DDL support. Instead of the new CREATE CUSTOM PLAN PROVIDER statement, it adds an internal function, register_custom_scan_provider, that takes a custom plan provider name and a callback function to add an alternative scan path (in the form of a CustomPath) while the query planner is finding the cheapest path to scan the target relation. Also, the documentation is revised according to the latest design. Everything else keeps the previous design. Comments: 1. There seems to be no reason for custom plan nodes to have MultiExec support; I think this is an area where extensibility is extremely unlikely to work out. The MultiExec mechanism is really only viable between closely-cooperating nodes, like Hash and HashJoin, or BitmapIndexScan, BitmapAnd, BitmapOr, and BitmapHeapScan; and arguably those things could have been written as a single, more complex node. Are we really going to want to support a custom plan that can substitute for a Hash or BitmapAnd node? I really doubt that's very useful. 2. This patch is still sort of on the fence about whether we're implementing custom plans (of any type) or custom scans (thus, of some particular relation). I previously recommended that we confine ourselves initially to the task of adding custom *scans* and leave the question of other kinds of custom plan nodes to a future patch. After studying the latest patch, I'm inclined to suggest a slightly revised strategy. This patch is really adding THREE kinds of custom objects: CustomPlanState, CustomPlan, and CustomPath. CustomPlanState inherits from ScanState, so it is not really a generic CustomPlanState, but specifically a CustomScanState; likewise, CustomPlan inherits from Scan, and is therefore a CustomScan, not a generic CustomPlan. But CustomPath is different: it's just a Path.
Even if we only have the hooks to inject CustomPaths that are effectively scans at this point, I think that part of the infrastructure could be somewhat generic. Perhaps eventually we have CustomPath which can generate either CustomScan or CustomJoin which in turn could generate CustomScanState and CustomJoinState. For now, I propose that we rename CustomPlan and CustomPlanState to CustomScan and CustomScanState, because that's what they are; but that we leave CustomPath as-is. For ease of review, I also suggest splitting this into a series of three patches: (1) add support for CustomPath; (2) add support for CustomScan and CustomScanState; (3) ctidscan. 3. Is it really a good idea to invoke custom scan providers for RTEs of every type? It's pretty hard to imagine that a custom scan provider can do anything useful with, say, RTE_VALUES. Maybe an accelerated scan of RTE_CTE or RTE_SUBQUERY is practical somehow, but even that feels like an awfully big stretch. At least until clear use cases emerge, I'd be inclined to restrict this to RTE_RELATION scans where rte->relkind != RELKIND_FOREIGN_TABLE; that is, put the logic in set_plain_rel_pathlist() rather than set_rel_pathlist(). (We might even want to consider whether the hook in set_plain_rel_pathlist() ought to be allowed to inject a non-custom plan; e.g. substitute a scan of relation B for a scan of relation A. For example, imagine that B contains all rows from A that satisfy some predicate. This could even be useful for foreign tables; e.g. substitute a scan of a local copy of a foreign table for a reference to that table. But I put all of these ideas in parentheses because they're only good ideas to the extent that they don't sidetrack us too much.) 4. Department of minor nitpicks. You've got a random 'xs' in the comments for ExecSupportsBackwardScan.
And, in contrib/ctidscan, ctidscan_path_methods, ctidscan_plan_methods, and ctidscan_exec_methods can have static initializers; there's no need to initialize them at run time in _PG_init(). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
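The static-initializer nitpick in point 4 can be illustrated with a small sketch. The struct and callbacks here are made-up stand-ins, not the actual CustomPath/CustomScan method tables from the patch:

```c
#include <stddef.h>

/* A made-up methods table, standing in for e.g. ctidscan_exec_methods. */
typedef struct DemoScanMethods
{
	const char *name;
	int			(*begin_scan) (void *state);
	int			(*end_scan) (void *state);
} DemoScanMethods;

static int demo_begin(void *state) { return 1; }
static int demo_end(void *state) { return 0; }

/*
 * Every field is known at compile time, so the table can use a static
 * initializer; there is no need for field assignments at run time in
 * a _PG_init()-style hook.
 */
static const DemoScanMethods demo_scan_methods = {
	"demoscan",
	demo_begin,
	demo_end,
};

const DemoScanMethods *
demo_get_methods(void)
{
	return &demo_scan_methods;
}
```

Besides saving a few lines, the static form lets the table be `const`, so it lives in read-only storage and cannot be clobbered after load.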
Re: [HACKERS] On partitioning
Tom Lane wrote: Alvaro Herrera alvhe...@2ndquadrant.com writes: Tom Lane wrote: One other interesting thought that occurs to me: are we going to support UPDATEs that cause a row to belong to a different partition? If so, how are we going to handle the update chain links? Bah, I didn't mention it? My current thinking is that it would be disallowed; if you have chosen your partitioning key well enough it shouldn't be necessary. As a workaround you can always DELETE/INSERT. Maybe we can allow it later, but for a first cut this seems more than good enough. Hm. I certainly agree that it's a case that could be disallowed for a first cut, but it'd be nice to have some clue about how we might allow it eventually. I hesitate to suggest this, but we have free flag bits in MultiXactStatus. We could use a specially marked multixact member to indicate the OID of the target relation; perhaps set an infomask bit to indicate that this has happened. Of course, no HOT updates are possible so I think it's okay from a heap_prune_chain perspective. This abuses the knowledge that OIDs and XIDs are both 32 bits long. Since nowhere else do we have the space necessary to store the longer data that a cross-partition update would require, I don't see anything else ATM. (For a moment I thought about abusing combo CIDs, but that doesn't work because this needs to be persistent and visible from other backends, neither of which is a quality of combocids.) -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 2014-08-29 13:29:19 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-08-29 13:15:16 -0400, Tom Lane wrote: Hm. I certainly agree that it's a case that could be disallowed for a first cut, but it'd be nice to have some clue about how we might allow it eventually. Not pretty, but we could set t_ctid to some 'magic' value when switching partitions. Everything chasing ctid chains could then error out when hitting a invisible row with such a t_ctid. An actual fix would presumably involve adding a partition number to the ctid chain field in tuples in partitioned tables. The reason I bring it up now is that we'd have to commit to doing that (or at least leaving room for it) in the first implementation, if we don't want to have an on-disk compatibility break. Right. Just adding it unconditionally doesn't sound feasible to me. Our per-row overhead is already too large. And it doesn't sound fun to have the first-class partitions use a different heap tuple format than plain relations. What we could do is to add some sort of 'jump' tuple when moving a tuple from one relation to another. So, when updating a tuple between partitions we add another in the old partition with xmin_jump = xmax_jump = xmax_old and have the jump tuple's content point to the new relation. Far from pretty, but it'd only matter overhead wise when used. The usecases for doing such updates really are more maintenance style commands, so it's possibly not too bad from a usability POV :( I'm afraid that might just be wishful thinking. I admit that you might very well be right there :( Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench throttling latency limit
Hello Heikki,

[...] I would be fine with both. After giving it some thought, ISTM better to choose consistency over intuition, and have latency under throttling always defined wrt the scheduled start time and not the actual start time, even if having a latency of 1 ms for an OLTP load might seem surprising to some. The other one can be computed by subtracting the average lag time.

I attached a v6, which is a consolidated patch of v5 + the small update for the latency definition.

-- Fabien.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 2f7d80e..40427a3 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -141,6 +141,18 @@ double sample_rate = 0.0;
 int64 throttle_delay = 0;
 
 /*
+ * Transactions which take longer than this limit are counted as late
+ * and reported as such, although they are completed anyway.
+ *
+ * When under throttling: execution time slots which are more than
+ * this late (in us) are simply skipped, and the corresponding transaction
+ * is counted as such... it is not even started;
+ * otherwise above the limit transactions are counted as such, with the latency
+ * measured wrt the transaction schedule, not its actual start.
+ */
+int64 latency_limit = 0;
+
+/*
  * tablespace selection
  */
 char *tablespace = NULL;
@@ -238,6 +250,8 @@ typedef struct
 	int64 throttle_trigger;	/* previous/next throttling (us) */
 	int64 throttle_lag;		/* total transaction lag behind throttling */
 	int64 throttle_lag_max;	/* max transaction lag */
+	int64 throttle_latency_skipped;	/* lagging transactions skipped */
+	int64 latency_late;		/* late transactions */
 } TState;
 
 #define INVALID_THREAD ((pthread_t) 0)
@@ -250,6 +264,8 @@ typedef struct
 	int64 sqlats;
 	int64 throttle_lag;
 	int64 throttle_lag_max;
+	int64 throttle_latency_skipped;
+	int64 latency_late;
 } TResult;
 
 /*
@@ -367,6 +383,10 @@ usage(void)
 	"  -f, --file=FILENAME      read transaction script from FILENAME\n"
 	"  -j, --jobs=NUM           number of threads (default: 1)\n"
 	"  -l, --log                write transaction times to log file\n"
+	"  -L, --limit=NUM          count transactions lasting more than NUM ms.\n"
+	"                           under throttling (--rate), transactions behind schedule\n"
+	"                           more than NUM ms are skipped, and those finishing more\n"
+	"                           than NUM ms after their scheduled start are counted.\n"
 	"  -M, --protocol=simple|extended|prepared\n"
 	"                           protocol for submitting queries (default: simple)\n"
 	"  -n, --no-vacuum          do not run VACUUM before tests\n"
@@ -1016,6 +1036,24 @@ top:
 		thread->throttle_trigger += wait;
 
+		if (latency_limit)
+		{
+			instr_time	now;
+			int64		now_us;
+
+			INSTR_TIME_SET_CURRENT(now);
+			now_us = INSTR_TIME_GET_MICROSEC(now);
+			while (thread->throttle_trigger < now_us - latency_limit)
+			{
+				/* if too far behind, this slot is skipped, and we
+				 * iterate till the next nearly on time slot.
+				 */
+				int64 wait = (int64) (throttle_delay *
+					1.00055271703 * -log(getrand(thread, 1, 10000) / 10000.0));
+
+				thread->throttle_trigger += wait;
+				thread->throttle_latency_skipped++;
+			}
+		}
+
 		st->until = thread->throttle_trigger;
 		st->sleeping = 1;
 		st->throttling = true;
@@ -1080,15 +1118,31 @@ top:
 			thread->exec_count[cnum]++;
 		}
 
-		/* transaction finished: record latency under progress or throttling */
-		if ((progress || throttle_delay) && commands[st->state + 1] == NULL)
+		/* transaction finished: record latency under progress or throttling,
+		 * or if a latency limit is set
+		 */
+		if ((progress || throttle_delay || latency_limit) &&
+			commands[st->state + 1] == NULL)
 		{
 			instr_time	diff;
 			int64		latency;
 
 			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, st->txn_begin);
-			latency = INSTR_TIME_GET_MICROSEC(diff);
+
+			if (throttle_delay)
+			{
+				/* Under throttling, compute latency wrt the initial schedule,
+				 * not the actual transaction start.
+				 */
+				int64 now = INSTR_TIME_GET_MICROSEC(diff);
+
+				latency = now - thread->throttle_trigger;
+			}
+			else
+			{
+				INSTR_TIME_SUBTRACT(diff, st->txn_begin);
+				latency = INSTR_TIME_GET_MICROSEC(diff);
+			}
+
 			st->txn_latencies += latency;
 
 			/*
@@ -1099,6 +1153,11 @@ top:
 			 * would take 256 hours.
 			 */
 			st->txn_sqlats += latency * latency;
+
+			/* record over-the-limit transactions if needed.
+			 */
+			if (latency_limit && latency > latency_limit)
+				thread->latency_late++;
 		}
 
 		/*
@@ -1294,7 +1353,7 @@ top:
 	}
 
 	/* Record transaction start time under logging, progress or throttling */
-	if ((logfile || progress || throttle_delay) && st->state == 0)
+	if ((logfile || progress || throttle_delay || latency_limit) && st->state == 0)
 		INSTR_TIME_SET_CURRENT(st->txn_begin);
 
 	/* Record statement start time if per-command latencies are requested */
@@
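A reduced sketch of the two latency definitions being weighed above, in plain C (an editor's illustration, not the patch's code): all times are in microseconds, `scheduled_start` plays the role of `throttle_trigger`, and `actual_start` the role of `txn_begin`.

```c
typedef long long usec_t;

/*
 * Under throttling, latency is measured against the scheduled start, so it
 * includes the lag between the schedule and the actual start; otherwise it
 * is the plain execution time.  The two definitions differ exactly by the
 * lag, which is why one can be recovered from the other by subtracting the
 * average lag time, as noted above.
 */
usec_t
txn_latency(int throttled, usec_t scheduled_start, usec_t actual_start,
			usec_t finish)
{
	if (throttled)
		return finish - scheduled_start;	/* schedule-relative latency */
	return finish - actual_start;			/* plain execution time */
}
```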
[HACKERS] clang warning on master
Hi

I see a lot of warnings

[pavel@localhost postgresql]$ make all | grep warning
exec.c:280:2: warning: implicitly declaring library function 'strlcpy' with type 'unsigned long (char *, const char *, unsigned long)'
        strlcpy(link_buf, fname, sizeof(link_buf));
        ^
exec.c:280:2: note: please include the header <string.h> or explicitly provide a declaration for 'strlcpy'
1 warning generated.
exec.c:280:2: warning: implicitly declaring library function 'strlcpy' with type 'unsigned long (char *, const char *, unsigned long)'
        strlcpy(link_buf, fname, sizeof(link_buf));
        ^
exec.c:280:2: note: please include the header <string.h> or explicitly provide a declaration for 'strlcpy'
1 warning generated.
path.c:188:3: warning: implicitly declaring library function 'strlcpy' with type 'unsigned long (char *, const char *, unsigned long)'
        strlcpy(ret_path, head, MAXPGPATH);
        ^
path.c:188:3: note: please include the header <string.h> or explicitly provide a declaration for 'strlcpy'
1 warning generated.
path.c:188:3: warning: implicitly declaring library function 'strlcpy' with type 'unsigned long (char *, const char *, unsigned long)'
        strlcpy(ret_path, head, MAXPGPATH);
        ^
path.c:188:3: note: please include the header <string.h> or explicitly provide a declaration for 'strlcpy'
1 warning generated.
thread.c:77:2: warning: implicitly declaring library function 'strlcpy' with type 'unsigned long (char *, const char *, unsigned long)'
        strlcpy(strerrbuf, strerror(errnum), buflen);
        ^
thread.c:77:2: note: please include the header <string.h> or explicitly provide a declaration for 'strlcpy'
1 warning generated.
pgtz.c:53:2: warning: implicitly declaring library function 'strlcpy' with type 'unsigned long (char *, const char *, unsigned long)'
        strlcpy(tzdir + strlen(tzdir), "/timezone", MAXPGPATH - strlen(tzdir));
        ^
pgtz.c:53:2: note: please include the header <string.h> or explicitly provide a declaration for 'strlcpy'
1 warning generated.
xlog.c:5627:4: warning: implicitly declaring library function 'strlcpy' with type 'unsigned long (char *, const char *, unsigned long)'
        strlcpy(recoveryStopName, recordRestorePointData->rp_name, MAXFNAMELEN);

Regards

Pavel
Re: [HACKERS] On partitioning
Andres Freund and...@2ndquadrant.com writes: On 2014-08-29 13:29:19 -0400, Tom Lane wrote: An actual fix would presumably involve adding a partition number to the ctid chain field in tuples in partitioned tables. The reason I bring it up now is that we'd have to commit to doing that (or at least leaving room for it) in the first implementation, if we don't want to have an on-disk compatibility break. What we could do is to add some sort of 'jump' tuple when moving a tuple from one relation to another. So, when updating a tuple between partitions we add another in the old partition with xmin_jump = xmax_jump = xmax_old and have the jump tuple's content point to the new relation. Hm, that might work. It sounds more feasible than Alvaro's suggestion of abusing cmax --- I don't think that field is free for use in this context. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] clang warning on master
Pavel Stehule pavel.steh...@gmail.com writes: I see a lot of warnings [pavel@localhost postgresql]$ make all | grep warning exec.c:280:2: warning: implicitly declaring library function 'strlcpy' with type 'unsigned long (char *, const char *, unsigned long)' strlcpy(link_buf, fname, sizeof(link_buf)); ^ exec.c:280:2: note: please include the header <string.h> or explicitly provide a declaration for 'strlcpy' 1 warning generated. [ raised eyebrow... ] exec.c certainly includes <string.h> already (via c.h). I think you should complain to your compiler vendor. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On 08/29/2014 07:15 PM, Tom Lane wrote: Alvaro Herrera alvhe...@2ndquadrant.com writes: Tom Lane wrote: One other interesting thought that occurs to me: are we going to support UPDATEs that cause a row to belong to a different partition? If so, how are we going to handle the update chain links? Bah, I didn't mention it? My current thinking is that it would be disallowed; if you have chosen your partitioning key well enough it shouldn't be necessary. As a workaround you can always DELETE/INSERT. Maybe we can allow it later, but for a first cut this seems more than good enough. Hm. I certainly agree that it's a case that could be disallowed for a first cut, but it'd be nice to have some clue about how we might allow it eventually. There needs to be some structure that is specific to partitions and not multiple plain tables which would then be used for both update chains and cross-partition indexes (as you seem to imply by jumping from indexes to update chains a few posts back). It would need to replace plain tid (pagenr, tupnr) with triple of (partid, pagenr, tupnr). Cross-partition indexes are especially needed if we want to allow putting UNIQUE constraints on non-partition-key columns. Cheers -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Built-in binning functions
Hi, I am looking at your last patch and I have a few questions and notes. 1. I think the restriction to only numeric types is not necessary - although we can live without the others - there are a lot of non-numeric categories: chars, dates, ... 2. I am still strongly afraid of the search method used - there is no way to check the validity of the input. Did you check how much slower linear searching is? You said you have real data and a real use case. The method used can return a wrong result without any warning, which is acceptable for an extension, but I am not sure it is for a core feature. 3. I am thinking about the name - I don't think varwidth_bucket is correct, in relation to the name width_bucket ... what about range_bucket or scope_bucket? The last patch is very simple; there are no new compilation or regression test issues. Regards Pavel 2014-08-25 16:15 GMT+02:00 Petr Jelinek p...@2ndquadrant.com: Hi, I finally had some time to get back to this. I attached version 3 of the patch, which fixes Tom's complaint about the int8 version by removing it, as it does not seem necessary (the float8 version can handle integers just fine). This patch now basically has just one optimized function for float8 and one for numeric datatypes, just like width_bucket. On 08/07/14 02:14, Tom Lane wrote: Also, I'm not convinced by this business of throwing an error for a NULL array element. Per spec, null arguments to width_bucket() produce a null result, not an error --- shouldn't this flavor act similarly? In any case I think the test needs to use array_contains_nulls() not just ARR_HASNULL. I really think there should be a difference between a NULL array and a NULL inside the array, and that a NULL inside the array is wrong input. I changed the check to array_contains_nulls() though. On 08/07/14 02:14, Tom Lane wrote: Lastly, the spec defines behaviors for width_bucket that allow either ascending or descending buckets.
We could possibly do something similar. I am not sure it's worth it here, as we require the input to be sorted anyway. It might be worthwhile if we decided to do this as an aggregate (since there the input would not be required to be presorted) instead of a function, but I am not sure that's desirable either, as it would limit usability to only the single main use case (GROUP BY and count()). On 20/07/14 11:01, Simon Riggs wrote: On 16 July 2014 20:35, Pavel Stehule pavel.steh...@gmail.com wrote: On 08/07/14 02:14, Tom Lane wrote: I didn't see any discussion of the naming question in this thread. I'd like to propose that it should be just width_bucket(); we can easily determine which function is meant, considering that the SQL-spec variants don't take arrays and don't even have the same number of actual arguments. I did mention in the submission that the names are up for discussion; I am all for naming it just width_bucket. I had this idea too - but I am not sure it is a good idea. The distance between the ANSI SQL width_bucket and our enhancement is larger than in our implementation of median, for example. I can live with both names, but I prefer the current name. So I suggest that we use the more generic function name bin(), with a second choice of discretize() (though that seems annoyingly easy to spell incorrectly). I really don't think bin() is a good choice here; bucket has the same meaning in this context, and bin is often used as shorthand for binary, which would be confusing here. Anyway, I have currently left the name as it is; I would not be against using width_bucket() or even just bucket(), but I am not sure about the discretize() option. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
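The search at issue in Pavel's point 2 can be sketched like this (an editor's illustration, not the patch's code): a binary search over ascending thresholds finds the bucket in O(log n) rather than O(n), but if the thresholds are not actually sorted it silently returns a wrong bucket, which is exactly the input-validity concern raised above.

```c
/*
 * Returns the number of thresholds that are <= operand, i.e. the bucket
 * the operand falls into; 0 means "below the first threshold".  Assumes
 * thresholds[] is sorted ascending -- an unsorted array yields a wrong
 * answer with no warning.
 */
int
bucket_sketch(double operand, const double *thresholds, int nthresh)
{
	int		lo = 0,
			hi = nthresh;

	while (lo < hi)
	{
		int		mid = (lo + hi) / 2;

		if (operand < thresholds[mid])
			hi = mid;			/* operand is below this threshold */
		else
			lo = mid + 1;		/* operand is at or above it */
	}
	return lo;
}
```

Whether the cost of a per-call sortedness check (O(n)) is acceptable, versus documenting the precondition as width_bucket-style behavior, is the open question in the thread.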
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
On 29.8.2014 16:12, Tom Lane wrote: Jeff Davis pg...@j-davis.com writes: I have a new approach to the patch which is to call a callback at each block allocation and have child contexts inherit the callback from their parents. The callback could be defined to simply dereference and increment its argument, which would mean block allocations anywhere in the hierarchy are incrementing the same variable, avoiding the need to traverse the hierarchy. Cute, but it's impossible to believe that a function call isn't going to be *even more* overhead than what you've got now. This could only measure out as a win when the feature is going unused. What about removing the callback per se and just keeping the argument, as it were. That is, a MemoryContext has a field of type size_t * that is normally NULL, but if set to non-NULL, then we increment the pointed-to value for pallocs and decrement for pfrees. How is that going to work with two memory contexts (parent/child), both requesting accounting? If I understand the proposal correctly, the child will either inherit the pointer to the field in the parent (and then won't get separate accounting for its own memory), or will get a private field (and thus won't update the accounting of the parent context). Or am I missing something? regards Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
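Tomas's question can be made concrete with a small sketch (illustrative C only; `CtxSketch` is a stand-in, not the real MemoryContext struct): inheriting the parent's counter pointer means the child cannot be tracked separately, while giving the child its own counter means its blocks no longer count toward the parent.

```c
#include <stddef.h>

typedef struct CtxSketch
{
	size_t	   *acct;			/* NULL means no accounting requested */
} CtxSketch;

/* A child either gets its own counter or inherits the parent's pointer. */
CtxSketch
ctx_create(const CtxSketch *parent, size_t *own_counter)
{
	CtxSketch	c;

	c.acct = own_counter ? own_counter
						 : (parent ? parent->acct : NULL);
	return c;
}

/* Block allocation: one pointer test and one increment, no hierarchy walk. */
void
ctx_alloc_block(CtxSketch *ctx, size_t size)
{
	if (ctx->acct)
		*ctx->acct += size;
}
```

The cheap accounting path is a single dereference, which is the appeal of the scheme; the trade-off is that a context can feed exactly one total, so parent and child cannot both be accounted independently.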
Re: [HACKERS] On partitioning
On 08/29/2014 05:56 PM, Alvaro Herrera wrote: Prompted by a comment in the UPDATE/LIMIT thread, I saw Marko Tiikkaja reference Tom's post http://www.postgresql.org/message-id/1598.1399826...@sss.pgh.pa.us which mentions the possibility of a different partitioning implementation than what we have so far. As it turns out, I've been thinking about partitioning recently, so I thought I would share what I'm thinking so that others can poke holes. My intention is to try to implement this as soon as possible. Declarative partitioning ... Still To Be Designed * Dependency issues * Are indexes/constraints inherited from the parent rel? I'd say mostly yes. There could be some extra constraint-exclusion-type magic for conditional indexes, but the rest should probably come from the main table. And there should be some kind of cross-partition indexes. As a partitioning capability, this can probably wait for version 2. * Multiple keys? Why not. But probably just for hash partitioning. Subpartitioning? Probably not. If you need speed for huge numbers of partitions, use Greg's idea of keeping the partitions in a tree (or just having a partition index). Hash partitioning? At some point, definitely. Also, one thing you left unmentioned is dropping (and perhaps also truncating) a partition. We may still want to do historic data management the same way we do it now, by just getting rid of the whole partition or its data. At some point we may also want to redistribute data between partitions, maybe for the case where we end up with 90% of the data in one partition due to a bad choice of partitioning key or partitioning function. This is again something that is hard now and can therefore be left to a later version. Open Questions -- * What's the syntax to refer to specific partitions within a partitioned table?
We could do TABLE xyz PARTITION n, but for example if in the future we add hash partitioning, we might need some non-integer addressing (OTOH assigning sequential numbers to hash partitions doesn't seem so bad). Discussing with users of other DBMSs partitioning feature, one useful phrase is TABLE xyz PARTITION FOR value. Or more generally TABLE xyz PARTITION FOR/WHERE col1=val1, col2=val2, ...; Cheers -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [TODO] Track number of files ready to be archived in pg_stat_archiver
On 28/08/2014 05:58, Michael Paquier wrote: On Thu, Aug 28, 2014 at 7:37 AM, Julien Rouhaud julien.rouh...@dalibo.com wrote: Attached v2 patch implements this approach. All the work is still done in pg_stat_get_archiver, as I don't think that having a specific function for that information would be really interesting. Please be sure to add that to the next commit fest. This is a feature most welcome within this system view. Regards, I just added it. Thanks. -- Julien Rouhaud http://dalibo.com - http://dalibo.org -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
Hannu Krosing wrote: Cross-partition indexes are especially needed if we want to allow putting UNIQUE constraints on non-partition-key columns. I'm not going to implement cross-partition indexes in the first patch. They are a huge can of worms. -- Álvaro Herrera  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] On partitioning
On Fri, Aug 29, 2014 at 11:56 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: In this design, partitions are first-class objects, not normal tables in inheritance hierarchies. There are no pg_inherits entries involved at all. Whoa. I always assumed that table inheritance was a stepping-stone to real partitioning, and that real partitioning would be built on top of table inheritance. In particular, I assume that (as Itagaki Takahiro's patch did those many years ago) we'd add some metadata somewhere to allow fast tuple routing (for both pruning and inserts/updates). What's the benefit of inventing something new instead? I'm skeptical about your claim that there will be no pg_inherits entries involved at all. You need some way to know which partitions go with which parent table. You can store that many-to-one mapping someplace other than pg_inherits, but it seems to me that that doesn't buy you anything; they're just pg_inherits entries under some other name. Why reinvent that? Each partition is assigned an Expression that receives a tuple and returns boolean. This expression returns true if a given tuple belongs into it, false otherwise. If a tuple for a partitioned relation is run through expressions of all partitions, exactly one should return true. If none returns true, it might be because the partition has not been created yet. A user-facing error is raised in this case (Rationale: if user creates a partitioned rel and there is no partition that accepts some given tuple, it's the user's fault.) Additionally, each partitioned relation may have a master expression. This receives a tuple and returns an integer, which corresponds to the number of the partition it belongs into. I agree with Tom: this is a bad design. In particular, if we want to scale to large numbers of partitions (a principal weakness of the present system) we need the operation of routing a tuple to a partition to be as efficient as possible. 
Range partitioning can be O(lg n) where n is the number of partitions: store a list of the boundaries and binary-search it. List partitioning can be O(lg k) where k is the number of values (which may be more than the number of partitions) via a similar technique. Hash partitioning can be O(1). I'm not sure what other kind of partitioning anybody would want to do, but it's likely that they *won't* want it to be O(n) in the number of partitions. So I'd say have *only* the master expression. But, really, I don't think an expression is the right way to store this; evaluating that repeatedly will, I think, still be too slow. Think about what happens in PL/pgsql: minimizing the number of times that you enter and exit the executor helps performance enormously, even if the expressions are simple enough not to need planning. I think the representation should be more like an array of partition boundaries and the pg_proc OID of a comparator. Per-partition expressions are formed as each partition is created, and are based on the user-supplied partitioning criterion. Master expressions are formed at relation creation time. (XXX Can we change the master expression later, as a result of some ALTER command? Presumably this would mean that all partitions might need to be rewritten.) This is another really important point. If you store an opaque expression mapping partitioning keys to partition numbers, you can't do things like this efficiently. With a more transparent representation, like a sorted array of partition boundaries for range partitioning, or a sorted array of hash values for consistent hashing, you can do things like split and merge partitions efficiently, minimizing rewriting. Planner --- A partitioned relation behaves just like a regular relation for purposes of the planner. XXX do we need special considerations regarding relation size estimation? For scan plans, we need to prepare Append lists which are used to scan for tuples in a partitioned relation.
We can set up fake constraint expressions based on the partitioning expressions, which let the planner discard unnecessary partitions by way of constraint exclusion. So if we're going to do all this, why bother making the partitions anything other than inheritance children? There might be some benefit in having the partitions be some kind of stripped-down object if we could avoid some of these planner gymnastics and get, e.g., efficient run-time partition pruning. But if you're going to generate Append plans and switch ResultRelInfos and stuff just as you would for an inheritance hierarchy, why not just make it an inheritance hierarchy? It seems pretty clear to me that we need partitioned tables to have the same tuple descriptor throughout the relation, for efficient tuple routing and so on. But the other restrictions you're proposing to impose on partitions have no obvious value that I can see. We could have a rule that when you inherit from a partition root, you can only inherit from
Re: [HACKERS] alter user set local_preload_libraries.
On 8/28/14 9:01 AM, Kyotaro HORIGUCHI wrote: I found this issue when trying per-pg_user (role) loading of auto_analyze and some tweaking tool. It is not necessarily set by the user himself, but a way to decide whether to load some module based on the session user would be useful, at least for me :) I think we could just set local_preload_libraries to PGC_USERSET and document that subsequent changes won't take effect. That's the same way session_preload_libraries works. That would avoid inventing another very specialized GUC context. If you're interested in improving this area, I also suggest you read the thread at http://www.postgresql.org/message-id/1349829917.29682.5.ca...@vanquo.pezone.net.
Re: [HACKERS] Switch pg_basebackup to use -X stream instead of -X fetch by default?
On 8/27/14 2:55 AM, Magnus Hagander wrote: I think the easy way of doing that is to just create an xlog.tar file. Since we already create base.tar and possibly n*tablespace.tar, adding one more file shouldn't be a big problem, and would make such an implementation much easier. Would be trivial to do .tar.gz for it as well, just like for the others. That might be a way forward, but for someone who doesn't use tablespaces and just wants an all-in-one backup, this change would make that more cumbersome, because now you'd always have more than one file to deal with. It might be worth considering a mode that combines all those tar files into a super-tar. I'm personally not a user of the tar mode, so I don't know what a typical use would be, though.
Re: [HACKERS] Switch pg_basebackup to use -X stream instead of -X fetch by default?
On Fri, Aug 29, 2014 at 10:34 PM, Peter Eisentraut pete...@gmx.net wrote: On 8/27/14 2:55 AM, Magnus Hagander wrote: I think the easy way of doing that is to just create an xlog.tar file. Since we already create base.tar and possibly n*tablespace.tar, adding one more file shouldn't be a big problem, and would make such an implementation much easier. Would be trivial to do .tar.gz for it as well, just like for the others. That might be a way forward, but for someone who doesn't use tablespaces and just wants an all-in-one backup, this change would make that more cumbersome, because now you'd always have more than one file to deal with. It would in stream mode, which doesn't work at all. I do agree with Robert's suggestion that we shouldn't remove fetch mode right away - but we should change the default. It might be worth considering a mode that combines all those tar files into a super-tar. I'm personally not a user of the tar mode, so I don't know what a typical use would be, though. That would probably be useful, though a lot more difficult when you consider two separate processes writing into the same tarfile. But I agree that the format for the single-tablespace "just gimme a bloody tarfile" case is quite inconvenient today, in that you need a directory and we drop a base.tar in there. We should perhaps try to find a more convenient way for that specific use case, since it probably represents the majority of users. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] Re: Why data of timestamptz does not store value of timezone passed to it?
On 8/29/14 11:27 AM, Greg Stark wrote: You know... I wonder if we have enough leverage in the standards committee these days that we could usefully push that direction instead of being pushed around. The standard timestamp with time zone is not very useful and I'm sure the standards committee wouldn't mind having a useful point-in-time data type. Not likely unless Oracle or IBM have an existing implementation.
Re: [HACKERS] delta relations in AFTER triggers
Heikki Linnakangas hlinnakan...@vmware.com wrote: On 08/28/2014 12:03 AM, Kevin Grittner wrote: Heikki Linnakangas hlinnakan...@vmware.com wrote: I suggest adding a new hook to the ParseState struct, (p_rangevar_hook ?). The planner calls it whenever it sees a reference to a table, and the hook function returns back some sort of placeholder reference to the tuplestore. With variables, the hook returns a Param node, and at execution time, the executor calls the paramFetch hook to fetch the value of the param. For relations/tuplestores, I guess we'll need to invent something like a Param node, but for holding information about the relation. Like your TsrData struct, but without the pointer to the tuplestore. At execution time, in the SPI_execute call, you pass the pointer to the tuplestore in the ParamListInfo struct, like you pass parameter values. Does this make sense? I see your point, but SPI first has to be made aware of the tuplestores and their corresponding names and TupleDesc structures. Does it make sense to keep the SPI_register_tuplestore() and SPI_unregister_tuplestore() functions for the client side of the API, and pass things along to the parse analysis through execution phases using the techniques you describe? Sorry, I didn't understand that. What do you mean by first, and the client side of the API? I don't see any need for the SPI_register_tuplestore() and SPI_unregister_tuplestore() functions if you use the hooks. If we were to go with the hooks as you propose, we would still need to take the information from TriggerData and put it somewhere else for the hook to reference. The hooks are generalized for plpgsql, not just for triggers, and it doesn't seem appropriate for them to be fishing around in the TriggerData structure. And what if we add other sources for tuplestores? The lookup during parse analysis each time an apparent relation name is encountered must be simple and fast.
I want named tuplestores to be easy to use from *all* PLs (for trigger usage) as well as useful for other purposes people may want to develop. I had to change the hashkey for plpgsql's plan caching, but that needs to be done regardless of the API (to prevent problems in the obscure case that someone attaches the same trigger function to the same table for the same events more than once with different trigger names and different transition table names). If you ignore that, the *entire* change to use this in plpgsql is to add these lines to plpgsql_exec_trigger():

	/*
	 * Capture the NEW and OLD transition TABLE tuplestores (if specified for
	 * this trigger).
	 */
	if (trigdata->tg_newtable)
	{
		Tsr tsr = palloc(sizeof(TsrData));

		tsr->name = trigdata->tg_trigger->tgnewtable;
		tsr->tstate = trigdata->tg_newtable;
		tsr->tupdesc = trigdata->tg_relation->rd_att;
		tsr->relid = trigdata->tg_relation->rd_id;
		SPI_register_tuplestore(tsr);
	}
	if (trigdata->tg_oldtable)
	{
		Tsr tsr = palloc(sizeof(TsrData));

		tsr->name = trigdata->tg_trigger->tgoldtable;
		tsr->tstate = trigdata->tg_oldtable;
		tsr->tupdesc = trigdata->tg_relation->rd_att;
		tsr->relid = trigdata->tg_relation->rd_id;
		SPI_register_tuplestore(tsr);
	}

With the new SPI functions, the code to implement this in each other PL should be about the same (possibly identical), and areas using SPI only need similar code to make tuplestores visible to the planner and usable in the executor if someone has another use for this. You just do the above once you have run SPI_connect() and before preparing or executing any query that references the named tuplestore. It remains available on that SPI connection until SPI_finish() is called or you explicitly unregister it (by name). If we use the hooks, I think it will be several times as much code, more invasive, and probably more fragile. More importantly, these hooks are not used by the other PLs included with core, and are not used in any of the other core code, anywhere.
They all use SPI, so they can do the above very easily, but adding infrastructure for them to use the hooks would be a lot more work, and I'm not seeing a corresponding benefit. I think there is some room to change the API as used by the parser, planner, and executor so that no changes are needed there if we add some other registration mechanism besides SPI, but I think having a registry for tuplestores that sort of takes the place of the catalogs and related caches (but is far simpler, being process local) is a better model than what you propose. In summary, thinking of the definition of a named tuplestore as a variable is the wrong parallel; it is more like a lightweight relation, and the comparison should be to that, not to function parameters or local variables. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Why data of timestamptz does not store value of timezone passed to it?
Hi Craig, On Fri, Aug 29, 2014 at 10:17:17AM +0800, Craig Ringer wrote: (...) It should also discuss the approach of storing a (instant timestamptz, timezone text) or (instant timestamptz, tzoffset smallint) tuple for when unambiguous representation is required. (I guess I just volunteered myself to write a draft of that). Please notice that smallint is too small for tzoffset:

SELECT d AT TIME ZONE 'Europe/Berlin'
     - d AT TIME ZONE 'Europe/Paris'
FROM (VALUES (date '1815-10-31'),
             (date '1897-02-19')) AS f(d);

Cheers, Dr. Gianni Ciolli - 2ndQuadrant Italia PostgreSQL Training, Services and Support gianni.cio...@2ndquadrant.it | www.2ndquadrant.it
Re: [HACKERS] [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins
On 2014-08-26 22:04:03 -0400, Robert Haas wrote: That's all I see on a first-read through. I think I fixed all of them. Thanks. There might be other issues, and I haven't checked through it in great detail for mundane bugs, but generally, I favor pressing on relatively rapidly toward a commit. It seems highly likely that this idea is a big win, and if there's some situation in which it's a loss, we're more likely to find out with it in the tree (and thus likely to be tested by many more people) than by analysis from first principles. Attached is the version I plan to commit after going over it again tomorrow. Andres Freund -- Andres Freund  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

From b48a28eee3518548740525243d1a165720a67231 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Fri, 29 Aug 2014 23:31:37 +0200
Subject: [PATCH] Make backend local tracking of buffer pins memory efficient.

Since the dawn of time (aka Postgres95) multiple pins of the same buffer by one backend have been optimized not to modify the shared refcount more than once. This optimization has always used a NBuffers-sized array in each backend keeping track of a backend's pins. That array (PrivateRefCount) was one of the biggest per-backend memory allocations, depending on the shared_buffers setting. Besides the waste of memory it also has proven to be a performance bottleneck when assertions are enabled as we make sure that there are no remaining pins left at the end of transactions. Also, on servers with lots of memory and a correspondingly high shared_buffers setting the amount of random memory accesses can also lead to poor CPU cache efficiency. Because of these reasons a backend's buffer pins are now kept track of in a small statically sized array that overflows into a hash table when necessary. Benchmarks have shown neutral to positive performance results with considerably lower memory usage. Patch by me, review by Robert Haas.
---
 contrib/pg_buffercache/pg_buffercache_pages.c |   2 +-
 src/backend/storage/buffer/buf_init.c         |  39 +--
 src/backend/storage/buffer/bufmgr.c           | 418 --
 src/include/storage/bufmgr.h                  |  19 --
 4 files changed, 391 insertions(+), 87 deletions(-)

diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index b1be98f..d00f985 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -37,7 +37,7 @@ typedef struct
 	/*
 	 * An int32 is sufficiently large, as MAX_BACKENDS prevents a buffer from
 	 * being pinned by too many backends and each backend will only pin once
-	 * because of bufmgr.c's PrivateRefCount array.
+	 * because of bufmgr.c's PrivateRefCount infrastructure.
 	 */
 	int32		pinning_backends;
 } BufferCachePagesRec;

diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index e03394c..ff6c713 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -20,7 +20,6 @@
 BufferDesc *BufferDescriptors;
 char	   *BufferBlocks;
-int32	   *PrivateRefCount;

 /*
@@ -50,16 +49,9 @@ int32	   *PrivateRefCount;
  *
  * refcount -- Counts the number of processes holding pins on a buffer.
  *		A buffer is pinned during IO and immediately after a BufferAlloc().
- *		Pins must be released before end of transaction.
- *
- * PrivateRefCount -- Each buffer also has a private refcount that keeps
- *		track of the number of times the buffer is pinned in the current
- *		process.  This is used for two purposes: first, if we pin a
- *		a buffer more than once, we only need to change the shared refcount
- *		once, thus only lock the shared state once; second, when a transaction
- *		aborts, it should only unpin the buffers exactly the number of times it
- *		has pinned them, so that it will not blow away buffers of another
- *		backend.
+ *		Pins must be released before end of transaction.  For efficiency the
+ *		shared refcount isn't increased if a individual backend pins a buffer
+ *		multiple times. Check the PrivateRefCount infrastructure in bufmgr.c.
  */
@@ -130,31 +122,6 @@ InitBufferPool(void)
 }

 /*
- * Initialize access to shared buffer pool
- *
- * This is called during backend startup (whether standalone or under the
- * postmaster).  It sets up for this backend's access to the already-existing
- * buffer pool.
- *
- * NB: this is called before InitProcess(), so we do not have a PGPROC and
- * cannot do LWLockAcquire; hence we can't actually access stuff in
- * shared memory yet.  We are only initializing local data here.
- * (See also InitBufferPoolBackend, over in bufmgr.c.)
- */
-void
-InitBufferPoolAccess(void)
-{
-	/*
-	 * Allocate and zero local arrays of per-buffer info.
-	 */
-	PrivateRefCount = (int32 *) calloc(NBuffers, sizeof(int32));
-	if
Re: [HACKERS] Inverse of pg_get_serial_sequence?
Andres Freund-3 wrote: Hi, We have pg_get_serial_sequence() mapping (relation, column) to the sequence. What I'm missing right now is the inverse. I.e. given a sequence, tell me the owner. describe.c has a query for that, and it's not too hard to write, but it still seems 'unfriendly' not to provide it. Does anybody dislike adding a function for that? I can't really think of a good name (not that pg_get_serial_sequence is well named). pg_get_serial_sequence_owner(serial regclass, OUT rel regclass, OUT colname name) maybe? On a pure consistency basis: pg_get_sequence_serial(...) [though probably plural: _serials(...)] I'd drop the serial part altogether for the more appropriate: pg_get_sequence_ownedby(...) Given that ALTER SEQUENCE ... OWNED BY ... is the corresponding SQL, the inverse of what you proposed above would probably be more like: pg_get_owned_sequence(...) Reminder: sequences can be unowned. Ownership and usage via default are separate things though: do you have need to know all users of a sequence, or only the single one that is defined as its owner? pg_get_sequence_users(...) [or serials: as noted first] David J.
Re: [HACKERS] Inverse of pg_get_serial_sequence?
On 2014-08-29 17:55:38 -0700, David G Johnston wrote: Andres Freund-3 wrote: Hi, We have pg_get_serial_sequence() mapping (relation, column) to the sequence. What I'm missing right now is the inverse. I.e. given a sequence, tell me the owner. describe.c has a query for that, and it's not too hard to write, but it still seems 'unfriendly' not to provide it. Does anybody dislike adding a function for that? I can't really think of a good name (not that pg_get_serial_sequence is well named). pg_get_serial_sequence_owner(serial regclass, OUT rel regclass, OUT colname name) maybe? On a pure consistency basis: pg_get_sequence_serial(...) [though probably plural: _serials(...)] Yea, but that's just horrid. I'd drop the serial part altogether for the more appropriate: pg_get_sequence_ownedby(...) My problem is that that could possibly be confused with the user owning the sequence :/ Reminder: sequences can be unowned. Don't you say. Ownership and usage via default are separate things though: do you have need to know all users of a sequence, or only the single one that is defined as its owner? I'd rather know all its users, but that's not really possible in the general case without guessing. I'll settle for the column that's declared as owning it. Even if we had an interface for guessing I'd not want it to be the same as the one returning the declared owner. Greetings, Andres Freund -- Andres Freund  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Inverse of pg_get_serial_sequence?
Andres Freund-3 wrote: On 2014-08-29 17:55:38 -0700, David G Johnston wrote: Andres Freund-3 wrote: pg_get_sequence_ownedby(...) My problem is that that could possibly be confused with the user owning the sequence :/ Though as soon as that person reads the output their misunderstanding would be obvious. I think it's fine, but ownedbycol or owningcol would be ok. David J.
Re: [HACKERS] On partitioning
On Sat, Aug 30, 2014 at 12:56 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Prompted by a comment in the UPDATE/LIMIT thread, I saw Marko Tiikkaja reference Tom's post http://www.postgresql.org/message-id/1598.1399826...@sss.pgh.pa.us which mentions the possibility of a different partitioning implementation than what we have so far. As it turns out, I've been thinking about partitioning recently, so I thought I would share what I'm thinking so that others can poke holes. My intention is to try to implement this as soon as possible. +1.
Re: [HACKERS] ALTER SYSTEM RESET?
On Wed, Aug 27, 2014 at 7:16 PM, Fujii Masao masao.fu...@gmail.com wrote: The patch looks good to me. One minor comment is; probably you need to update the tab-completion code. Thanks for the review. I have updated the patch to support tab-completion. As this is a relatively minor change, I will mark it as Ready For Committer rather than Needs Review. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com alter_system_reset.v5.patch Description: Binary data