Re: [HACKERS] Collect frequency statistics for arrays
Hi! Thanks for your fixes to the patch. They look correct to me. I did some further fixes in the patch. The proof of some of the concepts is still needed; I'm going to provide it in a few days.

On Thu, Jan 12, 2012 at 3:06 PM, Noah Misch n...@leadboat.com wrote:

I'm not sure about a shared lossy counting module, because the shared part of the code would be relatively small. The part of the compute_array_stats function that takes care of array decompression, distinct occurrence calculation, the distinct element count histogram, packing statistics slots etc. is much larger than the lossy counting algorithm itself. Maybe there are other opinions in the community?

True; it would probably increase total lines of code. The benefit, if any, lies in separation of concerns; the business of implementing this algorithm is quite different from the other roles of these typanalyze functions. I won't insist that you try it, though.

I'd prefer to try it as a separate patch.

+	/*
+	 * The probability of no occurrence of the events that form the "rest"
+	 * probability tends to exp(-rest) as the number of events goes to
+	 * infinity. A further simplification is to replace those events with
+	 * one event of probability (1 - exp(-rest)).
+	 */
+	rest = 1.0f - exp(-rest);

What is the name of the underlying concept in probability theory?

The closest concept to the calculated distribution is the multinomial distribution. But it's not exactly the same, because the multinomial distribution gives the probability of a particular count of each event's occurrences, not the probability of any occurrence. Actually, the distribution is calculated just from the assumption that the events are independent. The closest concept for the "rest" probability is approximation by an exponential distribution. It's quite a rough approximation, but I can't invent something better with low computational complexity.

Do you have a URL of a tutorial or paper that explains the method in more detail? If, rather, this is a novel synthesis, could you write a proof to include in the comments?

Unfortunately I don't have a relevant paper for it. I'll try to search for one. Otherwise I'll try to write some proof.

+ /*
+  * Array selectivity estimation based on most common elements statistics for
+  * the "column <@ const" case. The assumption that element occurrences are
+  * independent implies a certain distribution of array lengths. Typically the
+  * real distribution of lengths is significantly different from it. For
+  * example, even if we have a set of arrays with 1 integer element in the
+  * range [0;10] each, element occurrences are not independent. Because in
+  * the case of independence we

Do you refer to a column where '{1,12,46}' and '{13,7}' may appear, but '{6,19,4}' cannot appear?

I refer to a column where exactly one element exists, i.e. the only possible values are '{0}', '{1}', '{2}', '{3}', '{4}', '{5}', '{6}', '{7}', '{8}', '{9}', '{10}'. That is a corner case, but a similar situation occurs when, for example, the distinct element count is distributed between 1 and 3. That differs significantly from the distribution implied by independent occurrence.

Oh, I think I see now. If each element 1..10 had frequency 0.1 independently, column values would have exactly one distinct element just 39% of the time?

Yes, that's right.

If probability theory has a prototypical problem resembling this, it would be nice to include a URL to a thorough discussion thereof. I could not think of the search terms to find one, though.
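(For reference, the 39% figure is just a binomial term: with ten elements, each present independently with frequency 0.1, the number of distinct elements present in a row follows Binomial(10, 0.1), so P(exactly one) = C(10,1) * 0.1 * 0.9^9 ~ 0.387. A tiny standalone check, illustrative only and not part of the patch:

#include <math.h>
#include <stdio.h>

int
main(void)
{
	/* P(exactly one of ten independent events, each of probability 0.1) */
	double		p = 10 * 0.1 * pow(0.9, 9);

	printf("P(exactly one distinct element) = %.3f\n", p);	/* 0.387 */
	return 0;
}
)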
Actually, using both the distinct element count histogram and the element occurrence frequencies produces some probability distribution (which is more complex than just independent element occurrence). If the real distribution is close to this distribution, then the estimate is accurate. I haven't met such distributions in papers; actually, I just invented it in my attempts to do accurate "column <@ const" estimation at least in simple cases. I'll try to search for similar things in papers, but I doubt I'll succeed. Otherwise I'll try to write a more detailed proof.

*** /dev/null
--- b/src/backend/utils/adt/array_selfuncs.c
+ Selectivity
+ calc_scalararraysel(VariableStatData *vardata, Datum constval, bool orClause,
+ 					Oid operator)
+ {
+ 	Oid			elemtype;
+ 	Selectivity	selec;
+ 	TypeCacheEntry *typentry;
+ 	Datum	   *hist;
+ 	int			nhist;
+ 	FunctionCallInfoData cmpfunc;
+
+ 	elemtype = get_base_element_type(vardata->vartype);
+
+ 	/* Get default comparison function */
+ 	typentry = lookup_type_cache(elemtype,
+ 		TYPECACHE_CMP_PROC | TYPECACHE_CMP_PROC_FINFO | TYPECACHE_EQ_OPR);
+
+ 	/* Handle only "=" operator. Return default selectivity in other cases. */
+ 	if (operator !=
Re: [HACKERS] Speed dblink using alternate libpq tuple storage
Hello, This is a revised and rebased version of the patch.

a. The old term 'Add Tuple Function' is changed to 'Store Handler'. The reason for not using 'storage' is simply the length of the symbols.

b. I couldn't find a place to settle PGgetAsCString() in. It is removed, and storeHandler()@dblink.c touches PGresAttValue directly in this new patch. The definition of PGresAttValue stays in libpq-fe.h and is provided with a comment.

c. Refined the error handling of dblink.c. I think it preserves the previous behavior for column number mismatches and type conversion exceptions.

d. The documentation is revised.

It jumped from 332K tuples/sec to 450K, a 35% gain, and had a lower memory footprint too. Test methodology and those results are at http://archives.postgresql.org/pgsql-hackers/2011-12/msg8.php

It is a disappointment that I found the gain to be lower than that on re-measuring. On CentOS 6.2, with other conditions the same as in the previous testing, the overall performance became higher; the loss with the libpq patch alone was 1.8%, and the gain with the full patch had fallen to 5.6%. But the reduction in memory usage was unchanged.

Original              : 3.96s  100.0%
w/ libpq patch        : 4.03s  101.8%
w/ libpq+dblink patch : 3.74s   94.4%

The attachments are listed below.

libpq_altstore_20120117.patch - Allow alternative storage for libpq.
dblink_perf_20120117.patch - Modify dblink to use the alternative storage mechanism.
libpq_altstore_doc_20120117.patch - Documentation for libpq_altstore. Shown in "31.19. Alternative result storage".

regards,
-- Kyotaro Horiguchi NTT Open Source Software Center

diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt
index 1af8df6..83525e1 100644
--- a/src/interfaces/libpq/exports.txt
+++ b/src/interfaces/libpq/exports.txt
@@ -160,3 +160,6 @@ PQconnectStartParams 157
 PQping                    158
 PQpingParams              159
 PQlibVersion              160
+PQregisterStoreHandler    161
+PQgetStoreHandlerParam    163
+PQsetStoreHandlerErrMes   164
diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index d454538..5559f0b 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -2692,6 +2692,7 @@ makeEmptyPGconn(void)
 	conn->allow_ssl_try = true;
 	conn->wait_ssl_try = false;
 #endif
+	conn->storeHandler = NULL;

 	/*
 	 * We try to send at least 8K at a time, which is the usual size of pipe
@@ -5076,3 +5077,10 @@ PQregisterThreadLock(pgthreadlock_t newhandler)
 	return prev;
 }
+
+void
+PQregisterStoreHandler(PGconn *conn, StoreHandler func, void *param)
+{
+	conn->storeHandler = func;
+	conn->storeHandlerParam = param;
+}
diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index b743566..96e5974 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -67,6 +67,10 @@
 static int	PQsendDescribe(PGconn *conn, char desc_type,
 						   const char *desc_target);
 static int	check_field_number(const PGresult *res, int field_num);
+static void *pqDefaultStoreHandler(PGresult *res, PQStoreFunc func,
+								   int id, size_t len);
+static void *pqAddTuple(PGresult *res, PGresAttValue *tup);
+
 /*
  * Space management for PGresult.
@@ -160,6 +164,9 @@ PQmakeEmptyPGresult(PGconn *conn, ExecStatusType status)
 	result->curBlock = NULL;
 	result->curOffset = 0;
 	result->spaceLeft = 0;
+	result->storeHandler = pqDefaultStoreHandler;
+	result->storeHandlerParam = NULL;
+	result->storeHandlerErrMes = NULL;

 	if (conn)
 	{
@@ -194,6 +201,12 @@ PQmakeEmptyPGresult(PGconn *conn, ExecStatusType status)
 		}
 		result->nEvents = conn->nEvents;
 	}
+
+	if (conn->storeHandler)
+	{
+		result->storeHandler = conn->storeHandler;
+		result->storeHandlerParam = conn->storeHandlerParam;
+	}
 	}
 	else
 	{
@@ -487,6 +500,33 @@ PQresultAlloc(PGresult *res, size_t nBytes)
 	return pqResultAlloc(res, nBytes, TRUE);
 }

+void *
+pqDefaultStoreHandler(PGresult *res, PQStoreFunc func, int id, size_t len)
+{
+	void	   *p;
+
+	switch (func)
+	{
+		case PQSF_ALLOC_TEXT:
+			return pqResultAlloc(res, len, TRUE);
+
+		case PQSF_ALLOC_BINARY:
+			p = pqResultAlloc(res, len, FALSE);
+
+			if (id == -1)
+				res->storeHandlerParam = p;
+
+			return p;
+
+		case PQSF_ADD_TUPLE:
+			return pqAddTuple(res, res->storeHandlerParam);
+
+		default:
+			/* Ignore */
+			break;
+	}
+	return NULL;
+}

 /*
  * pqResultAlloc -
  *	  Allocate subsidiary storage for a PGresult.
@@ -830,9 +870,9 @@ pqInternalNotice(const PGNoticeHooks *hooks, const char *fmt,...)
 /*
  * pqAddTuple
  *	  add a row pointer to the PGresult structure, growing it if necessary
- * Returns TRUE if OK, FALSE if not enough memory to add the row
+ * Returns tup if OK, NULL if not enough memory to add the row.
  */
-int
+static void *
 pqAddTuple(PGresult *res, PGresAttValue *tup)
 {
 	if (res->ntups >= res->tupArrSize)
@@ -858,13 +898,13 @@ pqAddTuple(PGresult *res,
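To illustrate how a client might use this API, here is a minimal, hypothetical handler. The PQSF_* request protocol and the PQgetStoreHandlerParam() accessor are inferred from the patch excerpt above; MyStore and the my_store_* helpers are invented for the example, so treat this as a sketch rather than the actual interface:

#include <libpq-fe.h>

/* Invented application-side storage; not part of the patch. */
typedef struct MyStore MyStore;
extern void *my_store_alloc(MyStore *store, int column, size_t len);
extern void *my_store_commit_row(MyStore *store);

static void *
my_store_handler(PGresult *res, PQStoreFunc func, int id, size_t len)
{
	/* Assumes the registered param is retrievable from the result. */
	MyStore    *store = (MyStore *) PQgetStoreHandlerParam(res);

	switch (func)
	{
		case PQSF_ALLOC_TEXT:
		case PQSF_ALLOC_BINARY:
			/* hand libpq our own buffer for the value of column "id" */
			return my_store_alloc(store, id, len);

		case PQSF_ADD_TUPLE:
			/* all columns of the current row are filled in; keep the row */
			return my_store_commit_row(store);

		default:
			return NULL;
	}
}

Registration would happen once per connection, before sending the query:

	PQregisterStoreHandler(conn, my_store_handler, store);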
Re: [HACKERS] foreign key locks, 2nd attempt
On Sun, Jan 15, 2012 at 01:49:54AM -0300, Alvaro Herrera wrote:

- I'm not sure that the multixact truncation code is sane on checkpoints. It might be that I need to tweak more the pg_control info we keep about truncation. The whole truncation thing needs more testing, too.

My largest outstanding concern involves the possibility of MultiXactId wraparound. From my last review:

This raises a notable formal hazard: it's possible to burn through the MultiXactId space faster than the regular TransactionId space. We could get into a situation where pg_clog is covering 2B xids, and yet we need 4B MultiXactId to cover that period. We had better at least notice this and halt, if not have autovacuum actively prevent it.

(That should have been 2B rather than 4B, since MultiXactId uses the same 2B-in-past, 2B-in-future behavior as regular xids.) Are we willing to guess that this will never happen and make recovery minimally possible? If so, we could have GetNewMultiXactId() grow defenses similar to GetNewTransactionId() and leave it at that. If not, we need to involve autovacuum.

The other remaining high-level thing is to have key_attrs contain only columns actually referenced by FKs.

- Go over Noah's two reviews again and see if I missed anything; also make sure I haven't missed anything from other reviewers.

There are some, yes.

*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 2773,2783 **** l2:
  	}
  	else if (result == HeapTupleBeingUpdated && wait)
  	{
! 		TransactionId xwait;
  		uint16		infomask;

  		/* must copy state data before unlocking buffer */
! 		xwait = HeapTupleHeaderGetXmax(oldtup.t_data);
  		infomask = oldtup.t_data->t_infomask;
  		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
--- 2848,2871 ----
  	}
  	else if (result == HeapTupleBeingUpdated && wait)
  	{
! 		TransactionId xwait;
  		uint16		infomask;
+ 		bool		none_remain = false;

Nothing can ever set this variable to anything different. It seems that keep_xact == InvalidTransactionId substitutes well enough, though.

  	/*
  	 * We may overwrite if previous xmax aborted, or if it committed but
! 	 * only locked the tuple without updating it, or if we are going to
! 	 * keep it around in Xmax.
  	 */

The second possibility is just a subset of the third.

! 	if (TransactionIdIsValid(keep_xmax) ||
! 		none_remain ||
! 		(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID))
  		result = HeapTupleMayBeUpdated;
  	else
  		result = HeapTupleUpdated;

***************
*** 3314,3323 **** heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
  */
 static bool
 HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
!					   HeapTuple oldtup, HeapTuple newtup)
 {
 	int			attrnum;

 	while ((attrnum = bms_first_member(hot_attrs)) >= 0)
 	{
 		/* Adjust for system attributes */
--- 3537,3549 ----
  */
 static bool
 HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
!					   HeapTuple oldtup, HeapTuple newtup, bool empty_okay)
 {
 	int			attrnum;

+	if (!empty_okay && bms_is_empty(hot_attrs))
+		return false;

When a table contains no key attributes, it seems arbitrary whether we call the key revoked or not. What is the motivation for this change?

! 	/*
! 	 * If we're requesting KeyShare, and there's no update present, we
! 	 * don't need to wait. Even if there is an update, we can still
! 	 * continue if the key hasn't been modified.
! 	 *
! 	 * However, if there are updates, we need to walk the update chain
! 	 * to mark future versions of the row as locked, too. That way, if
! 	 * somebody deletes that future version, we're protected against the
! 	 * key going away. This locking of future versions could block
! 	 * momentarily, if a concurrent transaction is deleting a key; or it
! 	 * could return a value to the effect that the transaction deleting the
! 	 * key has already committed. So we do this before re-locking the
! 	 * buffer; otherwise this would be prone to deadlocks. Note that the TID
! 	 * we're locking was grabbed before we unlocked the buffer. For it to
! 	 * change while we're not looking, the other properties we're testing
! 	 * for below after re-locking the buffer would also change, in which
!
Re: [HACKERS] foreign key locks, 2nd attempt
On Mon, Jan 16, 2012 at 04:52:36PM -0300, Alvaro Herrera wrote:

Excerpts from Heikki Linnakangas's message of lun ene 16 16:17:42 -0300 2012: On 15.01.2012 06:49, Alvaro Herrera wrote:

--- 164,178 ----
 #define HEAP_HASVARWIDTH		0x0002	/* has variable-width attribute(s) */
 #define HEAP_HASEXTERNAL		0x0004	/* has external stored attribute(s) */
 #define HEAP_HASOID				0x0008	/* has an object-id field */
! #define HEAP_XMAX_KEYSHR_LOCK	0x0010	/* xmax is a key-shared locker */
 #define HEAP_COMBOCID			0x0020	/* t_cid is a combo cid */
 #define HEAP_XMAX_EXCL_LOCK		0x0040	/* xmax is exclusive locker */
! #define HEAP_XMAX_IS_NOT_UPDATE	0x0080	/* xmax, if valid, is only a locker.
! 										 * Note this is not set unless
! 										 * XMAX_IS_MULTI is also set. */
!
! #define HEAP_LOCK_BITS	(HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_IS_NOT_UPDATE | \
! 						 HEAP_XMAX_KEYSHR_LOCK)
 #define HEAP_XMIN_COMMITTED		0x0100	/* t_xmin committed */
 #define HEAP_XMIN_INVALID		0x0200	/* t_xmin invalid/aborted */
 #define HEAP_XMAX_COMMITTED		0x0400	/* t_xmax committed */

HEAP_XMAX_IS_NOT_UPDATE is a pretty opaque name for that. From the name, I'd say that a DELETE would set that, but the comment says it has to do with locking. After going through all the combinations in my mind, I think I finally understood it: HEAP_XMAX_IS_NOT_UPDATE is set if xmax is a multi-xact that represents only locking xids, no updates. How about calling it HEAP_XMAX_LOCK_ONLY, and setting it whenever xmax is only a locker, whether or not it is a multi-xid?

Hm, sounds like a good idea. Will do.

+1

Why are you renaming HeapTupleHeaderGetXmax() into HeapTupleHeaderGetRawXmax()? Any current callers of HeapTupleHeaderGetXmax() should already check that HEAP_XMAX_IS_MULTI is not set.

I had this vague impression that it'd be better to break existing callers so that they ensure they decide between HeapTupleHeaderGetRawXmax and HeapTupleHeaderGetUpdateXid. Noah suggested changing the macro name, too. It's up to each caller to decide what semantics they want. Most want the latter; and callers outside core are more likely to want that one. If we kept the old name, they would get the wrong value.

My previous suggestion was to have both macros:

#define HeapTupleHeaderGetXmax(tup) \
( \
	AssertMacro(!((tup)->t_infomask & HEAP_XMAX_IS_MULTI)), \
	HeapTupleHeaderGetRawXmax(tup) \
)

#define HeapTupleHeaderGetRawXmax(tup) \
( \
	(tup)->t_choice.t_heap.t_xmax \
)

No strong preference, though.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] BGWriter latch, power saving
On 04.01.2012 17:05, Peter Geoghegan wrote:

On 4 January 2012 07:24, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: I think SetBufferCommitInfoNeedsSave() needs the same treatment as MarkBufferDirty(). And it would probably be good to only set the latch if the buffer wasn't dirty already. Setting a latch that's already set is fast, but surely it's even faster to not even try.

That seems reasonable. Revised patch is attached.

Thanks! It occurs to me that it's still a bad idea to call SetLatch() while holding the buffer header spinlock, at least when it's trivial to just move the call after releasing the lock. See attached.

Could you do the sleeping/hibernating logic all in BgWriterNap()?

Yeah, I'd like to see a micro-benchmark of a worst-case scenario. I'm a bit worried about the impact on systems with a lot of CPUs. If you have a lot of CPUs writing to the same cache line that contains the latch's flag, that might get expensive.

Also reasonable, but I don't think that I'll get around to it until after the final commitfest deadline.

That's still pending. And I admit I haven't tested my updated version beyond "make check".

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

*** a/src/backend/postmaster/bgwriter.c
--- b/src/backend/postmaster/bgwriter.c
***************
*** 51,56 ****
--- 51,58 ----
  #include "storage/ipc.h"
  #include "storage/lwlock.h"
  #include "storage/pmsignal.h"
+ #include "storage/proc.h"
+ #include "storage/procsignal.h"
  #include "storage/shmem.h"
  #include "storage/smgr.h"
  #include "storage/spin.h"
***************
*** 73,83 ****
  static volatile sig_atomic_t shutdown_requested = false;

  /*
   * Private state
   */
  static bool am_bg_writer = false;

  /* Prototypes for private functions */
! static void BgWriterNap(void);

  /* Signal handlers */
--- 75,87 ----
  /*
   * Private state
   */
+ #define BGWRITER_HIBERNATE_MS 1

  static bool am_bg_writer = false;

  /* Prototypes for private functions */
! static void BgWriterNap(bool hibernating);

  /* Signal handlers */
***************
*** 97,102 **** BackgroundWriterMain(void)
--- 101,107 ----
  {
  	sigjmp_buf	local_sigjmp_buf;
  	MemoryContext bgwriter_context;
+ 	bool		hibernating;

  	am_bg_writer = true;
***************
*** 123,129 **** BackgroundWriterMain(void)
  	pqsignal(SIGQUIT, bg_quickdie);		/* hard crash time */
  	pqsignal(SIGALRM, SIG_IGN);
  	pqsignal(SIGPIPE, SIG_IGN);
! 	pqsignal(SIGUSR1, SIG_IGN);			/* reserve for ProcSignal */
  	pqsignal(SIGUSR2, SIG_IGN);

  	/*
--- 128,134 ----
  	pqsignal(SIGQUIT, bg_quickdie);		/* hard crash time */
  	pqsignal(SIGALRM, SIG_IGN);
  	pqsignal(SIGPIPE, SIG_IGN);
! 	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
  	pqsignal(SIGUSR2, SIG_IGN);

  	/*
***************
*** 139,144 **** BackgroundWriterMain(void)
--- 144,155 ----
  	sigdelset(&BlockSig, SIGQUIT);

  	/*
+ 	 * Advertise our latch that backends can use to wake us up while we're
+ 	 * sleeping.
+ 	 */
+ 	ProcGlobal->bgwriterLatch = &MyProc->procLatch;
+
+ 	/*
  	 * Create a resource owner to keep track of our resources (currently only
  	 * buffer pins).
  	 */
***************
*** 235,242 **** BackgroundWriterMain(void)
--- 246,256 ----
  	/*
  	 * Loop forever
  	 */
+ 	hibernating = false;
  	for (;;)
  	{
+ 		bool		lapped;
+
  		/*
  		 * Emergency bailout if postmaster has died. This is to avoid the
  		 * necessity for manual cleanup of all postmaster children.
***************
*** 264,281 **** BackgroundWriterMain(void)
  		/*
  		 * Do one cycle of dirty-buffer writing.
  		 */
! 		BgBufferSync();

! 		/* Nap for the configured time. */
! 		BgWriterNap();
  	}
  }

 /*
  * BgWriterNap -- Nap for the configured time or until a signal is received.
  */
 static void
! BgWriterNap(void)
 {
 	long		udelay;
--- 278,333 ----
  		/*
  		 * Do one cycle of dirty-buffer writing.
  		 */
! 		lapped = BgBufferSync();
!
! 		if (lapped && !hibernating)
! 		{
! 			/*
! 			 * Initiate hibernation by arming the process latch. This usage
! 			 * differs from the standard pattern for latches outlined in
! 			 * latch.h, where the latch is generally reset immediately after
! 			 * WaitLatch returns, in the next iteration of an infinite loop.
! 			 *
! 			 * It should only be possible to *really* set the latch from client
! 			 * backends while the bgwriter is idle, and not during productive
! 			 * cycles where buffers are written, or shortly thereafter. It's
! 			 * important that the SetLatch() call within the buffer manager
! 			 * usually inexpensively returns having found that the latch is
! 			 * already set. For write-heavy workloads, it's quite possible that
! 			 * those calls will never actually get the opportunity to really
! 			 * set the latch.
! 			 *
! 			 * By only giving backends the opportunity to really set the
! 			 * latch at this point via ResetLatch(), we avoid the overhead of
! 			 * sending a signal perhaps as much
Re: [HACKERS] Collect frequency statistics for arrays
On Tue, Jan 17, 2012 at 12:04:06PM +0400, Alexander Korotkov wrote:

Thanks for your fixes to the patch. They look correct to me. I did some fixes in the patch. The proof of some of the concepts is still needed. I'm going to provide it in a few days.

Your further fixes look good. Could you also answer my question about the header comment of mcelem_array_contained_selec()?

/*
 * Estimate selectivity of "column <@ const" based on most common element
 * statistics. Independent element occurrence would imply a particular
 * distribution of distinct element counts among matching rows. Real data
 * usually falsifies that assumption. For example, in a set of 1-element
 * integer arrays having elements in the range [0;10], element occurrences are
 * not independent. If they were, a sufficiently-large set would include all
 * distinct element counts 0 through 11. We correct for this using the
 * histogram of distinct element counts.
 *
 * In the "column @> const" and "column && const" cases, we usually have a
 * const with low summary frequency of elements (otherwise we have
 * selectivity close to 0 or 1 correspondingly). That's why the effect of
 * dependence related to distinct element counts distribution is negligible
 * there. In the "column <@ const" case, summary frequency of elements is
 * high (otherwise we have selectivity close to 0). That's why we should do
 * correction due to array distinct element counts distribution.
 */

By "summary frequency of elements", do you mean literally P_0 + P_1 ... + P_N? If so, I can follow the above argument for "column && const" and "column <@ const", but not for "column @> const". For "column @> const", selectivity cannot exceed the smallest frequency among const elements. A number of high-frequency elements will drive up the sum of the frequencies without changing the true selectivity much at all.

Thanks, nm

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL Restore process during recovery
On Tue, Jan 17, 2012 at 6:52 AM, Fujii Masao masao.fu...@gmail.com wrote:

On Mon, Jan 16, 2012 at 2:06 AM, Simon Riggs si...@2ndquadrant.com wrote: WALRestore process asynchronously executes restore_command while recovery continues working. Overlaps downloading of the next WAL file to reduce time delays in file-based archive recovery. Handles the file-only and streaming/file cases correctly.

Though I've not reviewed the patch deeply yet, I observed the following two problems when I tested the patch. When I set up streaming replication + archive (i.e., restore_command is set) and started the standby, I got the following error:

FATAL:  all AuxiliaryProcs are in use
LOG:  walrestore process (PID 18839) exited with exit code 1

When I started an archive recovery without setting restore_command, it successfully finished.

Oh, I did have NUM_AUXILIARY_PROCS increased at one point, but I realised it wasn't needed and removed it. Will change that. Thanks.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] VACUUM in SP-GiST
hi,

I started reading the spgist vacuum code last night, and didn't like it at all. I found a number of smaller issues, but it seems to me that the design is just fundamentally wrong. Unless I'm misunderstanding it, the approach is to recursively traverse the tree in sort of the same way that a full-index search would do, except that it continues to hold lock on the immediate parent page of whichever page is currently being visited, so that it can update the downlink if it has to delete the first leaf tuple of a chain. I've got several complaints about that:

1. Holding exclusive locks on upper pages while we visit possibly hundreds of leaf pages seems pretty bad from a concurrency standpoint. It doesn't hold locks all the way up to the root, so it's not a complete loss, but even a first-level inner page can have a lot of children.

2. I do not trust this code to visit every leaf page (which, if it failed to do so and thereby missed deleting a heap reference, would result in a silently corrupt index). Unlike a regular index search, which makes a list of lower-level targets while examining an inner tuple just once, this code depends on the assumption that it can reacquire lock on an upper page and then resynchronize itself with whatever's been done meanwhile to the inner tuple it was looking at. I don't trust that. The mechanism that's being used relies on the page LSN having been changed if anything was changed on the page, which does not work in unlogged tables. Furthermore, if the inner tuple did change, it has to rescan *everything* that the inner tuple leads to. That's horrendously inefficient, and it's not even clear that it would ever terminate in the face of a constant stream of concurrent updates.

3. Even if it all worked, it would be slow as can be, because it requires a whole lot of nonsequential accesses to the index. For instance, the same leaf page might be visited many times (once for each leaf chain on the page), and not necessarily close together either.

Also, the code doesn't seem to perform any cleanup at all on inner pages. I am not expecting it to try to get rid of childless inner tuples, but surely it ought to at least convert old redirects to dead and get rid of dead tuples at the end of the page, much as for leaf pages?

What I'm envisioning doing instead is making a single sequential scan over the entire index. On both leaf and inner pages we could clean redirects and remove dead tuples at the end of the page. On leaf pages we'd also check for tuples to be deleted. There's no real difficulty in removing deleted tuples as long as they are not at the heads of their chains. (The code would have to reverse-engineer which tuples are at the heads of their chains, but that's easily done while scanning the page; we just make a reverse-lookup map for the nextOffset pointers.) If we have to delete a tuple at the head of its chain, but there's still at least one live tuple in its chain, we could reshuffle the tuples to bring another live tuple to that same offset number, thereby not invalidating the parent downlink.

The only hard part is what to do when we delete all members of a chain: we can't reset the parent downlink because we do not know where it is. What I propose to do in that case is install a variant form of dead tuple that just indicates "this chain is empty". One way to represent that would be a redirect tuple pointing nowhere, but I think it would be cleaner to repurpose the isDead and isRedirect bits as a two-bit field with four states.
We'd have LIVE, DEAD, REDIRECT, and this fourth state for a dead tuple that is not recyclable because we know there's a link to it somewhere. A scan would ignore such a tuple. An insertion that arrives at such a tuple could either replace it (if there's enough room on the page) or convert it to a redirect (if not). Last night I was thinking the fourth state could be named TOMBSTONE, but maybe it'd be better to use DEAD for this case and rename the existing removable dead tuple state to PLACEHOLDER, since that case only exists to preserve the offset numbers of other tuples on the page.

Comments?

with a concurrent split moving leaf tuples between pages, can't that make bulkdelete fail to report some TIDs, and make CREATE INDEX CONCURRENTLY create duplicates?

YAMAMOTO Takashi

regards, tom lane

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] automating CF submissions (was xlog location arithmetic)
Excerpts from Greg Smith's message of lun ene 16 19:25:50 -0300 2012: On 01/16/2012 03:48 PM, Josh Berkus wrote: I'll also point out that the process for *applying* a patch, if you don't subscribe to hackers and keep archives around on your personal machine for months, is also very cumbersome and error-prone. Copy and paste from a web page? Really? The most reasonable answer to this is for people to publish a git repo URL in addition to the official submission of changes to the list in patch form. It's expected that we'll get a more reasonable interface to attachments, one that will allow you to download patches separately. (Currently, attachments that have mime types other than text/plain are already downloadable separately). -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Generate call graphs in run-time
On Mon, Jan 16, 2012 at 2:23 PM, Martin Pihlak martin.pih...@gmail.com wrote:

My approach was to add a parent oid to the per-backend function stats structure - PgStat_BackendFunctionEntry. Also, I changed the hash key for that structure to an (oid, parent) pair. This means that within the backend, function usage is always tracked in the context of the calling function. This has the nice property that you get the per-parent usage stats as well. Also, the additional lists for parent tracking are avoided. During pgstat_report_stat() the call graph (with stats) is output to the logs and the statistics are uploaded to the collector -- with the parent oid removed.

Since you only care about the parentfuncid one level up, it looks like you will only be able to get a total call graph of all possible function calls, and not each unique call graph per transaction. If you have two separate transactions:

T1: a->b, b->c
T2: b->d

you would have two unique call graphs, {a->b, b->c} and {b->d}. The global call graph, where you only keep track of all unique parentfuncid->funcid pairs, would be {a->b, b->c, b->d}, which lacks the information on which distinct call graphs are actually being executed per transaction.

Also, why remove the parent oid when uploading the statistics to the collector? It would be nice to have the statistics for each function per parent, to see where you have a bottleneck that might only occur in a function when called from a specific parent. Even more fine-grained would be to have the statistics per unique call graph, i.e. the entire tree of functions called in the transaction.

There is a patch for this and we do use it in production for occasional troubleshooting and dependency analysis. Can't attach immediately though -- it has some extra cruft in it that needs to be cleaned up.

I would highly appreciate a patch. Don't worry about cleaning it up, I can do that, unless it's some code you can't share for other reasons.

Indeed. Something like a pg_stat_user_function_details view would be very useful. Something along the lines of:

    Column    |  Type  |
--------------+--------+
 funcid       | oid    |
 parentfuncid | oid    |   -- new
 schemaname   | name   |
 funcname     | name   |
 calls        | bigint |
 total_time   | bigint |
 self_time    | bigint |

funcid/parentfuncid might be sufficient for performance optimizations, but to automatically generate directional graphs of all unique call graphs at run time, you would need all the unique pairs of funcid/parentfuncid as a single column, probably a sorted array of oids[][], for example [[1,2],[1,3],[2,4],[2,5]] if the call graph were {1->2, 1->3, 2->4, 2->5}.

And then rewrite pg_stat_user_functions by aggregating the detailed view. That'd make the individual pg_stat_get_function* functions a bit slower, but that is probably a non-issue - at least not if the pg_stat_user_functions view is rewritten to use an SRF.

regards, Martin

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
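For concreteness, the (oid, parent) hash key Martin describes might look roughly like this; the struct and field names are guesses for illustration, not taken from the actual patch:

typedef struct PgStat_FunctionCallKey
{
	Oid			f_id;			/* called function */
	Oid			parent_id;		/* immediate caller; InvalidOid at top level */
} PgStat_FunctionCallKey;

Keyed this way, the same function accumulates separate counters per caller, which is what makes a per-parent view (and the aggregated global call graph) derivable afterwards.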
Re: [HACKERS] BGWriter latch, power saving
On 17.01.2012 12:16, Heikki Linnakangas wrote: On 04.01.2012 17:05, Peter Geoghegan wrote: On 4 January 2012 07:24, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: I think SetBufferCommitInfoNeedsSave() needs the same treatment as MarkBufferDirty(). And it would probably be good to only set the latch if the buffer wasn't dirty already. Setting a latch that's already set is fast, but surely it's even faster to not even try. That seems reasonable. Revised patch is attached. Thanks! It occurs to me that it's still a bad idea to call SetLatch() while holding the buffer header spinlock. At least when it's trivial to just move the call after releasing the lock. See attached. Could you do the sleeping/hibernating logic all in BgWriterNap()? (sorry, forgot to update the above question before sending..) In the patch I sent, I did rearrange the sleeping logic. I think it's more readable the way it is now. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] review: psql tab completion for GRANT role
Hello

I did a review of this small patch: http://archives.postgresql.org/message-id/1326055692.15293.15.ca...@vanquo.pezone.net

* The patch applied with only one fuzzy hunk and compiled without warnings:

bash-4.2$ patch -p1 < autocompleta.patch
patching file src/bin/psql/tab-complete.c
Hunk #2 succeeded at 2335 with fuzz 1.

* all regression tests passed
* this patch respects the Pg coding standards
* the code is simple and has no impact on core functionality or core performance
* regression tests are not required for an interactive feature
* I checked it, and the autocompletion behaves as expected

It is ready for commit.

Regards

Pavel Stehule

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.3 feature proposal: vacuumdb -j #
Am 13.01.2012 22:50, schrieb Josh Berkus:

Hackers, It occurs to me that I would find it quite personally useful if the vacuumdb utility was multiprocess capable. For example, just today I needed to manually analyze a database with over 500 tables, on a server with 24 cores. And I needed to know when the analyze was done, because it was part of a downtime. I had to resort to a python script. I'm picturing doing this in the simplest way possible: get the list of tables and indexes, divide them by the number of processes, and give each child process its own list. Any reason not to hack on this for 9.3?

Hello,

I like the idea - but I would prefer an option that lets the user say how many cores it should be spread across. Something like --share-cores=N. The default would be the machine's total number of cores, but users should be able to say: ok, my machine has 24 cores, but I want vacuumdb to use just 12 of them.

Especially at startups you can find machines that aren't database-only machines. Often database and web server share a single machine. You could also have several clusters running on the same machine, offering your business in different languages (one cluster per language); I have already seen such a setup. There might be other business running on the cores, so it should be possible for the user to decide how many cores to share with vacuumdb.

Susanne

-- Dipl. Inf. Susanne Ebrecht - 2ndQuadrant PostgreSQL Development, 24x7 Support, Training and Services www.2ndQuadrant.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.3 feature proposal: vacuumdb -j #
On Tuesday, January 17, 2012 01:18:53 PM Susanne Ebrecht wrote: I would prefer to have an option that the user is able to tell on how much cores it should be shared. Something like --share-cores=N. Uhm. -j # does exactly that or am I missing your point? Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.3 feature proposal: vacuumdb -j #
On Tue, Jan 17, 2012 at 7:23 AM, Andres Freund and...@anarazel.de wrote: On Tuesday, January 17, 2012 01:18:53 PM Susanne Ebrecht wrote: I would prefer to have an option that the user is able to tell on how much cores it should be shared. Something like --share-cores=N. Uhm. -j # does exactly that or am I missing your point? not really. if you have 12 cores and you say -j 12 you would have 1 process per core, with Susanne's suggestion, AFAIUI, you can say -j 12 --shared-cores=6... so you would only use 6 cores of the 12 and have 2 processes per core -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte 24x7 y capacitación -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] BGWriter latch, power saving
On 17 January 2012 11:24, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: In the patch I sent, I did rearrange the sleeping logic. I think it's more readable the way it is now. I have no objection to either your refinement of the sleeping logic, nor that you moved some things in both the existing code and my patch so that they occur when no spinlock is held. Should I proceed with a benchmark on V3, so that we can get this committed? I imagine that a long pgbench-tools run is appropriate, (after all, it was used to justify the re-write of the BGWriter for 8.3) at various scale factors, from smallish to quite large. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch for parameterized inner paths
Tom Lane t...@sss.pgh.pa.us writes: Anyway, I'm hoping to keep hacking at this for a couple more days before the CF gets into full swing, and hopefully arrive at something committable for 9.2. I'd really like to get this fixed in this cycle, since it's been a problem for several releases now. Comments? Go Tom go ! :) Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.3 feature proposal: vacuumdb -j #
On 01/17/2012 07:33 AM, Jaime Casanova wrote: On Tue, Jan 17, 2012 at 7:23 AM, Andres Freundand...@anarazel.de wrote: On Tuesday, January 17, 2012 01:18:53 PM Susanne Ebrecht wrote: I would prefer to have an option that the user is able to tell on how much cores it should be shared. Something like --share-cores=N. Uhm. -j # does exactly that or am I missing your point? not really. if you have 12 cores and you say -j 12 you would have 1 process per core, with Susanne's suggestion, AFAIUI, you can say -j 12 --shared-cores=6... so you would only use 6 cores of the 12 and have 2 processes per core That looks messy. IMNSHO it should work just like pg_restore's -j. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.3 feature proposal: vacuumdb -j #
On Tuesday, January 17, 2012 01:33:06 PM Jaime Casanova wrote:

On Tue, Jan 17, 2012 at 7:23 AM, Andres Freund and...@anarazel.de wrote: On Tuesday, January 17, 2012 01:18:53 PM Susanne Ebrecht wrote: I would prefer to have an option that the user is able to tell on how much cores it should be shared. Something like --share-cores=N.

Uhm. -j # does exactly that, or am I missing your point?

not really. if you have 12 cores and you say -j 12 you would have 1 process per core. with Susanne's suggestion, AFAIUI, you can say -j 12 --shared-cores=6... so you would only use 6 cores of the 12 and have 2 processes per core

I don't really get what that should do. If vacuumdb itself is a bottleneck here in any form, we did something *very* wrong. (In my opinion, using processes for this is pointless anyway; async queries seem much easier for this special case, especially for distributing individual commands.) I also don't see how you could enforce sharing cores on the server side (well, there are cpusets, but we're surely not going to introduce usage of those just for vacuumdb).

Andres

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
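As an aside, the async-query approach Andres mentions needs nothing beyond standard libpq calls. The following is a minimal single-process sketch, not from any posted patch; connection setup, building the per-table command list, and error handling are all elided:

#include <stdbool.h>
#include <sys/select.h>
#include <libpq-fe.h>

static void
run_commands_in_parallel(PGconn *conns[], int nconn,
						 const char *cmds[], int ncmd)
{
	bool		active[nconn];	/* command in flight on conns[i]? */
	int			next = 0;		/* next command to hand out */
	int			busy = 0;

	/* prime every connection with its first command */
	for (int i = 0; i < nconn; i++)
	{
		active[i] = (next < ncmd && PQsendQuery(conns[i], cmds[next]));
		if (active[i])
		{
			next++;
			busy++;
		}
	}

	while (busy > 0)
	{
		fd_set		rfds;
		int			maxfd = -1;

		FD_ZERO(&rfds);
		for (int i = 0; i < nconn; i++)
		{
			if (active[i])
			{
				FD_SET(PQsocket(conns[i]), &rfds);
				if (PQsocket(conns[i]) > maxfd)
					maxfd = PQsocket(conns[i]);
			}
		}
		if (select(maxfd + 1, &rfds, NULL, NULL, NULL) < 0)
			continue;			/* EINTR etc. */

		for (int i = 0; i < nconn; i++)
		{
			if (!active[i] || !FD_ISSET(PQsocket(conns[i]), &rfds))
				continue;

			PQconsumeInput(conns[i]);
			while (!PQisBusy(conns[i]))
			{
				PGresult   *res = PQgetResult(conns[i]);

				if (res != NULL)
				{
					PQclear(res);	/* VACUUM just returns a status */
					continue;
				}

				/* NULL result: command done; hand over the next one */
				if (next < ncmd)
					PQsendQuery(conns[i], cmds[next++]);
				else
				{
					active[i] = false;
					busy--;
				}
				break;
			}
		}
	}
}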
Re: [HACKERS] Group commit, revised
On 16 January 2012 08:11, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:

Impressive results. How about uploading the PDF to the community wiki?

Sure. http://wiki.postgresql.org/wiki/Group_commit .

I think it might be simpler if it wasn't the background writer that's responsible for driving the group commit queue, but the backends themselves. When a flush request comes in, you join the queue, and if someone else is already doing the flush, sleep until the driver wakes you up. If no-one is doing the flush yet (ie. the queue is empty), start doing it yourself. You'll need a state variable to keep track of who's driving the queue, but otherwise I think it would be simpler, as there would be no dependency on the WAL writer.

I think this replaces one problem with another. You've now effectively elevated a nominated backend to the status of an auxiliary process - do you intend to have the postmaster look after it, as with any other auxiliary process? I'm not sure that that is a more difficult problem to solve, but I suspect so. At least my proposal can have any one of the backends, both those currently participating in group commit and those yet to, wake up the WAL Writer.

I tend to think of the group commit facility as a bus. Passengers can hop on board at any time, and they take turns on who drives the bus. When the first passenger hops in, there is no driver, so he takes the driver's seat. When the next passenger hops in, he sees that someone is driving the bus already, so he sits down, places a big sign on his forehead stating the bus stop where he wants to get off, and goes to sleep. When the driver has reached his own bus stop, he wakes up all the passengers who wanted to get off at the same stop or any of the earlier stops [1]. He also wakes up the passenger whose bus stop is farthest from the current stop, and gets off the bus. The woken-up passengers who have already reached their stops can immediately get off the bus, and the one who has not notices that no-one is driving the bus anymore, so he takes the driver's seat.

[1] in a real bus, a passenger would not be happy if he's woken up too late and finds himself at the next stop instead of the one where he wanted to go, but for group commit, that is fine.

In this arrangement, you could use the per-process semaphore for the sleep/wakeups, instead of latches. I'm not sure if there's any difference, but semaphores are more tried and tested, at least.

Yes, and I expect that this won't be the last time someone uses a bus analogy in relation to this! The proposed patch is heavily based on sync rep, which I'd have imagined was more tried and tested than any proposed completely alternative implementation, as it is basically a generalisation of exactly the same principle, WAL Writer changes notwithstanding. I would have imagined that that aspect would be particularly approved of.

wal_writer_delay is still needed for controlling how often asynchronous commits are flushed to disk.

That had occurred to me of course, but has anyone ever actually tweaked wal_writer_delay with adjusting the behaviour of asynchronous commits in mind? I'm pretty sure that the answer is no. I have a slight preference for obsoleting it as a consequence of introducing group commit, but I don't think that it matters that much.

Auxiliary processes cannot use group commit. The changes made prevent them from availing of commit_siblings/commit_delay parallelism, because it doesn't exist anymore.

Auxiliary processes have PGPROC entries too. Why can't they participate?
It was deemed to be a poor design decision to effectively create a dependency on the WAL Writer among other auxiliary processes, as to do so would perhaps compromise the way in which the postmaster notices and corrects isolated failures. Maybe I'll revisit that assessment, but I am not convinced that it's worth the very careful analysis of the implications of such an unprecedented dependency, without there being some obvious advantage. If it's a question of their being deprived of commit_siblings-style group commit, well, we know from experience that people didn't tend to touch it a whole lot anyway.

Group commit is sometimes throttled, which seems appropriate - if a backend requests that the WAL Writer flush an LSN deemed too far from the known flushed point, that request is rejected and the backend goes through another path, where XLogWrite() is called.

Hmm, if the backend doing the big flush gets the WALWriteLock before a bunch of group committers, the group committers will have to wait until the big flush is finished, anyway. I presume the idea of the throttling is to avoid the situation where a bunch of small commits need to wait for a huge flush to finish.

Exactly. Of course, you're never going to see that situation with pgbench. I don't have much data to inform exactly what the right trade-off is here, or some generic approximation of it across platforms and hardware - other
Re: [HACKERS] xlog location arithmetic
On 20 December 2011 10:27, Magnus Hagander mag...@hagander.net wrote:

Doing it in numeric should be perfectly fine. The only real reason to pick int8 over it in this context would be performance, but it's not like this is something that's going to be called in really performance-critical paths...

FYI, my group commit patch has a little macro, in the spirit of XLByteAdvance, to get the delta between two LSNs in bytes as a uint64.

-- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
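For anyone curious, a macro along those lines, written against the 9.2-era two-part XLogRecPtr, could plausibly look like the following. This is a guess modeled on XLByteAdvance and XLogFileSize from xlog_internal.h, not the actual text of the patch:

/*
 * Byte distance between two WAL locations, assuming end >= start.
 * XLogFileSize is the number of usable bytes per xlogid
 * (XLogSegsPerFile * XLogSegSize in xlog_internal.h).
 */
#define XLByteDelta(end, start) \
	(((uint64) ((end).xlogid - (start).xlogid)) * XLogFileSize + \
	 (end).xrecoff - (start).xrecoff)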
[HACKERS] psql \timing vs failed statements
Right now, psql's \timing only gives output for successful queries. Is there any actual reason for this, or is it just an "it happened"? In particular, I just had a very long run of a CREATE UNIQUE INDEX fail pretty far in - and I would've really liked to have timing output for that one even though it failed. Thus - if I were to change psql to output timing on failed queries as well, will anybody object? ;)

-- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logging in high performance systems.
On Thu, Nov 24, 2011 at 04:28, Theo Schlossnagle je...@omniti.com wrote:

So... here's my first whack at solving this with some flexibility. The first thing I did was add hook points where immediate statement logging happens (pre_exec), and ones that present duration (post_exec). These should, with optimization turned on, have only a few instructions of impact when no hooks are registered (we could hoist the branch outside the function call if that were identified as an issue).

Note that the hook mechanism you've built is a departure from how other hooks are managed in Postgres. Traditionally hooks are just global function pointers, and each consumer is responsible for storing the previous value of the hook and chain-calling it in the handler. If you want to change this pattern, I think you should start another discussion. E.g. hook registration looks like:

next_object_access_hook = object_access_hook;
object_access_hook = sepgsql_object_access;

static void
sepgsql_object_access(ObjectAccessType access, Oid classId,
					  Oid objectId, int subId)
{
	if (next_object_access_hook)
		(*next_object_access_hook) (access, classId, objectId, subId);

Regards, Marti

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [v9.2] sepgsql's DROP Permission checks
2012/1/17 Robert Haas robertmh...@gmail.com:

On Tue, Jan 10, 2012 at 7:51 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote: The attached patch adds an OAT_DROP object-access-hook around the permission checks of object deletion. Due to the previous drop-statement reworks, the number of places to put this hook is limited to these six points: RemoveObjects, RemoveRelations, ATExecDropColumn, dropdb, DropTableSpace and shdepDropOwned(). On the sepgsql side, it checks the {drop} permission for each object type, and the {remove_name} permission on the schema owning the object being removed.

I'm still considering whether the drop permission should be applied to objects removed in cascade mode. It is not difficult to implement: we can determine the behavior on object deletion in the same manner as for creation, by saving contextual information using the ProcessUtility hook. At the moment, the proposed patch does not apply checks on cascaded deletion, in line with SQL behavior. However, my concern is that a user then automatically has the right to remove all the objects under a schema he is allowed to remove, even if the individual tables or functions could not be removed by themselves. So my preference is for sepgsql to consult the dependency tables and check the {drop} permission of all involved objects, not only the target object.

Hmm. I kind of wonder if we shouldn't just have the OAT_DROP hook get invoked for every actual drop, rather than only for the top-level object. It's somewhat appealing to have the hook more-or-less match up with the permissions checks, as you have it here, but in general it seems more useful to have a callback for each thing dropped than to have a callback for each thing named to be dropped. What is your opinion?

I think that is the more ideal design. If we tracked object deletion only at the top level, we would have no way to implement security features based on the object-access-hook even when the security model calls for checking all object deletions. In addition, I believe the hook should also be usable to clean up things set up by external modules, not only for permission checks.

Shall I modify the patch to place the object-access-hook in deleteOneObject (probably the best position to track actual deletions)? One problem is the case of column deletion by ALTER TABLE: it just sets the attisdropped flag without removing the catalog entry. Should we put this hook in ATExecDropColumn as an exception?

Thanks, -- KaiGai Kohei kai...@kaigai.gr.jp

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
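For concreteness, a hedged sketch of what hooking deleteOneObject might look like; the exact placement and arguments here are an assumption drawn from this discussion, not taken from the patch itself:

/* in catalog/dependency.c */
static void
deleteOneObject(const ObjectAddress *object, Relation depRel)
{
	/*
	 * Let loadable modules know this object is really going away, whether
	 * it was named directly or reached via cascade. OAT_DROP is the access
	 * type proposed in this thread.
	 */
	InvokeObjectAccessHook(OAT_DROP,
						   object->classId,
						   object->objectId,
						   object->objectSubId);

	/* ... existing catalog and dependency cleanup continues here ... */
}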
Re: [HACKERS] WIP patch for parameterized inner paths
On 01/17/2012 12:06 AM, Tom Lane wrote:

Well, since I see other committers sending in patches the day after the nominal commitfest deadline, I don't feel too bad about being a bit late as well.

To clarify the fairness standard here: submitting a patch before the CommitFest deadline, then adding it to the app, means that we will try very hard to find a reviewer for the submission during that CF. It's setting a worst-case bound on how long someone who contributes will have to wait for feedback. That delay, how long it would take before someone saw community feedback after they sent in a patch, used to be far less predictable.

Something like this, sent just after the deadline, won't be assigned a reviewer by the CommitFest manager until the next CF. That doesn't mean it won't be reviewed anyway. Also, submissions that fix a regression, like this one, can easily end up on a fast track unrelated to the normal schedule.

-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] automating CF submissions (was xlog location arithmetic)
On mån, 2012-01-16 at 17:25 -0500, Greg Smith wrote:

The most reasonable answer to this is for people to publish a git repo URL in addition to the official submission of changes to the list in patch form.

Note that the original complaint was that for the occasional reviewer, the current system takes at least 5 partially redundant steps in two different systems. I doubt that adding a third system and more partially redundant steps is going to help that. I don't have anything against the general idea, but it won't address the original point.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] automating CF submissions (was xlog location arithmetic)
On 01/16/2012 05:40 PM, Alvaro Herrera wrote: It's expected that we'll get a more reasonable interface to attachments, one that will allow you to download patches separately. (Currently, attachments that have mime types other than text/plain are already downloadable separately). Are you really sure about that? My recent JSON patch is at http://archives.postgresql.org/message-id/4f12f9e5.3090...@dunslane.net. I don't see any download link for the patch there, yet my mailer set the attachment type to text/x-patch, not text/plain. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logging in high performance systems.
Excerpts from Marti Raudsepp's message of mar ene 17 12:12:50 -0300 2012:

On Thu, Nov 24, 2011 at 04:28, Theo Schlossnagle je...@omniti.com wrote: So... here's my first whack at solving this with some flexibility. The first thing I did was add hook points where immediate statement logging happens (pre_exec), and ones that present duration (post_exec). These should, with optimization turned on, have only a few instructions of impact when no hooks are registered (we could hoist the branch outside the function call if that were identified as an issue).

Note that the hook mechanism you've built is a departure from how other hooks are managed in Postgres. Traditionally hooks are just global function pointers, and each consumer is responsible for storing the previous value of the hook and chain-calling it in the handler. If you want to change this pattern, I think you should start another discussion.

Hm. We already have places doing the other thing, for example see XactCallback and ExprContextCallback. Not sure we have an actual criterion for deciding when to use which.

-- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] how to create a non-inherited CHECK constraint in CREATE TABLE
It appears that the only way to create a non-inherited CHECK constraint is using ALTER TABLE. Is there no support in CREATE TABLE planned? That looks a bit odd. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?
On mån, 2012-01-16 at 14:46 -0500, Robert Haas wrote:

On Mon, Jan 16, 2012 at 2:06 PM, Peter Eisentraut pete...@gmx.net wrote: On mån, 2012-01-16 at 11:17 -0500, Robert Haas wrote: I don't see how setting indisvalid to false helps with this, because IIUC when a session sees indisvalid = false, it is supposed to avoid using the index for queries but still make new index entries when a write operation happens - but to drop an index, I think you'd need to get into a state where no one was using the index for anything at all.

ISTM that one would need to set indisready to false instead.

Maybe we should set both to false?

Well, ready = false and valid = true doesn't make any sense. There is only just-created -> ready -> valid. We might as well convert that to a single char column, as you had indicated in your earlier email. But that's independent of the proposed patch.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] how to create a non-inherited CHECK constraint in CREATE TABLE
Excerpts from Peter Eisentraut's message of mar ene 17 13:59:57 -0300 2012: It appears that the only way to create a non-inherited CHECK constraint is using ALTER TABLE. Is there no support in CREATE TABLE planned? That looks a bit odd. There are no plans to do that AFAIR, though maybe you could convince Nikhil to write the patch to do so. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] automating CF submissions (was xlog location arithmetic)
On 17/01/2012 17:50, Alvaro Herrera wrote: Excerpts from Matteo Beccati's message of mar ene 17 12:33:27 -0300 2012: My proof of concept archive for the hackers ML site is still online, in case anyone has trouble downloading the patches or just wants to have the full thread handy. I was going to ping you about this, because I tried it when I wrote this message and it got stuck waiting for response. Hmm, works for me, e.g. the recently cited message: http://archives.postgresql.org/message-id/4f12f9e5.3090...@dunslane.net Now that we've migrated the website, it's time to get back to our conversations about migrating archives to your stuff too. How confident with Django are you? I've never written a line of Python in my life, so someone else should work on porting the web part, I'm afraid... Cheers -- Matteo Beccati Development Consulting - http://www.beccati.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] automating CF submissions (was xlog location arithmetic)
On 17/01/2012 18:10, Matteo Beccati wrote: On 17/01/2012 17:50, Alvaro Herrera wrote: Excerpts from Matteo Beccati's message of mar ene 17 12:33:27 -0300 2012: My proof of concept archive for the hackers ML site is still online, in case anyone has trouble downloading the patches or just wants to have the full thread handy. I was going to ping you about this, because I tried it when I wrote this message and it got stuck waiting for response. Hmm, works for me, e.g. the recently cited message: http://archives.postgresql.org/message-id/4f12f9e5.3090...@dunslane.net Erm... I meant http://archives.beccati.org/message-id/4f12f9e5.3090...@dunslane.net which redirects to: http://archives.beccati.org/pgsql-hackers/message/305925 for me. Cheers -- Matteo Beccati Development Consulting - http://www.beccati.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] automating CF submissions (was xlog location arithmetic)
Andrew Dunstan and...@dunslane.net writes: On 01/16/2012 05:40 PM, Alvaro Herrera wrote: It's expected that we'll get a more reasonable interface to attachments, one that will allow you to download patches separately. (Currently, attachments that have mime types other than text/plain are already downloadable separately). Are you really sure about that? My recent JSON patch is at http://archives.postgresql.org/message-id/4f12f9e5.3090...@dunslane.net. I don't see any download link for the patch there, yet my mailer set the attachment type to text/x-patch, not text/plain. Yeah, AFAICT the archives treat text/x-patch the same as text/plain. I tend to send stuff that way if I mean it primarily to be read in the email. If I'm thinking people will download and apply it, it's better to gzip the patch and pick a mime type appropriate to that, because that makes it much easier to pull the patch off the archives at need, at the cost that you can't just eyeball it in your mail reader. Anyway, I agree with the general tenor of this thread that it'd be nice to reduce the impedance mismatches a bit. Don't have any great ideas about specific ways to do that. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Patch review for logging hooks (CF 2012-01)
Hi, Here's my review for the logging hooks patch queued for the 2012-01 CommitFest by Martin Pihlak.

Submission review

The patch is in context diff format and applies fine. Tests are not included and don't seem practical for this patch. More documentation would always be nice, but as other hooks aren't documented either, it seems that's acceptable. The function prototype and ErrorData structure are largely self-documenting.

Usability review

Do we want that? Do we already have it? The patch aims to allow custom processing of log messages. There are two slightly overlapping features already in Postgres: csvlog and syslog. Both seem to have their drawbacks; csvlog is written to disk first, and has to be processed in the background in batches. Syslog is pretty universal, but the output format is more limited, and details of a single report are split up into several lines. Also, passing extra data for parsing via log_line_prefix seems hacky and failure prone. The proposal also allows more flexible filtering/instrumentation of log messages, which is not available with current methods.

Does the patch actually implement that? The hooked EmitErrorReport function is responsible for calling other log handlers, so all relevant logging and context information is available.

Does it follow SQL spec, or the community-agreed behavior? So far, a few people have stated that this sort of a logging hook is desirable. Nobody has opposed it.

Feature test

I tested the hook using Martin's own pg_logforward extension: https://github.com/mpihlak/pg_logforward I verified that the hook works for messages with DEBUG, LOG, NOTICE, ERROR and FATAL levels, from the backend as well as postmaster.

Are there corner cases the author has failed to consider? Whether the hook is called depends on both the 'client_min_messages' and 'log_min_messages' settings because of this optimization in errstart:

    /* Skip processing effort if non-error message will not be output */
    if (elevel < ERROR && !output_to_server && !output_to_client)
        return false;

This will certainly be surprising to users. I think making it depend *only* on output_to_server (and thus log_min_messages) would be more predictable.

Coding review

Does it follow the project coding guidelines? There's a minor whitespace problem. When declaring variables, and the data type is longer than 12 characters, just use 1 single space character to delimit the variable name, instead of tab:

    extern emit_log_hook_type emit_log_hook;

Will it work on Windows/BSD etc? I see that other hooks are declared with PGDLLIMPORT. I presume that this is necessary on Windows:

    extern PGDLLIMPORT emit_log_hook_type emit_log_hook;

Are the comments sufficient and accurate? I think the hook warrants a comment that, whether the messages will be seen, depends on the log_min_messages setting.

PS: This is my first patch review ever, any feedback would be welcome. Regards, Marti -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Group commit, revised
On 17.01.2012 16:35, Peter Geoghegan wrote: On 16 January 2012 08:11, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: I think it might be simpler if it wasn't the background writer that's responsible for driving the group commit queue, but the backends themselves. When a flush request comes in, you join the queue, and if someone else is already doing the flush, sleep until the driver wakes you up. If no-one is doing the flush yet (ie. the queue is empty), start doing it yourself. You'll need a state variable to keep track who's driving the queue, but otherwise I think it would be simpler as there would be no dependency on WAL writer. I think this replaces one problem with another. You've now effectively elevated a nominated backend to the status of an auxiliary process - do you intend to have the postmaster look after it, as with any other auxiliary process? The GroupCommitWaitForLSN() call happens within a critical section. If the process dies, you're screwed no matter what. It's not very different from the current situation where if one backend flushes the WAL, another backend will notice that once it gets the WALWriteLock, and returns quickly. wal_writer_delay is still needed for controlling how often asynchronous commits are flushed to disk. That had occurred to me of course, but has anyone ever actually tweaked wal_writer_delay with adjusting the behaviour of asynchronous commits in mind? I found it very helpful to reduce wal_writer_delay in pgbench tests, when running with synchronous_commit=off. The reason is that hint bits don't get set until the commit record is flushed to disk, so making the flushes more frequent reduces the contention on the clog. However, Simon made async commits nudge WAL writer if the WAL page fills up, so I'm not sure how relevant that experience is anymore. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
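For concreteness, the backend-driven scheme sketched above could look something like this (pure illustration, not the patch under review: GroupCommitLock, join_flush_queue(), wait_for_flush_driver() and wake_flush_waiters() are hypothetical names, with a single state variable tracking who is driving):

static bool group_flush_in_progress = false;	/* who's driving the queue */

static void
group_commit_flush(XLogRecPtr lsn)
{
	LWLockAcquire(GroupCommitLock, LW_EXCLUSIVE);
	if (group_flush_in_progress)
	{
		/* someone else is driving: queue up and sleep until woken */
		join_flush_queue(lsn);
		LWLockRelease(GroupCommitLock);
		wait_for_flush_driver();
		return;
	}
	/* queue is empty: become the driver ourselves */
	group_flush_in_progress = true;
	LWLockRelease(GroupCommitLock);

	XLogFlush(lsn);				/* covers our record plus any queued behind us */

	LWLockAcquire(GroupCommitLock, LW_EXCLUSIVE);
	group_flush_in_progress = false;
	wake_flush_waiters(lsn);	/* wake everyone whose LSN is now on disk */
	LWLockRelease(GroupCommitLock);
}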
Re: [HACKERS] Patch: add timing of buffer I/O requests
Attached is the pg_test_timing utility portion of this submission, broken out into its own patch. It's a contrib module modeled on pg_test_fsync. The documentation is still a bit rough, I'm not done with that yet. I have included an example of good timing results, switching to a bad clock source, and the resulting bad results. Code review found some formatting things to nitpick I've already fixed: non-standard brace locations and not including enough spaces in expressions were the main two. This is now referenced by the existing cryptic documentation comment around EXPLAIN ANALYZE, which says that overhead can be high because gettimeofday is slow on some systems. Since this utility measures that directly, I think it's a clear win to include it just for that purpose. The fact that there are more places coming where timing overhead matters is also true. But this existing one is already bad enough to justify shipping something to help measure/manage it in my mind. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

diff --git a/contrib/Makefile b/contrib/Makefile
index 0c238aa..45b601c 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -35,6 +35,7 @@ SUBDIRS = \
 		pg_standby	\
 		pg_stat_statements \
 		pg_test_fsync	\
+		pg_test_timing	\
 		pg_trgm		\
 		pg_upgrade	\
 		pg_upgrade_support \
diff --git a/contrib/pg_test_timing/Makefile b/contrib/pg_test_timing/Makefile
new file mode 100644
index 0000000..b8b266a
--- /dev/null
+++ b/contrib/pg_test_timing/Makefile
@@ -0,0 +1,18 @@
+# contrib/pg_test_timing/Makefile
+
+PGFILEDESC = "pg_test_timing - test timing overhead"
+PGAPPICON = win32
+
+PROGRAM = pg_test_timing
+OBJS = pg_test_timing.o
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_test_timing
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_test_timing/pg_test_timing.c b/contrib/pg_test_timing/pg_test_timing.c
new file mode 100644
index 0000000..bcf3c3a
--- /dev/null
+++ b/contrib/pg_test_timing/pg_test_timing.c
@@ -0,0 +1,157 @@
+/*
+ *	pg_test_timing.c
+ *		tests overhead of timing calls and their monotonicity: that
+ *		they always move forward
+ */
+
+#include "postgres_fe.h"
+
+#include "getopt_long.h"
+#include "portability/instr_time.h"
+
+static const char *progname;
+
+static int32 test_duration = 3;
+
+static void handle_args(int argc, char *argv[]);
+static void test_timing(int32);
+
+int
+main(int argc, char *argv[])
+{
+	progname = get_progname(argv[0]);
+
+	handle_args(argc, argv);
+
+	test_timing(test_duration);
+
+	return 0;
+}
+
+static void
+handle_args(int argc, char *argv[])
+{
+	static struct option long_options[] = {
+		{"duration", required_argument, NULL, 'd'},
+		{NULL, 0, NULL, 0}
+	};
+	int			option;			/* Command line option */
+	int			optindex = 0;	/* used by getopt_long */
+
+	if (argc > 1)
+	{
+		if (strcmp(argv[1], "--help") == 0 || strcmp(argv[1], "-h") == 0 ||
+			strcmp(argv[1], "-?") == 0)
+		{
+			printf("Usage: %s [-d DURATION]\n", progname);
+			exit(0);
+		}
+		if (strcmp(argv[1], "--version") == 0 || strcmp(argv[1], "-V") == 0)
+		{
+			puts("pg_test_timing (PostgreSQL) " PG_VERSION);
+			exit(0);
+		}
+	}
+
+	while ((option = getopt_long(argc, argv, "d:",
+								 long_options, &optindex)) != -1)
+	{
+		switch (option)
+		{
+			case 'd':
+				test_duration = atoi(optarg);
+				break;
+
+			default:
+				fprintf(stderr, "Try \"%s --help\" for more information.\n",
+						progname);
+				exit(1);
+				break;
+		}
+	}
+
+	if (argc > optind)
+	{
+		fprintf(stderr,
+				"%s: too many command-line arguments (first is \"%s\")\n",
+				progname, argv[optind]);
+		fprintf(stderr, "Try \"%s --help\" for more information.\n",
+				progname);
+		exit(1);
+	}
+
+	if (test_duration > 0)
+	{
+		printf("Testing timing overhead for %d seconds.\n", test_duration);
+	}
+	else
+	{
+		printf("Testing timing was interrupted.\n");
+	}
+}
+
+static void
+test_timing(int32 duration)
+{
+	uint64		total_time;
+	int64		time_elapsed = 0;
+	uint64		loop_count = 0;
+	uint64		prev, cur;
+	int32		diff, i, bits, found;
+
+	instr_time	start_time, end_time, temp;
+
+	static int64 histogram[32];
+
+	total_time = duration > 0 ? duration * 1000000 : 0;
+
+	INSTR_TIME_SET_CURRENT(start_time);
+	cur = INSTR_TIME_GET_MICROSEC(start_time);
+
+	while (time_elapsed < total_time)
+	{
+		prev = cur;
+		INSTR_TIME_SET_CURRENT(temp);
+		cur = INSTR_TIME_GET_MICROSEC(temp);
+		diff = cur - prev;
+
+		if (diff < 0)
+		{
+			printf("Detected clock going backwards in time.\n");
+			printf("Time warp: %d microseconds\n", diff);
+			exit(1);
+		}
+
+		bits = 0;
+		while (diff)
+		{
+			diff >>= 1;
+			bits++;
+		}
+		histogram[bits]++;
+
+		loop_count++;
+		INSTR_TIME_SUBTRACT(temp, start_time);
+		time_elapsed = INSTR_TIME_GET_MICROSEC(temp);
+	}
+
+	INSTR_TIME_SET_CURRENT(end_time);
+
Re: [HACKERS] Group commit, revised
On 17 January 2012 17:37, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: I found it very helpful to reduce wal_writer_delay in pgbench tests, when running with synchronous_commit=off. The reason is that hint bits don't get set until the commit record is flushed to disk, so making the flushes more frequent reduces the contention on the clog. However, Simon made async commits nudge WAL writer if the WAL page fills up, so I'm not sure how relevant that experience is anymore. It's quite possible that the WAL Writer will spin continuously, if given enough work to do, with this patch. Although it's hard to tell from the graph I sent, there is a modest improvement in TPS for even a single client. See the tables in the PDF. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] automating CF submissions (was xlog location arithmetic)
On 01/17/2012 11:50 AM, Peter Eisentraut wrote: On mån, 2012-01-16 at 17:25 -0500, Greg Smith wrote: The most reasonable answer to this is for people to publish a git repo URL in addition to the official submission of changes to the list in patch form. Note that the original complaint was that for the occasional reviewer, the current system takes at least 5 partially redundant steps in two different systems. I doubt that adding a third system and more partially redundant steps is going to help that. Publishing the submission via git is an extra step for the patch submitter. If that happens, the reviewer can test just by cloning that, instead of first cloning the PostgreSQL one and then applying the patch. It removes the "how do I fish the patch out of the archives?" problem from the reviewer's side of things. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] automating CF submissions (was xlog location arithmetic)
Excerpts from Tom Lane's message of mar ene 17 14:24:05 -0300 2012: Andrew Dunstan and...@dunslane.net writes: On 01/16/2012 05:40 PM, Alvaro Herrera wrote: It's expected that we'll get a more reasonable interface to attachments, one that will allow you to download patches separately. (Currently, attachments that have mime types other than text/plain are already downloadable separately). Are you really sure about that? My recent JSON patch is at http://archives.postgresql.org/message-id/4f12f9e5.3090...@dunslane.net. I don't see any download link for the patch there, yet my mailer set the attachment type to text/x-patch, not text/plain. Yeah, AFAICT the archives treat text/x-patch the same as text/plain. Right, maybe it's text/* or something like that. I tend to send stuff that way if I mean it primarily to be read in the email. If I'm thinking people will download and apply it, it's better to gzip the patch and pick a mime type appropriate to that, because that makes it much easier to pull the patch off the archives at need, at the cost that you can't just eyeball it in your mail reader. Maybe we could find a way to convince Mhonarc to present links to download all mime parts separately, not only those that are undisplayable. Anyway, I agree with the general tenor of this thread that it'd be nice to reduce the impedance mismatches a bit. Don't have any great ideas about specific ways to do that. I'm hopeful that the migration to the Archivopteryx stuff by Matteo will improve things a bit. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] automating CF submissions (was xlog location arithmetic)
Excerpts from Matteo Beccati's message of mar ene 17 12:33:27 -0300 2012: On 16/01/2012 23:40, Alvaro Herrera wrote: Excerpts from Greg Smith's message of lun ene 16 19:25:50 -0300 2012: On 01/16/2012 03:48 PM, Josh Berkus wrote: I'll also point out that the process for *applying* a patch, if you don't subscribe to hackers and keep archives around on your personal machine for months, is also very cumbersome and error-prone. Copy and paste from a web page? Really? The most reasonable answer to this is for people to publish a git repo URL in addition to the official submission of changes to the list in patch form. It's expected that we'll get a more reasonable interface to attachments, one that will allow you to download patches separately. (Currently, attachments that have mime types other than text/plain are already downloadable separately). My proof of concept archive for the hackers ML site is still online, in case anyone has trouble downloading the patches or just wants to have the full thread handy. I was going to ping you about this, because I tried it when I wrote this message and it got stuck waiting for response. Now that we've migrated the website, it's time to get back to our conversations about migrating archives to your stuff too. How confident with Django are you? -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] automating CF submissions (was xlog location arithmetic)
On 16/01/2012 23:40, Alvaro Herrera wrote: Excerpts from Greg Smith's message of lun ene 16 19:25:50 -0300 2012: On 01/16/2012 03:48 PM, Josh Berkus wrote: I'll also point out that the process for *applying* a patch, if you don't subscribe to hackers and keep archives around on your personal machine for months, is also very cumbersome and error-prone. Copy and paste from a web page? Really? The most reasonable answer to this is for people to publish a git repo URL in addition to the official submission of changes to the list in patch form. It's expected that we'll get a more reasonable interface to attachments, one that will allow you to download patches separately. (Currently, attachments that have mime types other than text/plain are already downloadable separately). My proof of concept archive for the hackers ML site is still online, in case anyone has trouble downloading the patches or just wants to have the full thread handy. All you need to do is to swap postgresql.org with beccati.org in the message-id link: http://archives.postgresql.org/message-id/1320343602-sup-2...@alvh.no-ip.org - http://archives.beccati.org/message-id/1320343602-sup-2...@alvh.no-ip.org Cheers -- Matteo Beccati Development Consulting - http://www.beccati.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [WIP] Double-write with Fast Checksums
We have some numbers for 9.2 runs with and without double writes now. We are still using the double-write patch that assumes checksums on data pages, so checksums must be turned on for double writes. The first set of runs are 50-warehouse 2-processor DBT2 60-minute runs, with checkpoints every 5 minutes. Machine memory is 8G, cache size is 5G. Database size is about 9G. The disks are enterprise Fibre Channel disks, so there is good disk write-caching at the array. All runs are for virtual machines. (We expect that the virtual machine numbers would be representative of performance for non-virtual machines, but we know that we need to get non-virtual numbers as well.)

               orig 9.2 |            9.2 + DW patch
               FPW off  |  FPW off   FPW off   FPW on    DW on/FPW off
               CK off   |  CK off    CK on     CK on     CK on
one disk:      15574       15308     15135     13337     13052     [5G shared_buffers, 8G RAM]
sep log disk:  18739       18134     18063     15823     16033

(First row is everything on one disk, second row is where the WAL log is on a separate disk.) So, in this case where cache is large and disks probably have write-caching, we get about the same performance with full_page_writes on and double writes on. We need to run these numbers more to get a good average -- in some runs last night, double writes did better, closer to what we were seeing with 9.0 (score of 17721 instead of 16033). Note that, for one disk, there is no significant difference between the original 9.2 code and the patched code with checksums (and double-writes) turned off. For two disks, there is a bigger difference (3.3%), but I'm not sure that is really significant.

The second set of numbers is for a hard disk with write cache turned off, closer to internal hard disks of servers (people were quite interested in that result). These runs are for 50-warehouse 8-processor DBT2 60-minute runs, with checkpoints every 5 minutes. The RAM size is 8G, and the cache size is 6G.

                     9.2 + DW patch
               FPW off   FPW on    DW on/FPW off
               CK on     CK on     CK on
one disk:      12084     7849      9766      [6G shared_buffers, 8G RAM]

So, here we see a performance advantage for double writes where the cache is large and the disks do not have write-caching. Presumably, the cost of fsyncing the big writes (with full pages) to the WAL log on a slow disk is traded against the fsyncs of the double writes.

The third set of numbers is back to the first hardware setup, but with much smaller shared_buffers. Again, the runs are 50-warehouse 2-processor DBT2 60-minute runs, with checkpoints every 5 minutes. But shared_buffers is set to 1G, so there will be a great many more dirty evictions by the backends.

                     9.2 + DW patch
               FPW off   FPW on    DW on/FPW off
               CK on     CK on     CK on
one disk:      11078     10394     3296      [1G shared_buffers, 8G RAM]
sep log disk:  13605     12015     3412
one disk:      7731      6613      2670      [1G shared_buffers, 2G RAM]
sep log disk:  6752      6129      2722

Here we see that double writes do very badly, because of all the double writes being done for individual blocks by the backends. With the small shared cache, the backends are now writing 3 times as many blocks as the checkpointer. Clearly, the double-write option would have to be completely optional, available for use for database configurations which have a well-sized cache. It would still be preferable that performance didn't have such a cliff when dirty evictions become high, so, with that in mind, I am doing some prototyping of the double-write buffer idea that folks have proposed on this thread. Happy to hear all comments/suggestions.
Thanks, Dan ----- Original Message ----- From: Dan Scales sca...@vmware.com To: Heikki Linnakangas heikki.linnakan...@enterprisedb.com Cc: PG Hackers pgsql-hackers@postgresql.org, jks...@gmail.com, David Fetter da...@fetter.org Sent: Wednesday, January 11, 2012 1:25:21 PM Subject: Re: [HACKERS] [WIP] Double-write with Fast Checksums Thanks for all the comments and suggestions on the double-write patch. We are working on generating performance results for the 9.2 patch, but there is enough difference between 9.0 and 9.2 that it will take some time. One thing in 9.2 that may be causing problems with the current patch is the fact that the checkpointer and bgwriter are separated and can run at the same time (I think), and therefore will contend on the double-write file. Is there any thought that the bgwriter might be paused while the checkpointer is doing a checkpoint, since the checkpointer is doing some of the cleaning that the bgwriter wants to do anyways? The current patch (as mentioned) also may not do well if there are a lot of dirty-page
Re: [HACKERS] Why is CF 2011-11 still listed as In Progress?
On mån, 2012-01-16 at 22:00 -0500, Greg Smith wrote: Adjusting that expectation is another side to pragmatism based on recent history I think needs to be acknowledged, but is unlikely to be improved on. 9.0 shipped on September 20. 9.1 shipped on September 11. If we say the last CF of each release is unlikely to wrap up before early March each year, that's 6 months of settling time between major feature freeze and release. So far that seems to result in stable releases to be proud of, on a predictable enough yearly schedule. Well, has it? I think it took until version 9.1.2 to have a release without major issues that you could consider for production. So do we need 8 or 10 months of settling time? Or should we release earlier, realizing that we won't get proper testing before the final release anyway? I don't know. Another concern is that we are now essentially freezing 9.2 features with at best about four weeks of production experience and feedback from 9.1. I expect that this will also contribute to dragging out the finalization of 9.2 once more. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: tracking temp files in pg_stat_database
On 20.12.2011 19:59, Tomas Vondra wrote: On 20.12.2011 11:20, Magnus Hagander wrote: 2011/12/20 Tomas Vondra t...@fuzzy.cz: I haven't updated the docs yet - let's see if the patch is acceptable at all first. Again, without having reviewed the code, this looks like a feature we'd want, so please add some docs, and then submit it for the next commitfest! I've added the docs (see the attachment) and rebased to current head. Tomas Fixed a failing regression test (check of pg_stat_database structure). Tomas

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index a12a9a2..3635c3f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -691,6 +691,26 @@ postgres: <replaceable>user</> <replaceable>database</> <replaceable>host</> <re
      </row>
 
      <row>
+      <entry><literal><function>pg_stat_get_db_temp_files</function>(<type>oid</type>)</literal></entry>
+      <entry><type>bigint</type></entry>
+      <entry>
+       Number of temporary files written for the database. All temporary files are
+       counted, regardless of why the temporary file was created (sorting or hash
+       join) or file size (log_temp_files does not affect this).
+      </entry>
+     </row>
+
+     <row>
+      <entry><literal><function>pg_stat_get_db_temp_bytes</function>(<type>oid</type>)</literal></entry>
+      <entry><type>bigint</type></entry>
+      <entry>
+       Amount of data written to temporary files for the database. All temporary
+       files are counted, regardless of why the temporary file was created (sorting
+       or hash join) or file size (log_temp_files does not affect this).
+      </entry>
+     </row>
+
+     <row>
       <entry><literal><function>pg_stat_get_numscans</function>(<type>oid</type>)</literal></entry>
       <entry><type>bigint</type></entry>
       <entry>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 50ba20c..3a0dca3 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -574,6 +574,8 @@ CREATE VIEW pg_stat_database AS
             pg_stat_get_db_tuples_updated(D.oid) AS tup_updated,
             pg_stat_get_db_tuples_deleted(D.oid) AS tup_deleted,
             pg_stat_get_db_conflict_all(D.oid) AS conflicts,
+            pg_stat_get_db_temp_files(D.oid) AS temp_files,
+            pg_stat_get_db_temp_bytes(D.oid) AS temp_bytes,
             pg_stat_get_db_stat_reset_time(D.oid) AS stats_reset
     FROM pg_database D;
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 323d42b..6a2ff1d 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -286,7 +286,7 @@ static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
 static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-
+static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
@@ -1339,6 +1339,29 @@ pgstat_report_recovery_conflict(int reason)
 	pgstat_send(&msg, sizeof(msg));
 }
 
+
+/* ----------
+ * pgstat_report_tempfile() -
+ *
+ *	Tell the collector about a temporary file.
+ * ----------
+ */
+void
+pgstat_report_tempfile(size_t filesize)
+{
+	PgStat_MsgTempFile msg;
+
+	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+		return;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
+	msg.m_databaseid = MyDatabaseId;
+	msg.m_filesize = filesize;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+;
+
 /* ----------
  * pgstat_ping() -
  *
@@ -3185,6 +3208,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_recoveryconflict((PgStat_MsgRecoveryConflict *) msg, len);
 					break;
 
+				case PGSTAT_MTYPE_TEMPFILE:
+					pgstat_recv_tempfile((PgStat_MsgTempFile *) msg, len);
+					break;
+
 				default:
 					break;
 			}
@@ -3266,6 +3293,8 @@ pgstat_get_db_entry(Oid databaseid, bool create)
 		result->n_conflict_snapshot = 0;
 		result->n_conflict_bufferpin = 0;
 		result->n_conflict_startup_deadlock = 0;
+		result->n_temp_files = 0;
+		result->n_temp_bytes = 0;
 
 		result->stat_reset_timestamp = GetCurrentTimestamp();
@@ -4177,6 +4206,8 @@ pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
 	dbentry->n_tuples_updated = 0;
 	dbentry->n_tuples_deleted = 0;
 	dbentry->last_autovac_time = 0;
+	dbentry->n_temp_bytes = 0;
+	dbentry->n_temp_files = 0;
 
 	dbentry->stat_reset_timestamp =
Re: [HACKERS] [WIP] Double-write with Fast Checksums
Dan Scales sca...@vmware.com wrote: The second set of numbers is for a hard disk with write cache turned off, closer to internal hard disks of servers (people were quite interested in that result). These runs are for 50-warehouse 8-processor DBT2 60-minute runs, with checkpoints every 5 minutes. The RAM size is 8G, and the cache size is 6G.

                     9.2 + DW patch
               FPW off   FPW on    DW on/FPW off
               CK on     CK on     CK on
one disk:      12084     7849      9766      [6G shared_buffers, 8G RAM]

So, here we see a performance advantage for double writes where the cache is large and the disks do not have write-caching. Presumably, the cost of fsyncing the big writes (with full pages) to the WAL log on a slow disk is traded against the fsyncs of the double writes. I'm very curious about what impact DW would have on big servers with write-back cache that becomes saturated, like in Greg Smith's post here: http://archives.postgresql.org/pgsql-hackers/2012-01/msg00883.php This is a very different approach from what has been tried so far to address that issue, but when I look at the dynamics of that situation, I can't help thinking that DW is the most promising approach for improving that which I've seen suggested so far. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch review for logging hooks (CF 2012-01)
On 01/17/2012 07:35 PM, Marti Raudsepp wrote: Here's my review for the logging hooks patch queued for the 2012-01 CommitFest by Martin Pihlak. Thanks for reviewing! There's a minor whitespace problem. When declaring variables, and the data type is longer than 12 characters, just use 1 single space character to delimit the variable name, instead of tab: Fixed both in .h and .c. I see that other hooks are declared with PGDLLIMPORT. I presume that this is necessary on Windows: extern PGDLLIMPORT emit_log_hook_type emit_log_hook; Indeed, fixed now. I think the hook warrants a comment that, whether the messages will be seen, depends on the log_min_messages setting. Comment added. regards, Martin

*** a/src/backend/utils/error/elog.c
--- b/src/backend/utils/error/elog.c
***************
*** 136,141 ****
--- 136,148 ----
  
  static int	errordata_stack_depth = -1; /* index of topmost active frame */
  
  static int	recursion_depth = 0;	/* to detect actual recursion */
  
+ /*
+  * Hook for intercepting log messages. Note that the hook does not get
+  * called for messages that are suppressed by GUC settings such as
+  * log_min_messages.
+  */
+ emit_log_hook_type emit_log_hook = NULL;
+ 
  /* buffers for formatted timestamps that might be used by both
   * log_line_prefix and csv logs.
   */
***************
*** 1276,1281 **** EmitErrorReport(void)
--- 1283,1291 ----
  	CHECK_STACK_DEPTH();
  	oldcontext = MemoryContextSwitchTo(ErrorContext);
  
+ 	if (emit_log_hook)
+ 		emit_log_hook(edata);
+ 
  	/* Send to server log, if enabled */
  	if (edata->output_to_server)
  		send_message_to_server_log(edata);
*** a/src/include/utils/elog.h
--- b/src/include/utils/elog.h
***************
*** 327,332 **** typedef struct ErrorData
--- 327,334 ----
  	int			saved_errno;	/* errno at entry */
  } ErrorData;
  
+ typedef void (*emit_log_hook_type) (ErrorData *edata);
+ 
  extern void EmitErrorReport(void);
  extern ErrorData *CopyErrorData(void);
  extern void FreeErrorData(ErrorData *edata);
***************
*** 347,352 ****
--- 349,355 ----
  extern int	Log_error_verbosity;
  extern char *Log_line_prefix;
  extern int	Log_destination;
+ extern PGDLLIMPORT emit_log_hook_type emit_log_hook;
  
  /* Log destination bitmap */
  #define LOG_DESTINATION_STDERR	 1

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch review for logging hooks (CF 2012-01)
Hi! On Tue, Jan 17, 2012 at 23:07, Martin Pihlak martin.pih...@gmail.com wrote: I think the hook warrants a comment that, whether the messages will be seen, depends on the log_min_messages setting. Comment added. Nice :) It seems you missed a comment, that the current implementation is also affected by client_min_messages. I think that being affected by client-specific settings is surprising. I would put the if (emit_log_hook) inside the existing if (edata->output_to_server) condition. Unless you have some reason to do it this way? Regards, Marti -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
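In other words, the suggested placement in EmitErrorReport would look along these lines (a sketch of the suggestion, not the committed form):

	/* Send to server log, if enabled */
	if (edata->output_to_server)
	{
		if (emit_log_hook)
			(*emit_log_hook) (edata);

		send_message_to_server_log(edata);
	}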
Re: [HACKERS] Command Triggers
Dimitri Fontaine dimi...@2ndquadrant.fr writes: Robert Haas robertmh...@gmail.com writes: But on the flip side, I think we're generally a bit more flexible about exposing things via C than through the procedural languages. Then as Andres proposed, a new function would be available to get the value, we're not changing the trigger procedure function API in case the language is C… I've been updating my github branch with a patch that provides the parsetree to C coded command trigger functions only, as their 5th argument, of type INTERNAL (so that only C coded procs apply). https://github.com/dimitri/postgres/compare/master...command_triggers I still have some cleaning to do before I prepare the next patch version, such as documentation updating and dealing with rewrites of CHECK and DEFAULT column constraints in CREATE TABLE. I had to add support for the T_A_Const parser node, and now I'm about to see about adding support for the T_A_Expr one, but I can't help wondering how the rewriter could work without them. What is the simple thing I'm missing here, if any? Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] SKIP LOCKED DATA
On 16 January 2012 08:06, Ilya Kosmodemiansky hydrobi...@gmail.com wrote: That is a quite useful feature to implement something like message queues based on the database and so on. Now there is a possibility to work around the lack of such a feature in Postgres using the current advisory lock implementation (pg_try_advisory_xact_lock to determine if somebody has already acquired a lock on a particular row). So I'm not sure this is an urgent matter. Thanks Ilya. I knew about advisory locks but hadn't thought of using them like this. I tried some simple examples using SELECT ... FROM ... WHERE pg_try_advisory_xact_lock(id) [FOR UPDATE] LIMIT 1 in transactions from a couple of different sessions and it achieves the right effect. I could imagine that in theory there might be order of evaluation subtleties in some cases where you have more things in your WHERE clause though. I guess you want the pg_try_advisory_xact_lock to be tested last, after all other conditions are satisfied (ie for minimal lock contention, avoiding false positives). So I guess the question is whether it's worth implementing an explicit feature to match other RDBMSs, complement NOWAIT and avoid theoretical order of evaluation problems, or if this technique is enough. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
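As a worked illustration of the advisory-lock technique from C with libpq (the jobs table, its columns, and claim_one_job() are hypothetical; note the order-of-evaluation caveat above, since the planner is free to evaluate pg_try_advisory_xact_lock() before the other WHERE conditions):

#include <stdio.h>
#include <libpq-fe.h>

/* Claim one unclaimed row; the xact-scoped advisory lock vanishes at commit. */
static void
claim_one_job(PGconn *conn)
{
	PGresult   *res;

	PQclear(PQexec(conn, "BEGIN"));
	res = PQexec(conn,
				 "SELECT id, payload FROM jobs "
				 "WHERE status = 'pending' "
				 "  AND pg_try_advisory_xact_lock(id) "
				 "LIMIT 1 FOR UPDATE");
	if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) == 1)
		printf("claimed job %s\n", PQgetvalue(res, 0, 0));
	PQclear(res);
	PQclear(PQexec(conn, "COMMIT"));	/* releases the advisory lock */
}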
Re: [HACKERS] Command Triggers
Dimitri Fontaine dimi...@2ndquadrant.fr writes: I still have some cleaning to do before to prepare the next patch version, such as documentation updating and dealing with rewrites of CHECK and DEFAULT column constraints in CREATE TABLE. I had to add support for the T_A_Const parser node, and now I'm about to see about adding support for the T_A_Expr one, but I can't help to wonder how the rewriter could work without them. It doesn't, and it shouldn't have to. If those nodes get to the rewriter then somebody forgot to apply parse analysis. What's your test case? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why is CF 2011-11 still listed as In Progress?
On Tue, Jan 17, 2012 at 3:27 PM, Peter Eisentraut pete...@gmx.net wrote: On mån, 2012-01-16 at 22:00 -0500, Greg Smith wrote: Adjusting that expectation is another side to pragmatism based on recent history I think needs to be acknowledged, but is unlikely to be improved on. 9.0 shipped on September 20. 9.1 shipped on September 11. If we say the last CF of each release is unlikely to wrap up before early March each year, that's 6 months of settling time between major feature freeze and release. So far that seems to result in stable releases to be proud of, on a predictable enough yearly schedule. Well, has it? I think, it took until version 9.1.2 to have a release without major issues that you could consider for production. So do we need 8 or 10 months of settling time? Or should we release earlier, realizing that we won't get proper testing before the final release anyway? I don't know. There's definitely some things that we're not going to catch until after final release; after a certain point, settling doesn't help much. But the general pattern for the last few releases is that after the end of the cycle we've done a sweep for open issues and found many of them. Early testing also tends to shake out a bunch of bugs in whatever the new features are. During the 9.0 cycle, it took until June to clear that backlog; during the 9.1 cycle, it took until sometime around June-July. In both cases, a large percentage of those issues were bugs introduced during the final CommitFest, because that's when nearly all the big features hit. If we froze the tree today, we likely wouldn't need more than a month to put out a good beta, but a month from now it'll take four or five months. People know when the end of the cycle is coming; just look at the number of patches in the last CommitFest of any given cycle versus the earlier ones. We've been doing a pretty good job fielding it, but it isn't perfect. Another concern is that we are now essentially freezing 9.2 features with at best about four weeks of production experience and feedback from 9.1. I expect that this will also contribute to dragging out the finalization of 9.2 once more. I don't believe that this really matters all that much; a lot of the things we're fixing have been an issue for years. Even for the patches that are improving on 9.1 features (e.g. recv and apply sync rep modes), I don't think this particular consideration is going to hold things up. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] how to create a non-inherited CHECK constraint in CREATE TABLE
On Jan 17, 2012, at 11:07 AM, Alvaro Herrera wrote: Excerpts from Peter Eisentraut's message of mar ene 17 13:59:57 -0300 2012: It appears that the only way to create a non-inherited CHECK constraint is using ALTER TABLE. Is there no support in CREATE TABLE planned? That looks a bit odd. There are no plans to do that AFAIR, though maybe you could convince Nikhil to write the patch to do so. That certainly doesn't meet the principle of least surprise... CREATE TABLE should support this. -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.3 feature proposal: vacuumdb -j #
On Jan 13, 2012, at 4:15 PM, Christopher Browne wrote: Have two logical tasks: a) A process that manages the list, and b) Child processes doing vacuums. Each time a child completes a table, it asks the parent for another one. There is also a middle ground, because having the scheduling process sounds like a lot more work than what Josh was proposing:

CREATE TEMP SEQUENCE s;
SELECT relname, s mod <number of backends> AS backend_number
FROM ( SELECT relname FROM pg_class ORDER BY relpages );

Of course, having an actual scheduling process is most likely the most efficient. -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
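The dynamic ask-the-parent scheduling in the quoted proposal amounts to workers pulling from a shared cursor; a toy C sketch follows (pure illustration: the threads, tables[] list, and vacuum_one() placeholder stand in for what vacuumdb would do with multiple database connections):

#include <stdio.h>
#include <pthread.h>
#include <stdatomic.h>

#define NTABLES		3
#define NWORKERS	2

static const char *tables[NTABLES] = {"big_tbl", "mid_tbl", "small_tbl"};
static atomic_int next_table;	/* shared work-list cursor */

static void
vacuum_one(const char *relname)
{
	/* placeholder: issue VACUUM over a libpq connection here */
	printf("vacuuming %s\n", relname);
}

/* each worker pulls the next table as soon as it finishes one */
static void *
worker(void *arg)
{
	int			i;

	while ((i = atomic_fetch_add(&next_table, 1)) < NTABLES)
		vacuum_one(tables[i]);
	return NULL;
}

int
main(void)
{
	pthread_t	th[NWORKERS];
	int			w;

	for (w = 0; w < NWORKERS; w++)
		pthread_create(&th[w], NULL, worker, NULL);
	for (w = 0; w < NWORKERS; w++)
		pthread_join(th[w], NULL);
	return 0;
}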
Re: [HACKERS] Group commit, revised
On Jan 15, 2012, at 4:42 PM, Peter Geoghegan wrote: Attached is a patch that myself and Simon Riggs collaborated on. I took the group commit patch that Simon posted to the list back in November, and partially rewrote it. Forgive me if this is a dumb question, but I noticed a few places doing things like:

    if (...)
        Blah();
    else
    {
        ...
    }

Is that allowed PG formatting? I thought that if braces were required on one branch of an if they were supposed to go on both sides. Also, I didn't see any README changes in the patch... perhaps this is big enough to warrant them? -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Group commit, revised
Excerpts from Jim Nasby's message of mar ene 17 21:21:57 -0300 2012: On Jan 15, 2012, at 4:42 PM, Peter Geoghegan wrote: Attached is a patch that myself and Simon Riggs collaborated on. I took the group commit patch that Simon posted to the list back in November, and partially rewrote it. Forgive me if this is a dumb question, but I noticed a few places doing things like: if (...) Blah(); else { ... } Is that allowed PG formatting? I thought that if braces were required on one branch of an if they were supposed to go on both sides. Yeah, we even used to have pg_indent remove braces around single statements, so if you had one statement in the if branch and more around the other one, pg_indent would have left things like that anyway. (This particular pg_indent behavior got removed because it messed up PG_TRY blocks.) -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [WIP] Double-write with Fast Checksums
On 01/17/2012 03:47 PM, Kevin Grittner wrote: I'm very curious about what impact DW would have on big servers with write-back cache that becomes saturated, like in Greg Smith's post here... My guess is that a percentage of the dbt-2 run results posted here are hitting that sort of problem. We just don't know which, because the results numbers posted were all the throughput numbers. I haven't figured out a way to look for cache saturation issues other than collecting all the latency information for each transaction, then graphing that out if the worst-case value is poor. It's quite possible they have that data but just didn't post it, to keep the summary size manageable, since dbt-2 collects a lot of information. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] automating CF submissions (was xlog location arithmetic)
Excerpts from Andrew Dunstan's message of mar ene 17 13:50:20 -0300 2012: On 01/16/2012 05:40 PM, Alvaro Herrera wrote: It's expected that we'll get a more reasonable interface to attachments, one that will allow you to download patches separately. (Currently, attachments that have mime types other than text/plain are already downloadable separately). Are you really sure about that? My recent JSON patch is at http://archives.postgresql.org/message-id/4f12f9e5.3090...@dunslane.net. I don't see any download link for the patch there, yet my mailer set the attachment type to text/x-patch, not text/plain. I tweaked the Mhonarc config and now this attachment (as well as many others) is shown as a downloadable link. Please give it a look. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Group commit, revised
On Tue, Jan 17, 2012 at 12:37 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: I found it very helpful to reduce wal_writer_delay in pgbench tests, when running with synchronous_commit=off. The reason is that hint bits don't get set until the commit record is flushed to disk, so making the flushes more frequent reduces the contention on the clog. However, Simon made async commits nudge WAL writer if the WAL page fills up, so I'm not sure how relevant that experience is anymore. There's still a small but measurable effect there in master. I think we might be able to make it fully auto-tuning, but I don't think we're fully there yet (not sure how much this patch changes that equation). I suggested a design similar to the one you just proposed to Simon when he originally suggested this feature. It seems that if the WAL writer is the only one doing WAL flushes, then there must be some IPC overhead - and context switching - involved whenever WAL is flushed. But clearly we're saving something somewhere else, on the basis of Peter's results, so maybe it's not worth worrying about. It does seem pretty odd to have all the regular backends going through the WAL writer and the auxiliary processes using a different mechanism, though. If we got rid of that, maybe WAL writer wouldn't even require a lock, if there's only one process that can be doing it at a time. What happens in standalone mode? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.3 feature proposal: vacuumdb -j #
On 01/17/2012 07:09 PM, Jim Nasby wrote: On Jan 13, 2012, at 4:15 PM, Christopher Browne wrote: Have two logical tasks: a) A process that manages the list, and b) Child processes doing vacuums. Each time a child completes a table, it asks the parent for another one. There is also a middle ground, because having the scheduling process sounds like a lot more work than what Josh was proposing. CREATE TEMP SEQUENCE s; SELECT relname, s mod <number of backends> AS backend_number FROM ( SELECT relname FROM pg_class ORDER BY relpages ); Of course, having an actual scheduling process is most likely the most efficient. We already have a model for this in parallel pg_restore. It would probably not be terribly hard to adapt to parallel vacuum. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?
On Tue, Jan 17, 2012 at 12:06 PM, Peter Eisentraut pete...@gmx.net wrote: On mån, 2012-01-16 at 14:46 -0500, Robert Haas wrote: On Mon, Jan 16, 2012 at 2:06 PM, Peter Eisentraut pete...@gmx.net wrote: On mån, 2012-01-16 at 11:17 -0500, Robert Haas wrote: I don't see how setting indisvalid to false helps with this, because IIUC when a session sees indisvalid = false, it is supposed to avoid using the index for queries but still make new index entries when a write operation happens - but to drop an index, I think you'd need to get into a state where no one was using the index for anything at all. ISTM that one would need to set indisready to false instead. Maybe we should set both to false? Well, ready = false and valid = true doesn't make any sense. There is only just-created - ready - valid. We might as well convert that to a single char column, as you had indicated in your earlier email. But that's independent of the proposed patch. Sure, but the point is that I think if you want everyone to stop touching the index, you ought to mark it both not-valid and not-ready, which the current patch doesn't do. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] psql \timing vs failed statements
On Tue, Jan 17, 2012 at 10:01 AM, Magnus Hagander mag...@hagander.net wrote: Right now, psql \timing output only gives output for successful queries. Is there any actual reason for this, or just a it happened? In particular,I just had a very long run of a CREATE UNIQUE INDEX fail pretty far in - and I would've really liked to have timing output for that one even though it failed. Thus - if I were to change psql to output timing on failed queries as well, will anybody object? ;) I won't. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch review for logging hooks (CF 2012-01)
On Wed, Jan 18, 2012 at 2:35 AM, Marti Raudsepp ma...@juffo.org wrote: Are there corner cases the author has failed to consider? The hook can be executed by various processes since it's in EmitErrorReport(). OTOH, log messages are written to the log file by one process, like the syslogger (if you use csvlog or stderr) or the syslog process (if you use syslog). So ISTM that there is no guarantee that the order of log messages processed by the hook is the same as that of messages written to the log file. For example, imagine the case where two backends call EmitErrorReport() at the same time. Is this OK? If not, the hook might need to be in the syslogger. EmitErrorReport() can be called before a shared library like Martin's pg_logforward is preloaded, which means that the hook may miss some log messages. Is this OK? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuum rate limit in KBps
On Jan 15, 2012, at 8:13 PM, Greg Smith wrote: On 01/15/2012 04:17 PM, Heikki Linnakangas wrote: I think it makes more sense to use the max read rate as the main knob, rather than write rate. That's because the max read rate is higher than the write rate, when you don't need to dirty pages. Or do you think saturating the I/O system with writes is so much bigger a problem than read I/O that it makes more sense to emphasize the writes? I haven't had the I/O rate logging available for long enough to have a good feel for which is more important to emphasize. I'm agnostic on this. I'd have no problem accepting the argument that exposing the larger of the two rates--which is the read one--makes for a cleaner UI. Or that it is the one more like other knobs setting precedents here. Could we expose both? On our systems writes are extremely cheap... we don't do a ton of them (relatively speaking), so they tend to just fit into BBU cache. Reads on the other hard are a lot more expensive, at least if they end up actually hitting disk. So we actually set page_dirty and page_hit the same. -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] automating CF submissions (was xlog location arithmetic)
Excerpts from Alvaro Herrera's message of mar ene 17 22:23:13 -0300 2012: Excerpts from Andrew Dunstan's message of mar ene 17 13:50:20 -0300 2012: On 01/16/2012 05:40 PM, Alvaro Herrera wrote: It's expected that we'll get a more reasonable interface to attachments, one that will allow you to download patches separately. (Currently, attachments that have mime types other than text/plain are already downloadable separately). Are you really sure about that? My recent JSON patch is at http://archives.postgresql.org/message-id/4f12f9e5.3090...@dunslane.net. I don't see any download link for the patch there, yet my mailer set the attachment type to text/x-patch, not text/plain. I tweaked the Mhonarc config and now this attachment (as well as many others) is shown as a downloadable link. Please give it a look. Hm, I notice it works almost every patch I've checked, except the ones from Tom such as this one: http://archives.postgresql.org/message-id/4643.1326776...@sss.pgh.pa.us The problem is that this one doesn't have the Content-Disposition: attachment line in the MIME header. I don't know what we can do about it. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [v9.2] sepgsql's DROP Permission checks
On Tue, Jan 17, 2012 at 10:55 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote: Should I modify the patch to place the object-access-hook in deleteOneObject() (probably the best position to track actual deletion)? One problem is the case of deletion of columns by ALTER TABLE: it just sets the attisdropped flag without removing the catalog entry. Ought we to put this hook in ATExecDropColumn() as an exception?

+1.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
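A rough sketch of the placement being proposed; OAT_DROP and the exact invocation are assumptions drawn from this thread, not a committed interface:

    /* in deleteOneObject(), just before the catalog entry is removed */
    if (object_access_hook)
        (*object_access_hook) (OAT_DROP,
                               object->classId,
                               object->objectId,
                               object->objectSubId);

ATExecDropColumn() would then need the same call as a special case, since dropped columns only get attisdropped set and never pass through deleteOneObject().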
Re: [v9.2] LEAKPROOF attribute of FUNCTION (Re: [HACKERS] [v9.2] Fix Leaky View Problem)
On Sun, Jan 8, 2012 at 10:52 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote: BTW, can you also resubmit the leakproof stuff as a separate patch for the last CF? Want to make sure we get that into 9.2, if at all possible.

Yes, it shall be attached to the next message. The attached patch adds a LEAKPROOF attribute to pg_proc; that enables the DBA to mark obviously safe functions to be pushed down into a sub-query even if it has the security-barrier attribute. We assume this LEAKPROOF attribute will mainly be applied to operator functions used to upgrade an execution plan from Seq-Scan to Index-Scan. Functions are created without the leakproof attribute by default, and superuser privilege is required to switch it on.

The create_function_3 regression test fails for me with this applied:

*** /Users/rhaas/pgsql/src/test/regress/expected/create_function_3.out 2012-01-17 22:09:01.0 -0500
--- /Users/rhaas/pgsql/src/test/regress/results/create_function_3.out 2012-01-17 22:14:48.0 -0500
***************
*** 158,165 ****
       'functext_E_2'::regproc);
      proname    | proleakproof
  --------------+--------------
-  functext_e_2 | t
   functext_e_1 | t
  (2 rows)

  -- list of built-in leakproof functions
--- 158,165 ----
       'functext_E_2'::regproc);
      proname    | proleakproof
  --------------+--------------
   functext_e_1 | t
+  functext_e_2 | t
  (2 rows)

  -- list of built-in leakproof functions
***************
*** 476,485 ****
       'functext_F_4'::regproc);
      proname    | proisstrict
  --------------+-------------
-  functext_f_1 | f
   functext_f_2 | t
   functext_f_3 | f
   functext_f_4 | t
  (4 rows)

  -- Cleanups
--- 476,485 ----
       'functext_F_4'::regproc);
      proname    | proisstrict
  --------------+-------------
   functext_f_2 | t
   functext_f_3 | f
   functext_f_4 | t
+  functext_f_1 | f
  (4 rows)

  -- Cleanups

The new regression tests I just committed need updating as well. Instead of contains_leakable_functions I suggest contains_leaky_functions or contains_non_leakproof_functions, because leakable isn't really a word (although I know what you mean). The design of this function also doesn't seem very future-proof: if someone adds a new node type that can contain a function call, and forgets to add it here, then we've got a subtle security hole. Is there some reasonable way to design this so that we assume everything's dangerous except for those things we know are safe, rather than the reverse?

I think you also need to do a more careful check of which functions you're marking leakproof - e.g. timestamp_ne_timestamptz isn't, at least according to my understanding of the term.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
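For readers following the thread, a hedged sketch of the attribute in use; the exact grammar assumed here may differ from what finally gets committed, and superuser privilege is required to set the flag:

    CREATE FUNCTION texteq_wrapper(text, text) RETURNS bool
        AS 'SELECT $1 = $2' LANGUAGE sql
        IMMUTABLE STRICT LEAKPROOF;   -- sets pg_proc.proleakproof

    -- a leakproof function may be pushed down past a security-barrier view
    -- because it can raise no error or message that leaks its arguments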
Re: [HACKERS] how to create a non-inherited CHECK constraint in CREATE TABLE
It appears that the only way to create a non-inherited CHECK constraint is using ALTER TABLE. Is there no support in CREATE TABLE planned? That looks a bit odd.

There are no plans to do that AFAIR, though maybe you could convince Nikhil to write the patch to do so.

That certainly doesn't meet the principle of least surprise... CREATE TABLE should support this.

Well, the above was considered during the original discussion, and eventually we felt that CREATE TABLE already has other issues as well, so not having this done as part of creating a table was deemed acceptable then: http://postgresql.1045698.n5.nabble.com/Check-constraints-on-partition-parents-only-tt464.html#a4647144 But let me have a stab at it when I get some free cycles.

Regards, Nikhils
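To make the asymmetry concrete, a small sketch of the ALTER TABLE route that does work today; table and constraint names are illustrative:

    CREATE TABLE parent (f1 int);
    CREATE TABLE child () INHERITS (parent);

    -- attaches the constraint to parent alone; child does not inherit it
    ALTER TABLE ONLY parent ADD CONSTRAINT parent_f1_check CHECK (f1 > 0);

    -- there is no corresponding CREATE TABLE syntax for this yet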
Re: [HACKERS] psql \timing vs failed statements
On Tue, Jan 17, 2012 at 04:01:23PM +0100, Magnus Hagander wrote: Thus - if I were to change psql to output timing on failed queries as well, will anybody object? ;)

+1
Re: [HACKERS] Client Messages
On Fri, Jan 6, 2012 at 1:38 AM, Jim Mlodgenski jimm...@gmail.com wrote: I have a need to send banner messages to a psql client that I can set on the server and that will be displayed by any psql client connecting to the database. This would mostly be used as an additional indicator of which database you are connecting to, but it could also be used to force users to see a security message when connecting to the database. The attached patch will allow you to execute

ALTER DATABASE postgres SET client_message=E'\nBEWARE: You are connecting to a production database. If you do anything to\n bring this server down, you will be destroyed by your supreme overlord.\n\n';

And then when you connect with psql, you will see:

[e3@workstation bin]$ ./psql -U user1 postgres
psql (9.2devel)

BEWARE: You are connecting to a production database. If you do anything to
 bring this server down, you will be destroyed by your supreme overlord.

Type "help" for help.

postgres=

Any feedback is welcome.

Adding a new GUC parameter only for the purpose of warning psql users seems like overkill to me. Basically we try to reduce the number of GUC parameters to make configuration easier for users, so I don't think it's a good idea to add a new GUC for such a small benefit. Instead, how about using the .psqlrc file and writing a warning message in it with the \echo command?

Anyway, I found one problem in the patch. The patch defines client_message as a PGC_USERSET parameter, which means that any psql user can falsify the warning message, e.g., by setting the environment variable PGOPTIONS to -c client_message=hoge. This seems like something to avoid from a security point of view.

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
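A minimal sketch of the .psqlrc alternative Fujii suggests; the message text is illustrative:

    \echo ''
    \echo 'BEWARE: You are connecting to a production database.'
    \echo ''

Since .psqlrc lives on the client side, the message cannot be set centrally per database; that is exactly the trade-off being debated here.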
Re: [HACKERS] Avoiding shutdown checkpoint at failover
On Sun, Nov 13, 2011 at 5:13 PM, Simon Riggs si...@2ndquadrant.com wrote: On Tue, Nov 1, 2011 at 12:11 PM, Simon Riggs si...@2ndquadrant.com wrote: When I say skip the shutdown checkpoint, I mean remove it from the critical path of required actions at the end of recovery. We can still have a normal checkpoint kicked off at that time, but that no longer needs to be on the critical path. Any problems foreseen? If not, looks like a quick patch. Patch attached for discussion/review.

This feature is what I want, and it is very helpful for shortening the failover time in streaming replication. Here are my review comments, though I've not yet checked thoroughly whether this feature works correctly in all recovery patterns.

- LocalSetXLogInsertAllowed() must be called before LogEndOfRecovery(), and LocalXLogInsertAllowed must be set to -1 after LogEndOfRecovery().

- The XLOG_END_OF_RECOVERY record is written to the WAL file with the newly assigned timeline ID, but it must be written to the WAL file with the old one. Otherwise, when re-entering recovery after failover, we cannot find the XLOG_END_OF_RECOVERY record at all. Before the XLOG_END_OF_RECOVERY record is written, RmgrTable[rmid].rm_cleanup() might write WAL records; they also should be written to the WAL file with the old timeline ID.

- When a recovery target is specified, we cannot write new WAL to the file with the old timeline, because that would mean valid WAL records in it get overwritten with new WAL. So when a recovery target is specified, ISTM that we cannot skip the end-of-recovery checkpoint. Alternatively, we might need to save all information about timelines in the database cluster instead of writing the XLOG_END_OF_RECOVERY record, and use it when re-entering recovery.

- LogEndOfRecovery() seems to need to call XLogFlush(). Otherwise, what happens if the server crashes after the new timeline history file is created and recovery.conf is removed, but before the XLOG_END_OF_RECOVERY record has been flushed to disk?

- During recovery, when we replay an XLOG_END_OF_RECOVERY record, we should close the currently open WAL file and read the WAL file with the timeline that the XLOG_END_OF_RECOVERY record indicates. Otherwise, when re-entering recovery with the old timeline, we cannot reach the new timeline.

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
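To make the XLogFlush() point concrete, a hedged sketch of the ordering being asked for; LogEndOfRecovery() and XLOG_END_OF_RECOVERY come from the patch under review and are not a committed API:

    /* write the end-of-recovery record and force it to disk before
     * any externally visible promotion steps happen */
    static void
    LogEndOfRecovery(void)
    {
        XLogRecData rdata;
        XLogRecPtr  recptr;

        rdata.data = NULL;          /* no payload assumed in this sketch */
        rdata.len = 0;
        rdata.buffer = InvalidBuffer;
        rdata.next = NULL;

        recptr = XLogInsert(RM_XLOG_ID, XLOG_END_OF_RECOVERY, &rdata);

        /*
         * Flush before the new timeline history file is created and
         * recovery.conf is removed; a crash between those steps would
         * otherwise leave no end-of-recovery record to find when
         * re-entering recovery.
         */
        XLogFlush(recptr);
    }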
Re: [HACKERS] [v9.2] Fix Leaky View Problem
On Sun, Jan 8, 2012 at 10:32 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote: I guess you're concerned that expected/select_views_1.out is patched, not expected/select_views.out. I'm not sure why the regression test script tries to diff results/select_views against expected/select_views_1.out.

select_views.out and select_views_1.out are alternate expected output files. The regression tests pass if the actual output matches either one. Thus, you have to patch both.

That was new to me. The attached patch updates both of the expected files; however, I'm not certain whether select_views.out is suitable or not, because my results/select_views.out matched expected/select_views_1.out.

Committed. We'll see what the buildfarm thinks.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company