Re: [HACKERS] Re: Request for Patch Feedback: Lag Lead Window Functions Can Ignore Nulls
On 21.07.2013 08:41, Jeff Davis wrote: (For that matter, am I not supposed to commit between 'fests? Or is it still an option for me to finish up with this after I get back even if we close the CF?) It's totally OK to commit stuff between 'fests. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: Request for Patch Feedback: Lag Lead Window Functions Can Ignore Nulls
(For that matter, am I not supposed to commit between 'fests? Or is it still an option for me to finish up with this after I get back even if we close the CF?) The idea of the CommitFests is to give committers some *time off* between them. If a committer wants to commit stuff when it's not a CF, that's totally up to them. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] pg_filedump 9.3: checksums (and a few other fixes)
On Wed, 2013-07-17 at 13:43 -0400, Alvaro Herrera wrote: Tom Lane wrote: My feeling about this code is that the reason we print the infomask in hex is so you can see exactly which bits are set if you care, and that the rest of the line ought to be designed to interpret the bits in as reader-friendly a way as possible. So I don't buy the notion that we should just print out a name for each bit that's set. I'd rather replace individual bit names with items like LOCKED_FOR_KEY_SHARE, LOCKED_FOR_SHARE, etc in cases where you have to combine multiple bits to understand the meaning. Okay, that's what I've been saying all along so I cannot but agree. I haven't reviewed Jeff's patch lately; Jeff, does Tom's suggestion need some more new code, and if so are you open to doing this work, or shall I? At first glance it seems like a pretty trivial change. I'm going on vacation tomorrow and unfortunately I haven't had a chance to look at this. Pgfoundry CVS is down, so I can't see whether it's already been committed or not. Regards, Jeff Davis
[HACKERS] [9.4 CF 1] And then there were 5
... well, 11 actually. We are down to 3 patches needing review, 2 waiting on author, and 6 waiting on commit. Whatever folks can do to close out this commitfest is very welcome! -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] InvokeObjectPostAlterHook() vs. CommandCounterIncrement()
On Jul 21, 2013 4:06 AM, Noah Misch n...@leadboat.com wrote: If these hooks will need to apply to a larger operation, I think that mandates a different means to reliably expose the before/after object states. I haven't checked the code to see how it would fit the API, but what about taking a snapshot before altering the object and passing it to the hook? Would there be other issues besides performance? If the snapshot is taken only when there is a hook present, then the performance can be fixed later. Regards, Ants Aasma
[HACKERS] Wal sync odirect
Hi, list. Here is my proposal: I would like to suggest using O_DIRECT for WAL writes when wal_level is higher than minimal. I think in my case, where WAL traffic reaches up to 1 GB per 2-3 minutes but the disk hardware has a 2 GB BBU cache (or an SSD could sit under the WAL), it would be better if WAL traffic could not disturb OS memory eviction. I also do not use streaming replication; my archive_command could read WAL directly, without the OS cache. This is just an opinion, and I have not done any tests yet, but I am still seeing some memory eviction anomaly.
[HACKERS] Auto explain target tables
Hi, list, again. The next proposal is for auto_explain: it would be helpful to be able to set a list of target tables and indexes. Sometimes it is very hard to detect who is using your indexes, but turning full logging on under thousands of transactions per second does not seem like a good idea because of the size of the resulting log files (CPU utilization might not be as critical).
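For context, what auto_explain can filter on today is statement duration, not object names — a per-table/per-index target list would be a new GUC. The existing knobs look like this (a postgresql.conf excerpt, assuming the contrib module is available):

```
# Load the module and log plans only for statements over 500 ms,
# rather than logging every statement on a busy system.
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '500ms'
auto_explain.log_analyze = off    # 'on' adds timing overhead to every query
```

A duration threshold limits log volume, but it still cannot answer "who is using this index", which is the gap the proposal targets.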
Re: [HACKERS] Preventing tuple-table leakage in plpgsql
On Thu, Jul 11, 2013 at 09:14:38PM -0400, Chad Wagner wrote: It looks like to me when AtEOSubXact_SPI is called the _SPI_current->connectSubId is always 1 (since it is only set when SPI_connect is called, which is only once for plpgsql), but the CurrentSubTransactionId is incremented each time a subtransaction is started. Right. AtEOSubXact_SPI() cleans up any SPI connections originating in the ending subtransaction. It leaves alone connections from higher subtransaction levels; SPI has no general expectation that those have lost relevance. As a result, the memory for procCxt is only freed when I presume the TopTransaction is aborted or committed. In your code from bug #8279, I expect it to be freed when the DO block exits. The backend might not actually shrink then, but repeated calls to a similar DO block within the same transaction should not cause successive increases in the process's memory footprint. Should SPI_connect be called again after the subtransaction is created? And SPI_finish before the subtransaction is committed or aborted? Hmm. An SPI_push()+SPI_connect() every time PL/pgSQL starts a subtransaction would be another way to fix it, yes. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Preventing tuple-table leakage in plpgsql
Noah Misch n...@leadboat.com writes: On Thu, Jul 11, 2013 at 09:14:38PM -0400, Chad Wagner wrote: Should SPI_connect be called again after the subtransaction is created? And SPI_finish before the subtransaction is committed or aborted? Hmm. An SPI_push()+SPI_connect() every time PL/pgSQL starts a subtransaction would be another way to fix it, yes. That sounds like a dangerous idea to me. The procedure would then be working actively with queries from two different SPI levels, which I'm pretty sure would cause issues. It's possible that plpgsql's SPI access is sufficiently lexically-local that statements within the BEGIN block couldn't use any SPI resources created by statements outside it nor vice versa. But then again maybe not, and in any case we couldn't imagine that that would be a workable restriction for non-plpgsql scenarios. regards, tom lane
Re: [HACKERS] Preventing tuple-table leakage in plpgsql
On Fri, Jul 19, 2013 at 07:34:14PM -0400, Tom Lane wrote: Noah Misch n...@leadboat.com writes: On Fri, Jul 05, 2013 at 02:47:06PM -0400, Tom Lane wrote: So I'm inclined to propose that SPI itself should offer some mechanism for cleaning up tuple tables at subtransaction abort. We could just have it automatically throw away tuple tables made in the current subtransaction, or we could allow callers to exercise some control, perhaps by calling a function that says don't reclaim this tuple table automatically. I'm not sure if there's any real use-case for such a call though. I suppose that would be as simple as making spi_dest_startup() put the tuptabcxt under CurTransactionContext? The function to prevent reclamation would then just do a MemoryContextSetParent(). I experimented with this, and found out that it's not quite that simple. In a SPI procedure that hasn't created a subtransaction (eg, a plpgsql function without an exception block), if we attach tuple tables to the outer transaction's CurTransactionContext then they fail to go away during SPI_finish(), creating leaks in code that was previously correct. However, we can use your idea when running inside a subtransaction, while still attaching the tuple table to the procedure's own procCxt when no subtransaction is involved. The attached draft patch does it that way, and successfully resolves the complained-of leakage case. I like this better than what I originally had in mind, because it produces no change in semantics in the case where a SPI procedure doesn't create any subtransactions, which I imagine is at least 99.44% of third-party SPI-using code. Reasonable enough. Code that does use subtransactions will need to be more careful than before to manually free tuple tables in the non-error case. Failure to do so has been creating a leak that lasts until SPI_finish(), but it will now be able to cause a transaction-lifespan leak. 
The patch's changes to remove SPI_freetuptable() calls in plpy_cursorobject.c are not actually necessary for correctness; it's just that we no longer need them. If PLy_spi_subtransaction_commit() were to throw an error (granted, unlikely), would we not reference freed memory at those code sites as they stand today? Unfortunately, the change in pl_exec.c *is* necessary to avoid a coredump, because that call was being made after rolling back the subxact. Brief search for similar patterns in external PLs: plr - no subtransaction use; plv8 - no SPI_freetuptable(); plphp - uses both, but usage looks compatible; pljava - calls SPI_freetuptable(SPI_tuptable), but never a tuptable pointer it stored away. Should be compatible, then. All in all, the risk of breaking anything outside core code seems very small, and I'm inclined to think that we don't need to provide an API function to reparent a tuple table. But having said that, I'm also inclined to not back-patch this further than 9.3, just in case. I wouldn't be confident in back-patching further than that. It's not hard to imagine a non-core PL needing a compensating change like the one you made to PL/pgSQL. Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] InvokeObjectPostAlterHook() vs. CommandCounterIncrement()
On Sun, Jul 21, 2013 at 11:44:51AM +0300, Ants Aasma wrote: On Jul 21, 2013 4:06 AM, Noah Misch n...@leadboat.com wrote: If these hooks will need to apply to a larger operation, I think that mandates a different means to reliably expose the before/after object states. I haven't checked the code to see how it would fit the API, but what about taking a snapshot before altering and passing this to the hook. Would there be other issues besides performance? If the snapshot is taken only when there is a hook present then the performance can be fixed later. That would work. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Preventing tuple-table leakage in plpgsql
Noah Misch n...@leadboat.com writes: On Fri, Jul 19, 2013 at 07:34:14PM -0400, Tom Lane wrote: However, we can use your idea when running inside a subtransaction, while still attaching the tuple table to the procedure's own procCxt when no subtransaction is involved. The attached draft patch does it that way, and successfully resolves the complained-of leakage case. I like this better than what I originally had in mind, because it produces no change in semantics in the case where a SPI procedure doesn't create any subtransactions, which I imagine is at least 99.44% of third-party SPI-using code. Reasonable enough. Code that does use subtransactions will need to be more careful than before to manually free tuple tables in the non-error case. Failure to do so has been creating a leak that lasts until SPI_finish(), but it will now be able to cause a transaction-lifespan leak. Hmm ... good point. The other plan I'd been considering was to add explicit tracking inside spi.c of all tuple tables created within the current procedure, and then have AtEOSubXact_SPI flush any that were created inside a failed subxact. The tables would still be children of the procCxt and thus could not be leaked past SPI_finish. When you suggested attaching to subtransaction contexts I thought that would let us get away without this additional bookkeeping logic, but maybe we should bite the bullet and add the extra logic. A change that's meant to remove leak risks really shouldn't be introducing other, new leak risks. (An additional advantage is we could detect attempts to free the same tuptable more than once, which would be a good thing ...) patch's changes to remove SPI_freetuptable() calls in plpy_cursorobject.c are not actually necessary for correctness, it's just that we no longer need them. If PLy_spi_subtransaction_commit() were to throw an error (granted, unlikely), would we not reference freed memory at those code sites as they stand today? 
Hm, possibly, depending on just when the error was thrown. regards, tom lane
[HACKERS] REINDEX checking of index constraints
Historically, REINDEX would always revalidate any uniqueness enforced by the index. An EDB customer reported that this is not happening, and indeed I broke it way back in commit 8ceb24568054232696dddc1166a8563bc78c900a. Specifically, REINDEX TABLE and REINDEX DATABASE no longer revalidate constraints, but REINDEX INDEX still does so. As a consequence, REINDEX INDEX is the only form of REINDEX that fixes a failed CREATE INDEX CONCURRENTLY. Attached patch just restores the old behavior. Would it be worth preserving the ability to fix an index consistency problem with a REINDEX independent from related heap consistency problems such as duplicate keys?

Thanks, nm

-- Noah Misch EnterpriseDB http://www.enterprisedb.com

*** a/src/backend/commands/indexcmds.c
--- b/src/backend/commands/indexcmds.c
***************
*** 1768,1774 **** ReindexTable(RangeVar *relation)
  	heapOid = RangeVarGetRelidExtended(relation, ShareLock, false, false,
  									   RangeVarCallbackOwnsTable, NULL);
! 	if (!reindex_relation(heapOid, REINDEX_REL_PROCESS_TOAST))
  		ereport(NOTICE,
  				(errmsg("table \"%s\" has no indexes",
  						relation->relname)));
--- 1768,1776 ----
  	heapOid = RangeVarGetRelidExtended(relation, ShareLock, false, false,
  									   RangeVarCallbackOwnsTable, NULL);
! 	if (!reindex_relation(heapOid,
! 						  REINDEX_REL_PROCESS_TOAST |
! 						  REINDEX_REL_CHECK_CONSTRAINTS))
  		ereport(NOTICE,
  				(errmsg("table \"%s\" has no indexes",
  						relation->relname)));
***************
*** 1884,1890 **** ReindexDatabase(const char *databaseName, bool do_system, bool do_user)
  	StartTransactionCommand();
  	/* functions in indexes may want a snapshot set */
  	PushActiveSnapshot(GetTransactionSnapshot());
! 	if (reindex_relation(relid, REINDEX_REL_PROCESS_TOAST))
  		ereport(NOTICE,
  				(errmsg("table \"%s.%s\" was reindexed",
  						get_namespace_name(get_rel_namespace(relid)),
--- 1886,1894 ----
  	StartTransactionCommand();
  	/* functions in indexes may want a snapshot set */
  	PushActiveSnapshot(GetTransactionSnapshot());
! 	if (reindex_relation(relid,
! 						  REINDEX_REL_PROCESS_TOAST |
! 						  REINDEX_REL_CHECK_CONSTRAINTS))
  		ereport(NOTICE,
  				(errmsg("table \"%s.%s\" was reindexed",
  						get_namespace_name(get_rel_namespace(relid)),
*** a/src/test/regress/expected/create_index.out
--- b/src/test/regress/expected/create_index.out
***************
*** 2298,2306 ****
  COMMIT;
  BEGIN;
  CREATE INDEX std_index on concur_heap(f2);
  COMMIT;
! -- check to make sure that the failed indexes were cleaned up properly and the
! -- successful indexes are created properly. Notably that they do NOT have the
! -- invalid flag set.
  \d concur_heap
       Table "public.concur_heap"
   Column | Type | Modifiers
--- 2298,2310 ----
  COMMIT;
  BEGIN;
  CREATE INDEX std_index on concur_heap(f2);
  COMMIT;
! -- Failed builds are left invalid by VACUUM FULL, fixed by REINDEX
! VACUUM FULL concur_heap;
! REINDEX TABLE concur_heap;
! ERROR:  could not create unique index "concur_index3"
! DETAIL:  Key (f2)=(b) is duplicated.
! DELETE FROM concur_heap WHERE f1 = 'b';
! VACUUM FULL concur_heap;
  \d concur_heap
       Table "public.concur_heap"
   Column | Type | Modifiers
***************
*** 2316,2321 **** Indexes:
--- 2320,2341 ----
      "concur_index5" btree (f2) WHERE f1 = 'x'::text
      "std_index" btree (f2)
+ REINDEX TABLE concur_heap;
+ \d concur_heap
+      Table "public.concur_heap"
+  Column | Type | Modifiers
+ --------+------+-----------
+  f1     | text |
+  f2     | text |
+ Indexes:
+     "concur_index2" UNIQUE, btree (f1)
+     "concur_index3" UNIQUE, btree (f2)
+     "concur_heap_expr_idx" btree ((f2 || f1))
+     "concur_index1" btree (f2, f1)
+     "concur_index4" btree (f2) WHERE f1 = 'a'::text
+     "concur_index5" btree (f2) WHERE f1 = 'x'::text
+     "std_index" btree (f2)
  --
  -- Try some concurrent index drops
  --
*** a/src/test/regress/sql/create_index.sql
--- b/src/test/regress/sql/create_index.sql
***************
*** 721,730 ****
  BEGIN;
  CREATE INDEX std_index on concur_heap(f2);
  COMMIT;
! -- check to make sure that the failed indexes were cleaned up properly and the
! -- successful indexes are created
Re: [HACKERS] REINDEX checking of index constraints
Noah, Attached patch just restores the old behavior. Would it be worth preserving the ability to fix an index consistency problem with a REINDEX independent from related heap consistency problems such as duplicate keys? I would love to have two versions of REINDEX, one which validated and one which didn't. Maybe a ( validate off ) type check? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] REINDEX checking of index constraints
On 07/21/2013 11:30 AM, Josh Berkus wrote: Noah, Attached patch just restores the old behavior. Would it be worth preserving the ability to fix an index consistency problem with a REINDEX independent from related heap consistency problems such as duplicate keys? I would love to have two versions of REINDEX, one which validated and one which didn't. Maybe a ( validate off ) type check? Cancel this. I just did some tests, and the amount of time required for the validation (at least, in a simple two-column table test) is 10% of the time required to reindex in general. At that difference, we don't need two options. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Proposal: template-ify (binary) extensions
Salut Dimitri, On 07/20/2013 01:23 AM, Dimitri Fontaine wrote: Markus Wanner mar...@bluegap.ch writes: - per-installation (not even per-cluster) DSO availability If you install PostGIS 1.5 on a system, then it's just impossible to bring another cluster (of the same PostgreSQL major version), let On Debian, that should be well possible. Certainly installing Postgres 9.1 w/ postgis-1.5 in parallel to Postgres 9.2 w/ postgis-2.0 is. I designed it to be. I think I'm misunderstanding the problem statement, here. (of the same PostgreSQL major version) Not sure what the issue is, here, but I agree that should be possible. Can CREATE EXTENSION check if the standbys have the extension installed? And refuse creation, if they don't? No, because we don't register standbys so we don't know who they are, and also because some things that we see connected and using the replication protocol could well be pg_basebackup or pg_receivexlog. Can the standby check? In any case, these seem to be problems we can solve without affecting security. Also, it's possible that the standby is only there for High Availability purposes and runs no user query. Requiring the sysadmin to install the extensions there, too, seems justified to me. Sounds like good advice, anyways. I'm sure you are aware that even without this clear separation of roles, the guarantee means we provide an additional level of security against attackers. Given lo_import() to upload a file from the client to the server then LOAD with the absolute path where the file ended up imported (or any untrusted PL really), this argument carries no sensible weight in my opinion. lo_import() won't write a file for LOAD to load. An untrusted PL (or any other extension allowing the superuser to do that) is currently required to do that. Or to put it another way: Trusted PLs exist for a good reason. And some people just value security a lot and want that separation of roles. Nonetheless, the safe-by-default approach has served us well, I think.
That's true. We need to consider carefully the proposal at hand though. It's all about allowing the backend to automatically load a file that it finds within its own $PGDATA so that we can have per-cluster and per-database modules (DSO files). As someone mentioned previously, $PGDATA may well be mounted noexec, so that seems to be a bad choice. The only difference with before is the location where the file is read from, and the main security danger comes from the fact that we used to only consider root-writable places and now propose to consider postgres bootstrap user writable places. FWIW, I only proposed to let postgres check write permissions on libraries it loads. IIUC we don't currently do that, yet. And Postgres happily loads a world-writable library, ATM. Having the modules in different places in the system when it's a template and when it's instantiated allows us to solve a problem I forgot to list: - upgrading an extension at the OS level Once you've done that, any new backend will load the newer module (DSO file), so you have to be real quick if installing a hot fix in production and the SQL definition must be changed to match the new module version… I agree, that's a problem. Alternatively, we could solve that problem the other way around: Rather than template-ify the DSO, we could instead turn the objects created by the SQL scripts into something that's more linked to the script. Something that would reload as soon as the file on disk changes. (Note how this would make out-of-line extensions a lot closer to the in-line variant your recent patch adds? With the dependency between template and instantiation?) With the ability to instantiate the module in a per-cluster per-database directory within $PGDATA, the new version of the DSO module would only be put in place and loaded at ALTER EXTENSION UPDATE time.
I'm still OK with allowing those problems to be fixed only when a security option that defaults to 'false' has been switched to 'true', by the way, so that it's an opt-in. Okay, good. For the issues you raised, I'd clearly prefer fixes that maintain current security standards, though. but I will admit having a hard time swallowing the threat model we're talking about… An attacker having access to a libpq connection with superuser rights cannot currently run arbitrary native code. He can try a DoS by exhausting system resources, but that's pretty far from being invisible. Or he can delete valuable data. Maybe other nasty things. But he should not be able to gain root access and remove its traces. Dropping this barrier by installing an untrusted PL (or equally insecure extensions), an attacker with superuser rights can trivially gain root. Of course, an attacker shouldn't gain superuser rights in the first place. But if he did, you better stop him right there with yet another fence. It's extensions that undermine the
Re: [HACKERS] REINDEX checking of index constraints
On 07/21/2013 11:30 AM, Josh Berkus wrote: Noah, Attached patch just restores the old behavior. Would it be worth preserving the ability to fix an index consistency problem with a REINDEX independent from related heap consistency problems such as duplicate keys? I would love to have two versions of REINDEX, one which validated and one which didn't. Maybe a ( validate off ) type check? Cancel this. I just did some tests, and the amount of time required for the validation (at least, in a simple two-column table test) is 10% of the time required to reindex in general. At that difference, we don't need two options. Unless you're asking if we want a command to check the index validity without rebuilding it? That might be more valuable ... -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Proposal: template-ify (binary) extensions
On 07/21/2013 10:30 PM, Markus Wanner wrote: but I will admit having a hard time swallowing the threat model we're talking about… An attacker having access to a libpq connection with superuser rights cannot currently run arbitrary native code. He can try a DoS by exhausting system resources, but that's pretty far from being invisible. Or he can delete valuable data. Maybe other nasty things. But he should not be able to gain root access and remove its traces. Dropping this barrier by installing an untrusted PL (or equally insecure extensions), an attacker with superuser rights can trivially gain root. Could you elaborate? This is equivalent to claiming that any linux user can trivially gain root. If the sysadmin wants to disallow arbitrary execution of native code to postgres (the process), any kind of embedded compiler likely is equally unwelcome. You already mentioned untrusted PL languages, and I don't see any difference between offering PL/pythonu and PL/C on security grounds, really. I agree. However, this also means that any kind of solution it offers is not a good one for the security conscious sysadmin. This is usually the case with a security conscious sysadmin - they very seldom want to install anything. A cloud-style solution to this problem is installing the whole PostgreSQL host in its own VM and delegating all security to developers ;) Regards Markus Wanner -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
Greg, Yes, I already took a look at it briefly. The updates move in the right direction, but I can edit them usefully before commit. I'll have that done by tomorrow and send out a new version. I'm hopeful that v18 will finally be the one that everyone likes. Have you done it? -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
On Sat, Jul 20, 2013 at 6:28 PM, Greg Smith g...@2ndquadrant.com wrote: On 7/20/13 4:48 AM, didier wrote: With your tests did you try to write the hot buffers first? ie buffers with a high refcount, either by sorting them on refcount or at least sweeping the buffer list in reverse? I never tried that version. After a few rounds of seeing that all changes I tried were just rearranging the good and bad cases, I got pretty bored with trying new changes in that same style. by writing to the OS the less likely to be recycled buffers first it may have less work to do at fsync time, hopefully they have been written by the OS background task during the spread and are not re-dirtied by other backends. That is the theory. In practice write caches are so large now, there is almost no pressure forcing writes to happen until the fsync calls show up. It's easily possible to enter the checkpoint fsync phase only to discover there are 4GB of dirty writes ahead of you, ones that have nothing to do with the checkpoint's I/O. Backends are constantly pounding the write cache with new writes in situations with checkpoint spikes. The writes and fsync calls made by the checkpoint process are only a fraction of the real I/O going on. The volume of data being squeezed out by each fsync call is based on total writes to that relation since the checkpoint. That's connected to the writes to that relation happening during the checkpoint, but the checkpoint writes can easily be the minority there. It is not a coincidence that the next feature I'm working on attempts to quantify the total writes to each 1GB relation chunk. That's the most promising path forward on the checkpoint problem I've found. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
Hi, On Sat, Jul 20, 2013 at 6:28 PM, Greg Smith g...@2ndquadrant.com wrote: On 7/20/13 4:48 AM, didier wrote: That is the theory. In practice write caches are so large now, there is almost no pressure forcing writes to happen until the fsync calls show up. It's easily possible to enter the checkpoint fsync phase only to discover there are 4GB of dirty writes ahead of you, ones that have nothing to do with the checkpoint's I/O. Isn't adding another layer of cache the usual answer? The best would be in the OS: a filesystem with a big journal, able to write a lot of blocks sequentially. If not, and if you can spare at worst 2 bits in memory per data block, don't mind preallocated data files (assuming metadata are stable then), and have a working mmap(MAP_NONBLOCK) and mincore() syscalls, you could have a checkpoint in bounded time; worst case, you sequentially write the whole server RAM to a separate disk every checkpoint. Not sure I would trust such a beast with my data though :) Didier