Re: [HACKERS] New expression evaluator and indirect jumps

2017-04-01 Thread Andres Freund
Hi Jeff,

On 2017-04-01 17:36:42 -0700, Jeff Davis wrote:
> Thank you for your great work on the expression evaluator:
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b8d7f053c5c2bf2a7e8734fe3327f6a8bc711755
> 
> I was looking at the dispatch code, and it goes to significant effort
> (using computed goto) to generate many indirect jumps instead of just
> one. The reasoning is that CPUs can do better branch prediction when
> they can predict separately for each of the indirect jumps.

Right.


> But the paper here: https://hal.inria.fr/hal-01100647/document claims
> that it's not really needed on newer CPUs because they are better at
> branch prediction. I skimmed it, and if I understand correctly, modern
> branch predictors use some history, so they can predict based on the
> instructions executed before it got to the indirect jump.

Yea, it's true that the benefits on modern CPUs are smaller than they
used to be.  But, for one the branch history buffers are of very limited
size, which in many cases will make prediction an issue again.  For
another, the switch based dispatch has the issue that it'll still
perform boundary checks on the opcode, which has some performance
cost.
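
To make the difference concrete, here's a stand-alone toy contrasting the
two dispatch styles (computed gotos are a GCC/Clang extension); this is
just an illustration, not the executor code itself:

#include <stdio.h>

enum { OP_INC, OP_DEC, OP_HALT };

/* switch-based dispatch: one shared indirect jump, plus the bounds
 * check on the opcode that the compiler must emit for the switch */
static int
run_switch(const unsigned char *prog)
{
	int			acc = 0;

	for (;;)
	{
		switch (*prog++)
		{
			case OP_INC: acc++; break;
			case OP_DEC: acc--; break;
			default: return acc;	/* OP_HALT */
		}
	}
}

/* computed-goto dispatch: each handler ends in its own indirect jump,
 * so the predictor gets per-jump-site history and no bounds check */
static int
run_goto(const unsigned char *prog)
{
	static const void *dispatch[] = {&&l_inc, &&l_dec, &&l_halt};
	int			acc = 0;

	goto *dispatch[*prog++];
l_inc:
	acc++;
	goto *dispatch[*prog++];
l_dec:
	acc--;
	goto *dispatch[*prog++];
l_halt:
	return acc;
}

int
main(void)
{
	const unsigned char prog[] = {OP_INC, OP_INC, OP_DEC, OP_HALT};

	printf("%d %d\n", run_switch(prog), run_goto(prog));
	return 0;
}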


> I tried looking through the discussion on this list, but most seemed
> to revolve around which compilers generated the assembly we wanted
> rather than how much it actually improved performance. Can someone
> please point me to the numbers? Do they refute the conclusions in the
> paper, or are we concerned about a wider range of processors?

I ran a lot of benchmarks during development, and either there was no
performance difference between computed gotos and switch based
threading, or computed gotos came out ahead.  In expression-heavy cases,
e.g. TPC-H Q01, there's a considerable advantage (~3.5% total, making it
something like ~15% expression evaluation speedup).  I primarily
evaluated performance on a skylake (i.e. newer than haswell), rather
than on my older nehalem workstation, to avoid optimizing for the wrong
thing.

I am not particularly concerned about !x86 processors, but a lot of them
indeed seem to have a lot less elaborate branch predictors (especially
ARM).  Also, nehalem and sandy bridge are still quite common out there,
especially in servers.

Since the cost of maintaining the computed goto stuff isn't that high,
I'm not really concerned here.

Greetings,

Andres Freund




[HACKERS] New expression evaluator and indirect jumps

2017-04-01 Thread Jeff Davis
Andres,

Thank you for your great work on the expression evaluator:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b8d7f053c5c2bf2a7e8734fe3327f6a8bc711755

I was looking at the dispatch code, and it goes to significant effort
(using computed goto) to generate many indirect jumps instead of just
one. The reasoning is that CPUs can do better branch prediction when
they can predict separately for each of the indirect jumps.

But the paper here: https://hal.inria.fr/hal-01100647/document claims
that it's not really needed on newer CPUs because they are better at
branch prediction. I skimmed it, and if I understand correctly, modern
branch predictors use some history, so they can predict based on the
instructions executed before it got to the indirect jump.

I tried looking through the discussion on this list, but most seemed
to revolve around which compilers generated the assembly we wanted
rather than how much it actually improved performance. Can someone
please point me to the numbers? Do they refute the conclusions in the
paper, or are we concerned about a wider range of processors?

Regards,
Jeff Davis




Re: [HACKERS] Page Scan Mode in Hash Index

2017-04-01 Thread Ashutosh Sharma
Hi,

Thanks for the review.

>>> I am suspicious that _hash_kill_items() is going to have problems if
>>> the overflow page is freed before it reacquires the lock.
>>> _bt_killitems() contains safeguards against similar cases.
>>
>> I have added a similar check for hash_kill_items() as well.
>>
>
> I think hash_kill_items has a much bigger problem which is that we
> can't kill the items if the page has been modified after re-reading
> it.  Consider a case where Vacuum starts before the Scan on the bucket
> and then the Scan goes ahead (which is possible after your 0003 patch) and
> notes the killed items in the killed-items array, and before it can kill
> all the items, Vacuum removes those items.  Now it is quite possible
> that before the scan tries to kill those items, the corresponding itemids
> have been used by different tuples.  I think what we need to do here
> is to store the LSN of the page the first time we read it and
> then compare it with the current page LSN after re-reading the page in
> hash_kill_items.

Okay, understood. I have made the changes to handle this type of
scenario. Please refer to the attached patches. Thanks.
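
Roughly, the check looks like this (a sketch only; the field names here
are illustrative, not necessarily the exact ones in the attached patches):

	/* when the page is first read and the matching items are collected: */
	so->currPos.lsn = PageGetLSN(page);

	/* in _hash_kill_items(), after re-reading the page: */
	if (PageGetLSN(page) != so->currPos.lsn)
		return;			/* page changed, itemids may have been reused */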

>
> *
> + HashScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
> +} HashScanPosData;
> ..
>
> HashScanPosItem *killedItems; /* tids and offset numbers of killed items */
>   int numKilled; /* number of currently stored items */
> +
> + /*
> + * Identify all the matching items on a page and save them
> + * in HashScanPosData
> + */
> + HashScanPosData currPos; /* current position data */
>  } HashScanOpaqueData;
>
>
> After having array of HashScanPosItems as currPos.items, I think you
> don't need array of HashScanPosItem for killedItems, just an integer
> array of index in currPos.items should be sufficient.

Corrected.
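
That is, killedItems is now simply (sketch of the new representation):

	int		   *killedItems;	/* indexes into so->currPos.items[] */
	int			numKilled;		/* number of valid entries above */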

>
>
> *
> I think the below para in README is no longer valid after this patch.
>
> "To keep concurrency reasonably good, we require readers to cope with
> concurrent insertions, which means that they have to be able to
> re-find
> their current scan position after re-acquiring the buffer content lock
> on page.  Since deletion is not possible while a reader holds the pin
> on bucket, and we assume that heap tuple TIDs are unique, this can be
> implemented by searching for the same heap tuple TID previously
> returned.  Insertion does not move index entries across pages, so the
> previously-returned index entry should always be on the same page, at
> the same or higher offset number,
> as it was before."

I have modified the above paragraph in the README file.

>
> *
> - next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
> -   LH_OVERFLOW_PAGE,
> -   bstrategy);
> -
>
>   /*
> - * release the lock on previous page after acquiring the lock on next
> - * page
> + * As the hash index scan work in page at a time mode,
> + * vacuum can release the lock on previous page before
> + * acquiring lock on the next page.
>   */
> ..
> + next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
> +   LH_OVERFLOW_PAGE,
> +   bstrategy);
> +
>
> After this change, you need to modify comments on top of function
> hashbucketcleanup() and _hash_squeezebucket().
>

Done.

Please note that these patches need to be applied on top of [1].

[1] - 
https://www.postgresql.org/message-id/CAE9k0P%3D3rtgUtxopG%2BXQVxgASFzAnGd9sNmYUaj_%3DKeVsKGUdA%40mail.gmail.com

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
From 52429bb8b8ecbacd499de51235c0396ab09b17d8 Mon Sep 17 00:00:00 2001
From: ashu 
Date: Sun, 2 Apr 2017 03:38:00 +0530
Subject: [PATCH] Rewrite hash index scans to work a page at a timev6

Patch by Ashutosh
---
 src/backend/access/hash/README   |  25 +-
 src/backend/access/hash/hash.c   | 150 +++-
 src/backend/access/hash/hashpage.c   |  17 +-
 src/backend/access/hash/hashsearch.c | 450 +++
 src/backend/access/hash/hashutil.c   |  51 +++-
 src/include/access/hash.h|  48 +++-
 6 files changed, 552 insertions(+), 189 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 1541438..063656d 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -243,10 +243,11 @@ The reader algorithm is:
 -- then, per read request:
 	reacquire content lock on current page
 	step to next page if necessary (no chaining of content locks, but keep
-	 the pin on the primary bucket throughout the scan; we also maintain
-	 a pin on the page currently being scanned)
-	get tuple
-	release content lock
+	 the pin on the primary bucket throughout the scan)
+	save all the matching tuples from current index page into an items array
+	release pin and content lock (but if it is the primary bucket page,
+	retain its pin till the end of the scan)
+	get tuple from the items array
 -- at scan shutdown:
 	release all pins still held
 
@@ -254,15 +255,13 @@ Holding the buffer pin on the primary bucket page for the whole 

Re: [HACKERS] GSoC 2017 Proposal for "Explicitly support predicate locks in index access methods besides btree"

2017-04-01 Thread Kevin Grittner
On Sat, Apr 1, 2017 at 9:03 AM, anant khandelwal  wrote:

> From my findings the dependencies which create the anomalies in Snapshot
> isolation are:
>
> wr-dependencies: if T1 writes a version of an object, and T2
> reads that version, then T1 appears to have executed before T2
>  ww-dependencies: if T1 writes a version of some object, and
> T2 replaces that version with the next version, then T1 appears
> to have executed before T2
>  ww-dependencies: if T1 writes a version of some object, and
> T2 replaces that version with the next version, then T1 appears
> to have executed before T2

I think you meant to replace one of those ww-dependencies sections
with rw-dependencies -- which are, after all, the ones that cause
all the trouble :-).

> but due to the rw dependency caused by any insert into the index, indexes
> other than btree acquire a relation-level lock, which causes serialization
> failure.

That's both a bit strong and a bit narrow.  A better statement might
be that the relation-level predicate locks can lead to false
positive serialization failures.  After all, detection of a single
write to the index doesn't all by itself lead to a serialization
failure, and the problem when it does is that it might be a "false
positive" -- in some cases where the serialization failure occurs
there would not actually be a serialization anomaly without that
failure.

> So the main task is to implement page-level predicate locking in the
> remaining core index AMs
>
> * Gist Index
>
> B+tree index acquires predicate locks only on leaf pages

Well, if memory serves, we sometimes need to lock at the relation
level (e.g., an empty index), but as far as *page* locks go, that
seems correct because a scan doesn't exit early during the descent
to the leaf level.

The point is that predicate locks on indexes are intended to lock
gaps between the index entries found.  Any index entry read at the
leaf level is dead (but not yet vacuumed away), points to a tuple
not visible to the scanning snapshot, or points to a tuple which
*is* visible to the snapshot.  In the first two cases we don't need
any predicate lock; in the last one the lock on the heap tuple
covers things.  What we need the locks on the indexes for is to
cover the "gaps" into which later writes may occur.

So the need is to ensure that during any insert of an index tuple
that points to a heap tuple we might have read had it existed at the
time of the index scan, we run into a predicate lock to allow us to
recognize the rw-dependency.
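
For gist that presumably boils down to taking the page lock while the
shared buffer lock is held, something like this sketch of the relevant
spot in gistScanPage() (untested, surrounding code elided):

	buffer = ReadBuffer(scan->indexRelation, blkno);
	LockBuffer(buffer, GIST_SHARE);
	PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buffer),
					  scan->xs_snapshot);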

> But for Gist index, we have
> to consider locking at each index level as index tuples can be stored in
> buffers associated with internal pages or in leaf pages.

Yes, because a gist scan can return with a "not found" indication
before it gets to the leaf level.

> So, the functions where we need to insert a call for
>
> 1. PredicateLockPage()
>
> ->gistScanPage()
> after acquiring a shared lock on a buffer
>
> 2.CheckForSerializableConflictIn()
>
> -> gistdoinsert()
> after acquiring an exclusive lock on a target buffer and before inserting a
> tuple
>
>
> 3. PredicateLockPageSplit()
>
> ->gistplacetopage()
>
> If there is not enough space for insertion, we need to copy predicate lock
> from an old page to all new pages generated after a successful split
> operation.
>
> PredicateLockPageSplit(Relation relation, BlockNumber oldblkno, BlockNumber
> newblkno) is used by b+-tree where two pages are involved in a split
> operation. For Gist index, we can define a similar function where more
> than one page can be generated after a split operation.

We could always call the existing function more than once, but there
might be enough performance benefit to justify a new function with
an array of page references.  I would want to do it without changing
the API first, and if time permits benchmark that versus a new
function.  Without measurable benefit, I don't want to complicate
the API.
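
To sketch the two options (PredicateLockPageSplitArray is hypothetical,
and nNewPages/newblknos are invented names, just for illustration):

	/* existing two-page API, called once per new page: */
	for (int i = 0; i < nNewPages; i++)
		PredicateLockPageSplit(rel, oldblkno, newblknos[i]);

	/* versus a new array-taking entry point, if benchmarks justify it: */
	PredicateLockPageSplitArray(rel, oldblkno, newblknos, nNewPages);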

> * Gin Index
>
> Gin index consists of a B-tree index over key values, where each key is an
> element of some indexed items and each tuple in a leaf page contains either
> a posting list if the list is small enough or a pointer to posting tree.
>
> 1. PredicateLockPage()
>
> ->startscanentry()
>    before calling collectMatchBitmap()
>
> ->scanPostingTree()
>    after acquiring a shared lock on a leaf page
>
> 2. CheckForSerializableConflictIn()
>
> -> ginentryinsert()
>
> ->gininsertitempointers()
>    in case of insertion in a posting tree
>
>
> 3. PredicateLockPageSplit()
>
> -> dataBeginPlacetoPageLeaf()
>
>  after calling dataPlacetoPageLeafSplit()
>
> * Hash Index
>
> 1. PredicateLockPage()
>
> ->hashgettuple()
> ->_hash_first()
> ->_hash_readnext()
> ->_hash_readprev()
>
> 2.CheckForSerializableConflictIn()
>
> -> _hash_doinsert()
>
> 3. PredicateLockPageSplit()

I have not looked in detail at this point, but that seems plausible
and indicates admirable research effort in drafting this proposal.

> This is just an idea 

Re: [HACKERS] [PATCH] few fts functions for jsonb

2017-04-01 Thread Andrew Dunstan


On 03/31/2017 03:17 PM, Oleg Bartunov wrote:
>
>
> On 30 Mar 2017 23:43, "Dmitry Dolgov" <9erthali...@gmail.com> wrote:
>
> On 31 March 2017 at 00:01, Andrew Dunstan wrote:
> >
> > I have just noticed as I was writing/testing the non-existent docs for
> > this patch that it doesn't supply variants of to_tsvector that take a
> > regconfig as the first argument. Is there a reason for that? Why
> > should the json(b) versions be different from the text versions?
>
> No, there is no reason, I just missed that. Here is a new version
> of the patch (only the functions part)
> to add those variants.
>
>
> Congratulations on the patch being committed. Who will write the
> additional documentation? I think we need to touch the FTS and JSON parts.


I added documentation when I committed it for the new functions, in the
FTS section. I'm not sure what we need to add to the JSON section if
anything.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: [HACKERS] [PATCH] Generic type subscripting

2017-04-01 Thread Arthur Zakirov
2017-03-28 19:31 GMT+03:00 Dmitry Dolgov <9erthali...@gmail.com>:
> On 28 March 2017 at 12:08, Dmitry Dolgov <9erthali...@gmail.com> wrote:
>>
>> Wow, I didn't notice that, sorry - will fix it shortly.
>
> So, here is the corrected version of the patch.

Thank you!

The patch looks good to me. But it does not apply :) I think that is
because of the new FTS functions for jsonb. The patch needs rebasing.

But, from my point of view, it would be nice if the code mentioned
earlier was improved:

> + foreach(l, sbsref->reflowerindexpr)
> + {
> + List *expr_ai = (List *) lfirst(l);
> + A_Indices *ai = (A_Indices *) lfirst(list_tail(expr_ai));
> +
> + subexpr = (Node *) lfirst(list_head(expr_ai));

This is necessary for arrays, of course, because of the logic mentioned
in the documentation.

> If any dimension is written as a slice, i.e., contains a colon, then all 
> dimensions are treated as slices. Any dimension that has only a single number 
> (no colon) is treated as being from 1 to the number specified.

But it would be better if the SubscriptingRef structure had a new field of
List type. This field could store a list of booleans indicating whether
each subscript is a slice or not. I think it would improve the readability
of the code.
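
Something like this hypothetical sketch (the field name refslices is
invented just for illustration, not taken from the patch):

	typedef struct SubscriptingRef
	{
		/* ... existing fields elided ... */
		List	   *refupperindexpr;	/* per-dimension upper bound exprs */
		List	   *reflowerindexpr;	/* per-dimension lower bound exprs */
		List	   *refslices;			/* new: per-dimension booleans, true
										 * if the subscript was written as a
										 * slice (lower:upper) */
	} SubscriptingRef;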

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] BRIN de-summarize ranges

2017-04-01 Thread Alvaro Herrera
Alvaro Herrera wrote:

> > However, I found that when calling brin_desummarize_range
> > successively, an assertion is failed. It seems to me that it occurs
> > when desummarizing a revmap page that is already desummarized.
> 
> You're right, it's broken for that case.  Here's a fixed patch.

Pushed.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] brin autosummarization -- autovacuum "work items"

2017-04-01 Thread Jeff Janes
On Sat, Apr 1, 2017 at 10:09 AM, Alvaro Herrera wrote:

> Alvaro Herrera wrote:
>
> > I also removed the behavior that on index creation the final partial
> > block range is always summarized.  It's pointless.
>
> I just pushed this, without this change, because it breaks
> src/test/modules/brin.  I still think it's pointless, but it'd require
> more than one line to change.
>


This is failing for me (and the entire build farm, it looks like).

Cheers,

Jeff


regression.diffs
Description: Binary data



Re: [HACKERS] New CORRESPONDING clause design

2017-04-01 Thread Tom Lane
Pavel Stehule  writes:
> [ corresponding_clause_v12.patch ]

I worked on this for awhile but eventually decided that it's not very
close to being committable.  The main thing that's scaring me off is
a realization that there are a *lot* of places that assume that the
output columns of a set operation are one-for-one with the columns of
the inputs.  One such place is the subquery qual pushdown logic in
allpaths.c, which this patch hasn't touched, resulting in

regression=# create table t1 (a int, b int, c int);
CREATE TABLE
regression=# create table t2 (a int, b int, c int);
CREATE TABLE
regression=# create view vv2 as
regression-# select * from t1 union all corresponding by (b,c) select * from t2;
CREATE VIEW
regression=# select * from vv2 where c = 44;
ERROR:  wrong number of tlist entries

Another is the code that converts a UNION ALL subquery into an appendrel.
The best thing there might be to just reject CORRESPONDING in
is_simple_union_all, but this patch doesn't, which means we get the wrong
results from

regression=# create table t3 (b int, a int, c int);
CREATE TABLE
regression=# explain verbose select * from t1 union all corresponding select * 
from t3;
                             QUERY PLAN                              
---------------------------------------------------------------------
 Append  (cost=0.00..60.80 rows=4080 width=12)
   ->  Seq Scan on public.t1  (cost=0.00..30.40 rows=2040 width=12)
         Output: t1.a, t1.b, t1.c
   ->  Seq Scan on public.t3  (cost=0.00..30.40 rows=2040 width=12)
         Output: t3.b, t3.a, t3.c
(5 rows)

Notice it's failed to rearrange the columns to match.

There are also such assumptions in ruleutils.c.  Trying to reverse-list
the above view gives

regression=# \d+ vv2
 View "public.vv2"
 Column |  Type   | Collation | Nullable | Default | Storage | Description 
+-+---+--+-+-+-
 b  | integer |   |  | | plain   | 
 c  | integer |   |  | | plain   | 
View definition:
 SELECT t1.a AS b,
t1.b AS c,
t1.c
   FROM t1
UNION ALL CORRESPONDING BY (b, c)
 SELECT t2.a AS b,
t2.b AS c,
t2.c
   FROM t2;

which is obviously wrong.  The reason for that is that the code is
trying to ensure that the SELECT's output column names match the
view's column names, so it sticks AS clauses on the select-list
expressions.  If we go a little further:

regression=# alter table vv2 rename column c to ceetwo;
ALTER TABLE
regression=# \d+ vv2
 View "public.vv2"
 Column |  Type   | Collation | Nullable | Default | Storage | Description 
+-+---+--+-+-+-
 b  | integer |   |  | | plain   | 
 ceetwo | integer |   |  | | plain   | 
View definition:
 SELECT t1.a AS b,
t1.b AS ceetwo,
t1.c
   FROM t1
UNION ALL CORRESPONDING BY (b, c)
 SELECT t2.a AS b,
t2.b AS ceetwo,
t2.c
   FROM t2;

Now things are a *complete* mess, because this view definition would
successfully parse during a dump-and-reload, but it means something
else than it did before; the CORRESPONDING BY list has failed to track
the change in names of the columns.

In general, there's a lot of subtle logic in ruleutils.c about what we
need to do to ensure that reverse-listed views keep the same semantic
meaning in the face of column renamings or additions in their input
tables.  CORRESPONDING is really scary in this situation because its
semantics depend on column names.  I tried to break it by adding/renaming
columns of the input tables to see if I could get ruleutils to print
something that didn't mean what it meant before.  I didn't succeed
immediately, but I am thinking it would be smart for us always to
reverse-list the construct as CORRESPONDING BY with an explicit list of
column names.  We don't ever reverse-list a NATURAL JOIN as such,
preferring to use an explicit JOIN USING list, for closely related
reasons.  (You might care to study the logic in ruleutils.c around the
usingNames list, which exists to ensure that JOIN USING doesn't get broken
by column renames.  It's entirely likely that we're going to need
something equivalently complicated for CORRESPONDING BY.  Or if we
don't, I'd want to have some comments in there that clearly explain
why we don't, because otherwise somebody will break it in future.)

I also found that WITH RECURSIVE fails immediately if you stick
CORRESPONDING into the recursive union step; eg in this lightly
modified version of a query from with.sql,

CREATE TEMPORARY TABLE tree(
id INTEGER PRIMARY KEY,
parent_id INTEGER REFERENCES tree(id)
);

WITH RECURSIVE t(id, path) AS (
select 1 as id, ARRAY[]::integer[] as path
UNION ALL CORRESPONDING
SELECT tree.id, t.path || tree.id as path
FROM tree JOIN t ON (tree.parent_id = t.id)
)
SELECT t1.*, t2.* 

Re: [HACKERS] BRIN de-summarize ranges

2017-04-01 Thread Alvaro Herrera
Seki, Eiji wrote:

> However, I found that when calling brin_desummarize_range
> successively, an assertion is failed. It seems to me that it occurs
> when desummarizing a revmap page that is already desummarized.

You're right, it's broken for that case.  Here's a fixed patch.


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From d584e88386dd8c775a7b147b3dba328c26585858 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 28 Feb 2017 01:45:21 -0300
Subject: [PATCH] Support BRIN de-summarization
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When the BRIN summary tuple for a page range becomes too wide for the
values actually stored (because the tuples that were present
originally are no longer present due to updates or deletes) it can be
useful to remove the outdated summary tuple, so that a future
summarization can install a tighter summary.

This commit introduces a SQL-callable interface to do so.

(No user-level docs on the new function yet; those will be provided by
Simon Riggs later)

Author: Álvaro Herrera
Reviewed-by: Eiji Seki
Discussion: https://postgr.es/m/20170228045643.n2ri74ara4fhhfxf@alvherre.pgsql
---
 src/backend/access/brin/brin.c |  63 +++
 src/backend/access/brin/brin_revmap.c  | 140 -
 src/backend/access/brin/brin_xlog.c|  43 ++
 src/backend/access/rmgrdesc/brindesc.c |  10 +++
 src/include/access/brin_revmap.h   |   1 +
 src/include/access/brin_xlog.h |  20 -
 src/include/catalog/pg_proc.h  |   2 +
 src/test/regress/sql/brin.sql  |   8 ++
 8 files changed, 283 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 86e73b6..7a33607 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -909,6 +909,69 @@ brin_summarize_range(PG_FUNCTION_ARGS)
 }
 
 /*
+ * SQL-callable interface to mark a range as no longer summarized
+ */
+Datum
+brin_desummarize_range(PG_FUNCTION_ARGS)
+{
+	Oid			indexoid = PG_GETARG_OID(0);
+	BlockNumber heapBlk = PG_GETARG_INT64(1);
+	Oid			heapoid;
+	Relation	heapRel;
+	Relation	indexRel;
+	bool		done;
+
+	/*
+	 * We must lock table before index to avoid deadlocks.  However, if the
+	 * passed indexoid isn't an index then IndexGetRelation() will fail.
+	 * Rather than emitting a not-very-helpful error message, postpone
+	 * complaining, expecting that the is-it-an-index test below will fail.
+	 */
+	heapoid = IndexGetRelation(indexoid, true);
+	if (OidIsValid(heapoid))
+		heapRel = heap_open(heapoid, ShareUpdateExclusiveLock);
+	else
+		heapRel = NULL;
+
+	indexRel = index_open(indexoid, ShareUpdateExclusiveLock);
+
+	/* Must be a BRIN index */
+	if (indexRel->rd_rel->relkind != RELKIND_INDEX ||
+		indexRel->rd_rel->relam != BRIN_AM_OID)
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("\"%s\" is not a BRIN index",
+						RelationGetRelationName(indexRel))));
+
+	/* User must own the index (comparable to privileges needed for VACUUM) */
+	if (!pg_class_ownercheck(indexoid, GetUserId()))
+		aclcheck_error(ACLCHECK_NOT_OWNER, ACL_KIND_CLASS,
+					   RelationGetRelationName(indexRel));
+
+	/*
+	 * Since we did the IndexGetRelation call above without any lock, it's
+	 * barely possible that a race against an index drop/recreation could
+	 * have netted us the wrong table.  Recheck.
+	 */
+	if (heapRel == NULL || heapoid != IndexGetRelation(indexoid, false))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_TABLE),
+				 errmsg("could not open parent table of index %s",
+						RelationGetRelationName(indexRel))));
+
+	/* the revmap does the hard work */
+	do
+	{
+		done = brinRevmapDesummarizeRange(indexRel, heapBlk);
+	} while (!done);
+
+	relation_close(indexRel, ShareUpdateExclusiveLock);
+	relation_close(heapRel, ShareUpdateExclusiveLock);
+
+	PG_RETURN_VOID();
+}
+
+/*
  * Build a BrinDesc used to create or scan a BRIN index
  */
 BrinDesc *
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 5d45b48..35e53a2 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -168,9 +168,12 @@ brinSetHeapBlockItemptr(Buffer buf, BlockNumber pagesPerRange,
iptr = (ItemPointerData *) 

[HACKERS] Re: GSoC 2017 Proposal for "Explicitly support predicate locks in index access methods besides btree"

2017-04-01 Thread anant khandelwal
Also, can I use the testing tool residing at src/test/isolation, or
should I use the testing tool dtester from Markus Wanner?

On Sat, Apr 1, 2017 at 7:33 PM, anant khandelwal wrote:

> Hi guys,
>
> My name is Anant Khandelwal. Currently I am pursuing a master's at IIT
> Delhi, and previously I was a software engineer.
>
> I am particularly interested in working on the project "Explicitly support
> predicate locks in index access methods besides btree". I have gone through
>
> http://hdl.handle.net/2123/5353
> http://drkp.net/papers/ssi-vldb12.pdf
>
> to understand how serializable is implemented through the use of snapshot
> isolation in PostgreSQL, and have also gone through the codebase to get
> some idea of the different access methods for gin, gist, and hash indexes.
> I want to discuss my idea first; then I will move on to the proposal.
>
> Sorry for the delay; I have been busy with academics.
>
> From my findings the dependencies which create the anomalies in Snapshot
> isolation are:
>
> *wr-dependencies*: if T1 writes a version of an object, and T2
> reads that version, then T1 appears to have executed before T2
> *ww-dependencies*: if T1 writes a version of some object, and
> T2 replaces that version with the next version, then T1 appears
> to have executed before T2
> *ww-dependencies*: if T1 writes a version of some object, and
> T2 replaces that version with the next version, then T1 appears
> to have executed before T2
>
> where T1 and T2 are the transaction
>
> but due to the rw dependency caused by any insert into the index, indexes
> other than btree acquire a relation-level lock, which causes
> serialization failure. So the main task is to implement page-level predicate
> locking in the remaining core index AMs.
>
> * Gist Index
>
> B+tree index acquires predicate locks only on leaf pages as index tuples
> with record pointers are stored on leaf pages. But for Gist index, we have
> to consider locking at each index level as index tuples can be stored in
> buffers associated with internal pages or in leaf pages.
> So, the functions where we need to insert a call for
>
> 1. PredicateLockPage()
>
> ->gistScanPage()
> after acquiring a shared lock on a buffer
>
> 2.CheckForSerializableConflictIn()
>
> -> gistdoinsert()
> after acquiring an exclusive lock on a target buffer and before inserting a
> tuple
>
>
> 3. PredicateLockPageSplit()
>
> ->gistplacetopage()
>
> If there is not enough space for insertion, we need to copy predicate lock
> from an old page to all new pages generated after a successful split
> operation.
>
> PredicateLockPageSplit(Relation relation, BlockNumber oldblkno,
> BlockNumber newblkno) is used by b+-tree where two pages are involved
> in a split operation. For Gist index, we can define a similar function
> where more than one page can be generated after a split operation.
>
> * Gin Index
>
> Gin index consists of a B-tree index over key values, where each key is an
> element of some indexed items and each tuple in a leaf page contains either
> a posting list if the list is small enough or a pointer to posting tree.
>
> 1. PredicateLockPage()
>
> ->startscanentry()
>    before calling collectMatchBitmap()
>
> ->scanPostingTree()
>    after acquiring a shared lock on a leaf page
>
> 2. CheckForSerializableConflictIn()
>
> -> ginentryinsert()
>
> ->gininsertitempointers()
>    in case of insertion in a posting tree
>
>
> 3. PredicateLockPageSplit()
>
> -> dataBeginPlacetoPageLeaf()
>
>  after calling dataPlacetoPageLeafSplit()
>
> * Hash Index
>
> 1. PredicateLockPage()
>
> ->hashgettuple()
> ->_hash_first()
> ->_hash_readnext()
> ->_hash_readprev()
>
> 2.CheckForSerializableConflictIn()
>
> -> _hash_doinsert()
>
> 3. PredicateLockPageSplit()
>
>
> This is just an idea. I also understand that page-level predicate locking
> exists in the btree AM, where it required the addition of 17 function calls
> to implement, but remains unimplemented in the gin, gist, spgist, brin, and
> hash index AMs. So we need to insert function calls at other places as well.
>
> Also, tell me: can we evaluate the performance of PostgreSQL on the
> following workloads?
>
>    - SIBENCH microbenchmark
>    - TPC-C++


Re: [HACKERS] brin autosummarization -- autovacuum "work items"

2017-04-01 Thread Alvaro Herrera
Alvaro Herrera wrote:

> I also removed the behavior that on index creation the final partial
> block range is always summarized.  It's pointless.

I just pushed this, without this change, because it breaks
src/test/modules/brin.  I still think it's pointless, but it'd require
more than one line to change.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] brin autosummarization -- autovacuum "work items"

2017-04-01 Thread Alvaro Herrera
Robert Haas wrote:
> On Tue, Mar 21, 2017 at 4:10 PM, Alvaro Herrera
>  wrote:
> > Well, the number of work items is currently fixed; but if you have many
> > BRIN indexes, then you'd overflow (lose requests).  By using DSA I am
> > making it easy to patch this afterwards so that an arbitrary number of
> > requests can be recorded.
> 
> But that could also use an arbitrarily large amount of memory, and any
> leaks will be cluster-lifespan.

Good point -- probably not such a great idea as presented.  The patch
only uses a fixed amount of memory currently, so it should be fine on
that front.  I think this is a good enough start.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint

2017-04-01 Thread Magnus Hagander
On Fri, Mar 31, 2017 at 8:59 AM, Magnus Hagander wrote:

>
> On Wed, Mar 29, 2017 at 1:05 PM, Michael Banck wrote:
>
>> Hi,
>>
>> Am Montag, den 27.02.2017, 16:20 +0100 schrieb Magnus Hagander:
>> > On Sun, Feb 26, 2017 at 9:59 PM, Tom Lane  wrote:
>> > Is there an argument for back-patching this?
>> >
>> >
>> > Seems you were typing that at the same time as we did.
>> >
>> >
>> > I'm considering it, but not swayed in either direction. Should I take
>> > your comment as a vote that we should back-patch it?
>>
>> I've checked back into this thread, and there seems to be a +1 from Tom
>> and a +(0.5-1) from Simon for backpatching, and no obvious -1s. Did you
>> decide against it in the end, or is this still an open item?
>
>
> No, I plan to work on it, so it's still an open item. I've been backlogged
> with other things, but I will try to get to it soon.
>
> (This also includes considering Jeff's note)
>
>
I've backpatched it to 9.4. Prior to that, pretty much the entire patch
conflicts, so it would need a full rewrite.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint

2017-04-01 Thread Magnus Hagander
On Mon, Feb 27, 2017 at 7:46 PM, Jeff Janes  wrote:

> On Sun, Feb 26, 2017 at 12:32 PM, Magnus Hagander wrote:
>
>>
>> On Sun, Feb 26, 2017 at 8:27 PM, Michael Banck wrote:
>>
>>> Hi,
>>>
>>> Am Dienstag, den 14.02.2017, 18:18 -0500 schrieb Robert Haas:
>>> > On Tue, Feb 14, 2017 at 4:06 PM, Alvaro Herrera
>>> >  wrote:
>>> > > I'd rather have a --quiet mode instead.  If you're running it by
>>> hand,
>>> > > you're likely to omit the switch, whereas when writing the cron job
>>> > > you're going to notice lack of switch even before you let the job run
>>> > > once.
>>> >
>>> > Well, that might've been a better way to design it, but changing it
>>> > now would break backward compatibility and I'm not really sure that's
>>> > a good idea.  Even if it is, it's a separate concern from whether or
>>> > not in the less-quiet mode we should point out that we're waiting for
>>> > a checkpoint on the server side.
>>>
>>> ISTM the consensus is that there should be no output in regular mode,
>>> but a message should be displayed in verbose and progress mode.
>>>
>>> So I went forth and also added a message in progress mode (unless
>>> verbose messages are requested anyway).
>>>
>>> Regarding the documentation, I tried to clarify the difference between
>>> the checkpoint types a bit more, but I think any further action is
>>> probably a larger rewrite of the documentation on this topic.
>>>
>>> So attached are two patches, I've split it up in the documentation and
>>> the code output part. I'll add it as one commitfest entry in the
>>> "Clients" section though, as it's not really a big patch, unless
>>> somebody thinks it should have a second entry in "Documentation"?
>>
>>
>> Agreed, and applied as one patch. Except I noticed you also fixed a
>> couple of entries which were missing the progname in the messages -- I
>> broke those out to a separate patch instead.
>>
>> Made a small change to "using as much I/O as available" rather than "as
>> possible", which I think is a better wording, along with the change of the
>> idle wording I suggested before. (but feel free to point it out to me if
>> that's wrong).
>>
>
> Should the below fprintf end in a \r rather than a \n, so that the
> progress message gets over-written once the checkpoint is done and we have
> moved on?
>
> if (showprogress && !verbose)
> fprintf(stderr, "waiting for checkpoint\n");
>
> That would seem more in keeping with how the other progress messages
> operate.
>
>
Agreed, that makes more sense. I've pushed a patch that does this.
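
(For the archives, the change is essentially:

	if (showprogress && !verbose)
		fprintf(stderr, "waiting for checkpoint\r");

so the next progress line overwrites the message once the checkpoint
is done.)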

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] [COMMITTERS] pgsql: Avoid GatherMerge crash when there are no workers.

2017-04-01 Thread Robert Haas
On Fri, Mar 31, 2017 at 10:26 PM, Andres Freund  wrote:
> Hi,
>
> On 2017-04-01 01:22:14 +, Robert Haas wrote:
>> Avoid GatherMerge crash when there are no workers.
>
> I think the gather merge code needs a bit more test coverage (sorry to
> make this a larger theme today).  As shown by
> https://coverage.postgresql.org/src/backend/executor/nodeGatherMerge.c.gcov.html
> we don't actually merge anything (heap_compare_slots is not exercised).

Sounds reasonable.

> I btw also wonder if it's good that we have a nearly identical copy of
> heap_compare_slots and a bunch of the calling code in both
> nodeMergeAppend.c and nodeGatherMerge.c.  On the other hand, it's not
> heavily evolving code.

Yeah, I don't know.  We could alternatively try to move that to some
common location and merge the two implementations.  I'm not sure
exactly where, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Multiple false-positive warnings from Valgrind

2017-04-01 Thread Michael Paquier
On Sat, Apr 1, 2017 at 2:51 PM, Noah Misch  wrote:
> Does this remove the noise under --with-openssl?
>
> --- a/src/port/pg_strong_random.c
> +++ b/src/port/pg_strong_random.c
> @@ -104,7 +104,10 @@ pg_strong_random(void *buf, size_t len)
>  */
>  #if defined(USE_OPENSSL_RANDOM)
> if (RAND_bytes(buf, len) == 1)
> +   {
> +   VALGRIND_MAKE_MEM_DEFINED(buf, len);
> return true;
> +   }
> return false;
>
> /*

Actually, no. I'll dig more into the options of valgrind to see if I
am missing something here..
-- 
Michael




[HACKERS] GSoC 2017 Proposal for "Explicitly support predicate locks in index access methods besides btree"

2017-04-01 Thread anant khandelwal
Hi guys,

My name is Anant Khandelwal. Currently I am pursuing a master's at IIT
Delhi, and previously I was a software engineer.

I am particularly interested in working on the project "Explicitly support
predicate locks in index access methods besides btree". I have gone through

http://hdl.handle.net/2123/5353
http://drkp.net/papers/ssi-vldb12.pdf

to understand how serializable is implemented through the use of snapshot
isolation in PostgreSQL, and have also gone through the codebase to get
some idea of the different access methods for gin, gist, and hash indexes.
I want to discuss my idea first; then I will move on to the proposal.

Sorry for the delay; I have been busy with academics.

From my findings the dependencies which create the anomalies in Snapshot
isolation are:

*wr-dependencies*: if T1 writes a version of an object, and T2
reads that version, then T1 appears to have executed before T2
*ww-dependencies*: if T1 writes a version of some object, and
T2 replaces that version with the next version, then T1 appears
to have executed before T2
*ww-dependencies*: if T1 writes a version of some object, and
T2 replaces that version with the next version, then T1 appears
to have executed before T2

where T1 and T2 are the transaction

but due to the rw dependency caused by any insert into the index, indexes
other than btree acquire a relation-level lock, which causes
serialization failure. So the main task is to implement page-level predicate
locking in the remaining core index AMs.

* Gist Index

B+tree index acquires predicate locks only on leaf pages as index tuples
with record pointers are stored on leaf pages. But for Gist index, we have
to consider locking at each index level as index tuples can be stored in
buffers associated with internal pages or in leaf pages.
So, the functions where we need to insert a call for

1. PredicateLockPage()

->gistScanPage()
after acquiring a shared lock on a buffer

2.CheckForSerializableConflictIn()

-> gistdoinsert()
after acquiring an exclusive lock on a target buffer and before inserting a
tuple


3. PredicateLockPageSplit()

->gistplacetopage()

If there is not enough space for insertion, we need to copy predicate lock
from an old page to all new pages generated after a successful split
operation.

PredicateLockPageSplit(Relation relation, BlockNumber oldblkno, BlockNumber
newblkno) is used by b+-tree where two pages are involved in a split
operation. For Gist index, we can define a similar function where more
than one page can be generated after split operation.

* Gin Index

Gin index consists of a B-tree index over key values, where each key is an
element of some indexed items and each tuple in a leaf page contains either
a posting list if the list is small enough or a pointer to posting tree.

1. PredicateLockPage()

->startscanentry()
   before calling collectMatchBitmap()

->scanPostingTree()
   after acquiring a shared lock on a leaf page

2.CheckForSerializableConflictIn()

-> ginentryinsert()

->gininsertitempointers()
   in case of insertion in a posting tree


3. PredicateLockPageSplit()

-> dataBeginPlacetoPageLeaf()

 after calling dataPlacetoPageLeafSplit()

* Hash Index

1. PredicateLockPage()

->hashgettuple()
->_hash_first()
->_hash_readnext()
->_hash_readprev()

2.CheckForSerializableConflictIn()

-> _hash_doinsert()

3. PredicateLockPageSplit()


This is just an idea. I also understand that page-level predicate locking
exists in the btree AM, where it required the addition of 17 function calls
to implement, but remains unimplemented in the gin, gist, spgist, brin, and
hash index AMs. So we need to insert function calls at other places as well.

Also, tell me: can we evaluate the performance of PostgreSQL on the
following workloads?

   - SIBENCH microbenchmark
   - TPC-C++


Re: [HACKERS] [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint

2017-04-01 Thread Ashutosh Sharma
On Apr 1, 2017 18:11, "Amit Kapila"  wrote:

On Sat, Apr 1, 2017 at 6:00 AM, Robert Haas  wrote:
> On Fri, Mar 31, 2017 at 6:09 PM, Ashutosh Sharma wrote:
>> Well, That is another option but the main idea was to be inline with
>> the btree code.
>
> That's not a bad goal in principal, but _bt_killitems() doesn't have
> any similar argument.
>
> (Also, speaking of consistency, why did we end up with
> _hash_kill_items, with an underscore between kill and items, and
> _bt_killitems, without one?)
>
>> Moreover, I am not sure if acquiring lwlock inside
>> hashendscan (basically the place where we are trying to close down the
>> things) would look good.
>
> Well, if that's not a good thing to do, hiding it inside some other
> function doesn't make it better.  I think it's fine, though.
>

One thing to note here is that this fix won't be required if we get
the page-scan-mode patch [1] in this CF.  I think if we fix this with
the patch proposed by Ashutosh, then we will anyway need to change
the related code again (kind of reverting the fix) after the
page-scan-mode patch.  Now, if you think we have a negligible chance
of getting that patch, then it is good to proceed with this fix.


Yes, I had already mentioned this point in my very first mail. Thanks for
highlighting this once again.

[1] - https://commitfest.postgresql.org/13/999/


Re: [HACKERS] [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint

2017-04-01 Thread Amit Kapila
On Sat, Apr 1, 2017 at 6:00 AM, Robert Haas  wrote:
> On Fri, Mar 31, 2017 at 6:09 PM, Ashutosh Sharma wrote:
>> Well, That is another option but the main idea was to be inline with
>> the btree code.
>
> That's not a bad goal in principal, but _bt_killitems() doesn't have
> any similar argument.
>
> (Also, speaking of consistency, why did we end up with
> _hash_kill_items, with an underscore between kill and items, and
> _bt_killitems, without one?)
>
>> Moreover, I am not sure if acquiring lwlock inside
>> hashendscan (basically the place where we are trying to close down the
>> things) would look good.
>
> Well, if that's not a good thing to do, hiding it inside some other
> function doesn't make it better.  I think it's fine, though.
>

One thing to note here is that this fix won't be required if we get
the page-scan-mode patch [1] in this CF.  I think if we fix this with
the patch proposed by Ashutosh, then we will anyway need to change
the related code again (kind of reverting the fix) after the
page-scan-mode patch.  Now, if you think we have a negligible chance
of getting that patch, then it is good to proceed with this fix.

[1] - https://commitfest.postgresql.org/13/999/

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Partitioning vs ON CONFLICT

2017-04-01 Thread Rukh Meski
On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas  wrote:
> On Fri, Mar 31, 2017 at 5:33 PM, Peter Geoghegan  wrote:
>> In my opinion, for the very limited ON CONFLICT DO NOTHING + no
>> inference specification case, the implementation should not care about
>> the presence or absence of unique indexes within or across partitions.
>
> Hmm.  That's an interesting point.  The documentation says:
>
> ON CONFLICT can be used to specify an alternative action to raising a
> unique constraint or exclusion constraint violation error.
>
> And, indeed, you could get a unique constraint or exclusion error
> because of an index on the child even though it's not global to the
> partitioning hierarchy.  So maybe we can support this after all, but
> having messed it up once, I'm inclined to think we should postpone
> this to v11, think it over some more, and try to make sure that our
> second try doesn't crash...

Naturally this means that the partitioning work will be reverted as
well, since we have a consensus that new features shouldn't make
preexisting ones worse. It's a shame, since I was really hoping to see
it in 10.0.

♜




Re: [HACKERS] Page Scan Mode in Hash Index

2017-04-01 Thread Amit Kapila
On Mon, Mar 27, 2017 at 7:04 PM, Ashutosh Sharma  wrote:
>>
>> I am suspicious that _hash_kill_items() is going to have problems if
>> the overflow page is freed before it reacquires the lock.
>> _bt_killitems() contains safeguards against similar cases.
>
> I have added a similar check for hash_kill_items() as well.
>

I think hash_kill_items has a much bigger problem which is that we
can't kill the items if the page has been modified after re-reading
it.  Consider a case where Vacuum starts before the Scan on the bucket
and then the Scan goes ahead (which is possible after your 0003 patch) and
notes the killed items in the killed-items array, and before it can kill
all the items, Vacuum removes those items.  Now it is quite possible
that before the scan tries to kill those items, the corresponding itemids
have been used by different tuples.  I think what we need to do here
is to store the LSN of the page the first time we read it and
then compare it with the current page LSN after re-reading the page in
hash_kill_items.

*
+ HashScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+} HashScanPosData;
..

HashScanPosItem *killedItems; /* tids and offset numbers of killed items */
  int numKilled; /* number of currently stored items */
+
+ /*
+ * Identify all the matching items on a page and save them
+ * in HashScanPosData
+ */
+ HashScanPosData currPos; /* current position data */
 } HashScanOpaqueData;


After having array of HashScanPosItems as currPos.items, I think you
don't need array of HashScanPosItem for killedItems, just an integer
array of index in currPos.items should be sufficient.


*
I think the below para in README is no longer valid after this patch.

"To keep concurrency reasonably good, we require readers to cope with
concurrent insertions, which means that they have to be able to
re-find
their current scan position after re-acquiring the buffer content lock
on page.  Since deletion is not possible while a reader holds the pin
on bucket, and we assume that heap tuple TIDs are unique, this can be
implemented by searching for the same heap tuple TID previously
returned.  Insertion does not move index entries across pages, so the
previously-returned index entry should always be on the same page, at
the same or higher offset number,
as it was before."

*
- next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-   LH_OVERFLOW_PAGE,
-   bstrategy);
-

  /*
- * release the lock on previous page after acquiring the lock on next
- * page
+ * As the hash index scan work in page at a time mode,
+ * vacuum can release the lock on previous page before
+ * acquiring lock on the next page.
  */
..
+ next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+   LH_OVERFLOW_PAGE,
+   bstrategy);
+

After this change, you need to modify comments on top of function
hashbucketcleanup() and _hash_squeezebucket().



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] parallel bitmapscan isn't exercised in regression tests

2017-04-01 Thread Dilip Kumar
On Sat, Apr 1, 2017 at 12:16 AM, Andres Freund  wrote:
> Hi,
>
> The parallel code-path isn't actually exercised in the tests added in
> [1], as evidenced by [2] (they just explain).  That imo needs to be
> fixed.

Thanks for reporting. Attached patch fixes that.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


parallel_bitmap_test.patch
Description: Binary data



Re: [HACKERS] [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint

2017-04-01 Thread Ashutosh Sharma
Hi Andreas,

On Apr 1, 2017 16:15, "Andreas Seltenreich"  wrote:

Andreas Seltenreich writes:
>>> TRAP: FailedAssertion("!(LWLockHeldByMe(((LWLock*)
>>> (&(bufHdr)->content_lock))))", File: "bufmgr.c", Line: 3397)
> I got about one TRAP per hour when testing on 20 nodes with one postgres
> and 5 sqlsmithes on each.

> Ashutosh Sharma writes:
>> [2. reacquire_lock_hashkillitems_if_required.patch]

I've done about 12 hours of testing since applying this patch and no
failed assertions so far.



Thanks for testing my patch. Just FYI, I have slightly changed my patch
this morning as per Robert's suggestions, but the logic still remains
more or less the same. Thank you once again.


Re: [HACKERS] [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint

2017-04-01 Thread Andreas Seltenreich
Andreas Seltenreich writes:
>>> TRAP: FailedAssertion("!(LWLockHeldByMe(((LWLock*)
>>> (&(bufHdr)->content_lock))))", File: "bufmgr.c", Line: 3397)
> I got about one TRAP per hour when testing on 20 nodes with one postgres
> and 5 sqlsmithes on each.

> Ashutosh Sharma writes:
>> [2. reacquire_lock_hashkillitems_if_required.patch]

I've done about 12 hours of testing since applying this patch and no
failed assertions so far.

regards,
Andreas




Re: [HACKERS] PATCH: pageinspect / add page_checksum and bt_page_items(bytea)

2017-04-01 Thread Ashutosh Sharma
Hi Tomas,

On Fri, Mar 31, 2017 at 11:05 PM, Tomas Vondra wrote:
> On 03/31/2017 06:01 PM, Ashutosh Sharma wrote:
>>
>>
>> It seems like the latest patch(v4) shared by Tomas upthread is an
>> empty patch. If I am not wrong, please share the correct patch.
>> Thanks.
>>
>
> OMG, this is the second time I managed to generate an empty patch. I
> really need to learn not to do that ..
>
> Attached is v5 of the patch, hopefully correct this time ;-)

Yes, it is correct now. I have spent some time reviewing your patch today,
and my review comments are as follows.

1. It would be good to have a blank line in between the following set of
lines describing bt_page_print_tuples:

+/*
+ * bt_page_print_tuples
+ * form a tuple describing index tuple at a given offset
+ */


Something like this,

+/* ------------------------------------------------------
+ * bt_page_print_tuples()
+ *
+ * Form a tuple describing index tuple at a given offset
+ * ------------------------------------------------------
+ */

Please note that, if you do not agree with my suggestion, you may ignore it
:)

2. I think the following comments for bt_page_items need to be moved to
the right place in the code.

/*-------------------------------------------------------
 * bt_page_items()
 *
 * Get IndexTupleData set in a btree page
 *
 * Usage: SELECT * FROM bt_page_items('t1_pkey', 1);
 *-------------------------------------------------------
 */

/*
 * cross-call data structure for SRF
 */

struct user_args
{
	Page		page;
	OffsetNumber offset;
};

With your patch, the above structure definition is followed by the
definition of bt_page_print_tuples(), not bt_page_items(), hence you may
need to put the comments for bt_page_items just above its definition.

3. I am seeing a server crash when running the SQL statement 'SELECT * FROM
bt_page_items('bt_i4_index', 1);'. Here is the backtrace.

I think this crash is happening because, in bt_page_items, you have added
this if statement without first reading the opaque region of the page:

	if (P_ISMETA(opaque))
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("block is a meta page")));

Please move the above if statement below the following line to get rid of
the crash:

opaque = (BTPageOpaque) PageGetSpecialPointer(uargs->page);

OR, the best thing to do would be to retain the earlier if statement for
checking meta page because that way you can avoid reading a
buffer unnecessarily.
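
That is, with the first option, the reordered code would look roughly like:

	opaque = (BTPageOpaque) PageGetSpecialPointer(uargs->page);

	if (P_ISMETA(opaque))
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("block is a meta page")));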

#0  0x7f592975030c in bt_page_items (fcinfo=0x7fffe5fdb860) at
btreefuncs.c:352
352 if (P_ISMETA(opaque))
Missing separate debuginfos, use: debuginfo-install glibc-2.17-78.el7.x86_64
(gdb) bt
#0  0x7f592975030c in bt_page_items (fcinfo=0x7fffe5fdb860) at
btreefuncs.c:352
#1  0x006797e0 in ExecMakeTableFunctionResult (setexpr=0x21269f8,
econtext=0x2126738, argContext=0x2136c98, expectedDesc=0x2155178,
randomAccess=0 '\000') at execSRF.c:230
#2  0x00689dda in FunctionNext (node=0x2126628) at
nodeFunctionscan.c:94
#3  0x00678f0e in ExecScanFetch (node=0x2126628, accessMtd=0x689d23
, recheckMtd=0x68a108 ) at execScan.c:95
#4  0x00678f7d in ExecScan (node=0x2126628, accessMtd=0x689d23
, recheckMtd=0x68a108 ) at execScan.c:143

4. If you agree with the reason mentioned in comment #3, please remove the
following if statement from bt_page_items():

	if (P_ISMETA(opaque))
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("block is a meta page")));

Apart from the above comments, your patch looks good to me. I have also
marked this patch as 'Waiting for Author' in the commitfest. Thanks.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com


Re: [HACKERS] [POC] A better way to expand hash indexes.

2017-04-01 Thread Mithun Cy
On Sat, Apr 1, 2017 at 12:31 PM, Mithun Cy  wrote:

> Also adding a patch which implements the 2nd way.
Sorry, I forgot to attach the sortbuild_hash patch, which also needs
similar changes for the hash_mask.


-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


sortbuild_hash_A_2.patch
Description: Binary data


yet_another_expand_hashbucket_efficiently_15.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] make check-world output

2017-04-01 Thread Noah Misch
On Sat, Mar 11, 2017 at 11:14:36AM -0500, Tom Lane wrote:
> Alvaro Herrera  writes:
> > Tom Lane wrote:
> >> What about just reverting 2f227656076a?
> 
> > That works for me too, if we think we no longer need that level of
> > detail.
> 
> A general issue with this sort of messaging is that when things are
> working more or less normally, you'd just as soon not see it ... but
> when you need to debug problems with the test scaffolding itself,
> you want verbosity.

Yep, and it's important for buildfarm logs to capture the verbose output.

> An entirely different big-picture point is "what the heck are we
> doing using a shell script here at all?  It is useless for testing
> on Windows".  Somebody should look into rewriting it in Perl,
> perhaps using the facilities of PostgresNode.pm.

The pg_upgrade test suite originated in an age when "make check-world" was
forbidden to depend on Perl; the choice was a shell script or a C program.  We
do maintain vcregress.pl:upgradecheck(), a Windows-specific Perl port of the
suite.  Maintaining both versions has been tedious.  I'd welcome a
high-quality rewrite using src/test/perl facilities.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Scan Mode in Hash Index

2017-04-01 Thread Ashutosh Sharma
Hi,

>
> On 03/29/2017 09:16 PM, Ashutosh Sharma wrote:
>>>
>>> This patch needs a rebase.
>>
>>
>> Please try applying these patches on top of [1]. I think you should be
>> able
>> to apply it cleanly. Sorry, I think I forgot to mention this point in my
>> earlier mail.
>>
>> [1] -
>>
>> https://www.postgresql.org/message-id/CAE9k0P%3DV2LhtyeMXd295fhisp%3DNWUhRVJ9EZQCDowWiY9rSohQ%40mail.gmail.com
>>
>
> Thanks, that works!
>
> As you have provided a patch for Robert's comments, and no other reviews
> have been posted, I'm moving this patch to "Ready for Committer" for
> additional committer feedback.

Please find attached the new version of the patches for page scan mode
in hash index, rebased on top of [1].

[1] - 
https://www.postgresql.org/message-id/CAE9k0P%3D3rtgUtxopG%2BXQVxgASFzAnGd9sNmYUaj_%3DKeVsKGUdA%40mail.gmail.com
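
For readers following along, here is a minimal sketch (my own
simplification, not code from the attached patches) of the
page-at-a-time idea described in the README hunk below: while the page
is locked, copy every matching tuple into a private items array, then
release the lock and return tuples from the array one by one. The
field names of HashScanPosItem are assumed here by analogy with
btree's BTScanPosItem.

#include "postgres.h"

#include "access/hash.h"
#include "access/relscan.h"
#include "storage/bufpage.h"

static int
sketch_load_page_items(IndexScanDesc scan, Page page,
					   HashScanPosItem *items)
{
	OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
	OffsetNumber off;
	int			nitems = 0;

	for (off = FirstOffsetNumber; off <= maxoff;
		 off = OffsetNumberNext(off))
	{
		IndexTuple	itup;

		itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, off));

		/* remember only tuples whose hash code matches the scan keys */
		if (_hash_checkqual(scan, itup))
		{
			items[nitems].heapTid = itup->t_tid;
			items[nitems].indexOffset = off;
			nitems++;
		}
	}
	return nitems;
}
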
From 498723199f4b14ff9917aca13abf30f9ea261ca7 Mon Sep 17 00:00:00 2001
From: ashu 
Date: Sat, 1 Apr 2017 12:09:46 +0530
Subject: [PATCH] Rewrite hash index scans to work a page at a time v5

Patch by Ashutosh Sharma
---
 src/backend/access/hash/README   |   9 +-
 src/backend/access/hash/hash.c   | 140 ++-
 src/backend/access/hash/hashpage.c   |  17 +-
 src/backend/access/hash/hashsearch.c | 441 +++
 src/backend/access/hash/hashutil.c   |  29 ++-
 src/include/access/hash.h|  44 
 6 files changed, 509 insertions(+), 171 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 1541438..f0a7bdf 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -243,10 +243,11 @@ The reader algorithm is:
 -- then, per read request:
 	reacquire content lock on current page
 	step to next page if necessary (no chaining of content locks, but keep
-	 the pin on the primary bucket throughout the scan; we also maintain
-	 a pin on the page currently being scanned)
-	get tuple
-	release content lock
+	 the pin on the primary bucket throughout the scan)
+	save all the matching tuples from current index page into an items array
+	release pin and content lock (but if it is primary bucket page retain
+	its pin till the end of scan)
+	get tuple from an item array
 -- at scan shutdown:
 	release all pins still held
 
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index b835f77..bd2827a 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -268,66 +268,23 @@ bool
 hashgettuple(IndexScanDesc scan, ScanDirection dir)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Relation	rel = scan->indexRelation;
-	Buffer		buf;
-	Page		page;
 	OffsetNumber offnum;
-	ItemPointer current;
 	bool		res;
+	HashScanPosItem	*currItem;
 
 	/* Hash indexes are always lossy since we store only the hash code */
 	scan->xs_recheck = true;
 
 	/*
-	 * We hold pin but not lock on current buffer while outside the hash AM.
-	 * Reacquire the read lock here.
-	 */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-
-	/*
 	 * If we've already initialized this scan, we can just advance it in the
 	 * appropriate direction.  If we haven't done so yet, we call a routine to
 	 * get the first item in the scan.
 	 */
-	current = &(so->hashso_curpos);
-	if (ItemPointerIsValid(current))
+	if (!HashScanPosIsValid(so->currPos))
+		res = _hash_first(scan, dir);
+	else
 	{
 		/*
-		 * An insertion into the current index page could have happened while
-		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold a pin on the
-		 * primary bucket page, no deletions or splits could have occurred;
-		 * therefore we can expect that the TID still exists in the current
-		 * index page, at an offset >= where we were.)
-		 */
-		OffsetNumber maxoffnum;
-
-		buf = so->hashso_curbuf;
-		Assert(BufferIsValid(buf));
-		page = BufferGetPage(buf);
-
-		/*
-		 * We don't need test for old snapshot here as the current buffer is
-		 * pinned, so vacuum can't clean the page.
-		 */
-		maxoffnum = PageGetMaxOffsetNumber(page);
-		for (offnum = ItemPointerGetOffsetNumber(current);
-			 offnum <= maxoffnum;
-			 offnum = OffsetNumberNext(offnum))
-		{
-			IndexTuple	itup;
-
-			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-			if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
-				break;
-		}
-		if (offnum > maxoffnum)
-			elog(ERROR, "failed to re-find scan position within index \"%s\"",
-				 RelationGetRelationName(rel));
-		ItemPointerSetOffsetNumber(current, offnum);
-
-		/*
 		 * Check to see if we should kill the previously-fetched tuple.
 		 */
 		if (scan->kill_prior_tuple)
@@ -346,9 +303,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 
 			if (so->numKilled < MaxIndexTuplesPerPage)
 			{
-				so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
-

Re: [HACKERS] [POC] A better way to expand hash indexes.

2017-04-01 Thread Mithun Cy
On Sat, Apr 1, 2017 at 7:05 AM, Robert Haas  wrote:
> Hmm, don't the changes to contrib/pageinspect/expected/hash.out
> indicate that you've broken something?  The hash index has only 4
> buckets, so the new code shouldn't be doing anything differently, but
> you've got highmask and lowmask changing for some reason.

The highmask calculation has changed a bit to accommodate values of
num_buckets that are not powers of two.
If num_buckets is not a power of two, highmask should be the
immediately following ((2^x) - 1), and lowmask should be
(highmask >> 1) to cover the first half of the buckets. Generalizing
this changed the masks for power-of-two num_buckets relative to the
older implementation.

+ /* set highmask, which should be sufficient to cover num_buckets. */
+ metap->hashm_highmask = (1 << (_hash_log2(num_buckets))) - 1;

But this does not cause any adverse effect: the high and low masks are
still sufficient to compute the same mod. If we add one more bucket,
then in _hash_expandtable we immediately make the masks bigger:
if (new_bucket > metap->hashm_highmask)
{
	/* Starting a new doubling */
	metap->hashm_lowmask = metap->hashm_highmask;
	metap->hashm_highmask = new_bucket | metap->hashm_lowmask;
}

The state (metap->hashm_highmask == metap->hashm_maxbucket) occurs
naturally while the hash index is growing, just before a doubling.

Another choice I could have made is to bump the number by one, so that
power-of-two values of num_buckets get the same highmask as in the old
code, and non-power-of-two values get the immediately following
((2^x) - 1) as highmask:
+ /* set highmask, which should be sufficient to cover num_buckets. */
+ metap->hashm_highmask = (1 << (_hash_log2(num_buckets + 1))) - 1;

It was just a personal preference that I chose the first option, as it
appeared consistent with the running state of hash index expansion.
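
To illustrate the difference between the two choices, here is a small
self-contained example (hash_log2 is my stand-in for _hash_log2,
assumed to return the smallest x such that (1 << x) >= num):

#include <stdio.h>

static unsigned int
hash_log2(unsigned int num)
{
	unsigned int i;
	unsigned int limit;

	for (i = 0, limit = 1; limit < num; i++, limit <<= 1)
		;
	return i;
}

int
main(void)
{
	unsigned int nbuckets[] = {8, 10};
	int			j;

	for (j = 0; j < 2; j++)
	{
		unsigned int n = nbuckets[j];
		unsigned int high1 = (1 << hash_log2(n)) - 1;		/* option 1 */
		unsigned int high2 = (1 << hash_log2(n + 1)) - 1;	/* option 2 */

		printf("num_buckets=%u: option1 high/low=%u/%u, "
			   "option2 high/low=%u/%u\n",
			   n, high1, high1 >> 1, high2, high2 >> 1);
	}
	return 0;
}

For num_buckets = 10, both options give highmask = 15 and lowmask = 7;
only for exact powers of two such as 8 do they differ (7/3 for option
1 vs. the old code's 15/7, which option 2 preserves).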

Also adding a patch which implements the 2nd way.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


yet_another_expand_hashbucket_efficiently_15.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-04-01 Thread Amit Kapila
On Fri, Mar 31, 2017 at 11:54 PM, Pavan Deolasee
 wrote:
>
> On Fri, Mar 31, 2017 at 11:16 PM, Robert Haas  wrote:
>>
>>   Now, I understand you to be suggesting a flag at
>> table-creation time that would, maybe, be immutable after that, but
>> even then - are we going to run completely unmodified 9.6 code for
>> tables where that's not enabled, and only go through any of the WARM
>> logic when it is enabled?  Doesn't sound likely.  The commits already
>> made from this patch series certainly affect everybody, and I can't
>> see us adding switches that bypass
>> ce96ce60ca2293f75f36c3661e4657a3c79ffd61 for example.
>
>
> I don't think I am going to claim that either. But probably only 5% of the
> new code would then be involved. Which is a lot less and a lot more
> manageable. Having said that, I think if we at all do this, we should only
> do it based on our experiences in the beta cycle, as a last resort. Based on
> my own experiences during HOT development, long-running pgbench tests, with
> several concurrent clients, subjected to multiple AV cycles and periodic
> consistency checks, usually bring up issues related to heap corruption. So
> my confidence level is relatively high on that part of the code. That's not
> to suggest that there can't be any bugs.
>
> Obviously then there are other things such as regression to some workload or
> additional work required by vacuum etc. And I think we should address them
> and I'm fairly certain we can do that. It may not happen immediately, but if
> we provide the right knobs, maybe those who are affected can fall back to the
> old behaviour or not use the new code at all while we improve things for
> them.
>

Okay, but even if we want to provide knobs, there should be some
consensus on them.  I am sure introducing an additional pass over the
index has some impact, so either we should have some way to reduce
that impact or some additional design to handle it.  Do you think it
makes sense to start a separate thread to discuss this and get
feedback, as I am not seeing much input on the knobs you are proposing
to handle the second pass over the index?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers