[GENERAL] How to change order sort of table in HashJoin

2016-11-18 Thread Man Trieu
Hi Experts,

As in the example below, I think the planner should choose the plan in which
the hash table is built on testtbl2 (the table with fewer tuples), because
building the hash table on testtbl2 should be faster. But it did not.

I have tried to change the table order by tuning parameters, and even by
using pg_hint_plan, but without success.

Why does the planner not choose the plan in which the hash table is built on
testtbl2 (which should take less time)?
And how can I change the order?

# I also inspected the planner internals by rebuilding PostgreSQL, but found no
# useful information related to the hash table choice.

---
postgres=# create table testtbl1(id integer, c1 text, c2 text, c3 text, primary key (c1,c2,c3));
CREATE TABLE
postgres=# create table testtbl2(id integer, c1 text, c2 text, c3 text, primary key (c1,c2,c3));
CREATE TABLE
postgres=# insert into testtbl1 select generate_series(1,1000000), random()::text, random()::text, random()::text;
INSERT 0 1000000
postgres=# insert into testtbl2 select * from testtbl1 where id%7 = 0;
INSERT 0 142857

postgres=# explain analyze select * from testtbl1 inner join testtbl2 using(c1,c2,c3);
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=38775.00..47171.72 rows=1 width=59) (actual time=1120.824..1506.236 rows=142857 loops=1)
   Hash Cond: ((testtbl2.c1 = testtbl1.c1) AND (testtbl2.c2 = testtbl1.c2) AND (testtbl2.c3 = testtbl1.c3))
   ->  Seq Scan on testtbl2  (cost=0.00..3039.57 rows=142857 width=56) (actual time=0.008..27.964 rows=142857 loops=1)
   ->  Hash  (cost=21275.00..21275.00 rows=1000000 width=55) (actual time=1120.687..1120.687 rows=1000000 loops=1)
         Buckets: 131072  Batches: 1  Memory Usage: 89713kB
         ->  Seq Scan on testtbl1  (cost=0.00..21275.00 rows=1000000 width=55) (actual time=0.035..458.522 rows=1000000 loops=1)
 Planning time: 0.922 ms
 Execution time: 1521.258 ms
(8 rows)

postgres=# set pg_hint_plan.enable_hint to on;
SET
postgres=# /*+
postgres*# HashJoin(testtbl1 testtbl2)
postgres*# Leading(testtbl1 testtbl2)
postgres*# */
postgres-# explain analyze select * from testtbl1 inner join testtbl2 using(c1,c2,c3);
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=48541.00..67352.86 rows=1 width=59) (actual time=1220.625..1799.709 rows=142857 loops=1)
   Hash Cond: ((testtbl2.c1 = testtbl1.c1) AND (testtbl2.c2 = testtbl1.c2) AND (testtbl2.c3 = testtbl1.c3))
   ->  Seq Scan on testtbl2  (cost=0.00..3039.57 rows=142857 width=56) (actual time=0.011..58.649 rows=142857 loops=1)
   ->  Hash  (cost=21275.00..21275.00 rows=1000000 width=55) (actual time=1219.295..1219.295 rows=1000000 loops=1)
         Buckets: 8192  Batches: 32  Memory Usage: 2851kB
         ->  Seq Scan on testtbl1  (cost=0.00..21275.00 rows=1000000 width=55) (actual time=0.021..397.583 rows=1000000 loops=1)
 Planning time: 3.971 ms
 Execution time: 1807.710 ms
(8 rows)

postgres=#
---
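If I read the pg_hint_plan documentation correctly, the flat Leading(a b) form only
fixes the join order, while the double-parenthesized form Leading((outer inner)) also
fixes which relation becomes the inner (hashed) side. A sketch of asking for the hash
on testtbl2, which I have not verified here:

/*+
  HashJoin(testtbl1 testtbl2)
  Leading((testtbl1 testtbl2))
*/
explain analyze select * from testtbl1 inner join testtbl2 using(c1,c2,c3);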


Thanks and best regards!


[GENERAL] Partial update on a postgres upsert violates constraint

2016-11-18 Thread Andreas Terrius
Hi,
Basically I wanted to do a partial update inside pg (9.5), but it seems
that a partial update fails when not all of the constraints are fulfilled
(such as a NOT NULL constraint).

Below are the sql queries I used,

CREATE TABLE jobs (
id integer PRIMARY KEY,
employee_name TEXT NOT NULL,
address TEXT NOT NULL,
phone_number TEXT);
CREATE OR REPLACE FUNCTION upsert_job(job JSONB)
RETURNS VOID AS $$
BEGIN
    INSERT INTO jobs AS origin VALUES(
        (job->>'id')::INTEGER,
        job->>'employee_name'::TEXT,
        job->>'address'::TEXT,
        job->>'phone_number'::TEXT)
    ON CONFLICT (id) DO UPDATE SET
        employee_name = COALESCE(EXCLUDED.employee_name, origin.employee_name),
        address = COALESCE(EXCLUDED.address, origin.address),
        phone_number = COALESCE(EXCLUDED.phone_number, origin.phone_number);
END;
$$ LANGUAGE PLPGSQL SECURITY DEFINER;

-- Full insert (OK)
SELECT upsert_job('{"id" : 1, "employee_name" : "AAA", "address" : "City, x street no.y", "phone_number" : "123456789"}'::jsonb);

-- Partial update that fulfills the constraints (OK)
SELECT upsert_job('{"id" : 1, "employee_name" : "BBB", "address" : "City, x street no.y"}'::jsonb);

-- Partial update that doesn't fulfill the constraints (FAILS)
SELECT upsert_job('{"id" : 1, "phone_number" : "12345"}'::jsonb);
-- ERROR:  null value in column "employee_name" violates not-null constraint
-- DETAIL:  Failing row contains (1, null, null, 12345).

I also tried explicitly stating the columns that I wanted to insert, and that
also fails. How do I work around this?
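
One direction I could take instead (a sketch, untested, and it gives up ON CONFLICT's
race-free behavior, since two concurrent calls for a brand-new id could still collide
on the primary key) is to try the UPDATE first and only INSERT when no row was found:

CREATE OR REPLACE FUNCTION upsert_job(job JSONB)
RETURNS VOID AS $$
BEGIN
    -- partial update: only the keys present in the JSON overwrite existing values
    UPDATE jobs SET
        employee_name = COALESCE(job->>'employee_name', employee_name),
        address       = COALESCE(job->>'address', address),
        phone_number  = COALESCE(job->>'phone_number', phone_number)
    WHERE id = (job->>'id')::INTEGER;

    IF NOT FOUND THEN
        -- brand-new row: the NOT NULL columns still have to be supplied here
        INSERT INTO jobs VALUES (
            (job->>'id')::INTEGER,
            job->>'employee_name',
            job->>'address',
            job->>'phone_number');
    END IF;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;

With that, the failing case above becomes a plain partial UPDATE of row 1, but I would
still prefer a single-statement upsert if there is one.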

Thank you


Re: [GENERAL] Avoiding double-counting in aggregates with more than one join?

2016-11-18 Thread David G. Johnston
On Fri, Nov 18, 2016 at 10:16 AM, Paul Jungwirth <
p...@illuminatedcomputing.com> wrote:

> But is there a better way?


Nothing that would be more readable, nor likely more performant.

When performing aggregation it is necessary to limit the scope of the query
to only whatever it is you are calculating.  Since you wish to compute two
things you need two separate parts plus a third to combine them.

If performance is a concern you should move the aggregation queries
directly into the main query instead of using the optimization-fencing CTEs.

SELECT
FROM products
LEFT JOIN (
SELECT sum()
) s USING (product_id)
LEFT JOIN (
SELECT sum()
) r USING (product_id)
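
Filled in against your tables it would look something like this (just a sketch;
since products exposes "id" rather than "product_id", the joins below use ON
instead of USING):

SELECT p.id,
       COALESCE(s.total_sold, 0)  AS total_sold,
       COALESCE(r.total_spent, 0) AS total_spent
FROM products p
LEFT JOIN (
    SELECT product_id, SUM(price * qty) AS total_sold
    FROM sales
    GROUP BY product_id
) s ON s.product_id = p.id
LEFT JOIN (
    SELECT product_id, SUM(price * qty) AS total_spent
    FROM resupplies
    GROUP BY product_id
) r ON r.product_id = p.id;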

If the second "scope" doesn't need to be calculated but simply informs the
one-and-only aggregate, you should use a SEMI JOIN (EXISTS) instead of an
INNER/LEFT JOIN.  But that isn't what you have here.

David J.


[GENERAL] Avoiding double-counting in aggregates with more than one join?

2016-11-18 Thread Paul Jungwirth

Hi All,

I've noticed in the past that doing aggregates while joining to more 
than one table can sometimes give you unintended results. For example, 
suppose I have three tables: products, sales, and resupplies. In sales I 
track what I sell, and in resupplies I track my own purchases to 
increase inventory. Both have a foreign key to products. Now I want to 
run a report showing the total dollars sold for each product versus the 
total dollars spent for each product. I could try this:


SELECT  p.id,
        SUM(s.price * s.qty) AS total_sold,
        SUM(r.price * r.qty) AS total_spent
FROM    products p
LEFT OUTER JOIN sales s
    ON  s.product_id = p.id
LEFT OUTER JOIN resupplies r
    ON  r.product_id = p.id
GROUP BY p.id
;

That seems pretty safe, but actually I get bad answers,
for example if some product has this data:

sales
-----
sold 1 @ $2/ea

resupplies
----------
bought 1 @ $1/ea
bought 2 @ $1/ea

Then pre-grouping I have this:

p.id | s.qty | s.price | r.qty | r.price
-----+-------+---------+-------+---------
   1 |     1 |      $2 |     1 |      $1
   1 |     1 |      $2 |     2 |      $1

You can see the problem is that I'm going to double-count my sales.
What I really want is this:

p.id | s.qty | s.price | r.qty | r.price
-----+-------+---------+-------+---------
   1 |     1 |      $2 |     1 |      $1
   1 |       |         |     2 |      $1

In the past I've always handled these situations by aggregating each
table separately and only then joining things together:

WITH
s AS (
  SELECT  product_id,
          SUM(price * qty) AS total_sold
  FROM    sales
  GROUP BY product_id
),
r AS (
  SELECT  product_id,
          SUM(price * qty) AS total_spent
  FROM    resupplies
  GROUP BY product_id
)
SELECT  p.id,
COALESCE(s.total_sold, 0),
COALESCE(r.total_spent, 0)
FROM    products p
LEFT OUTER JOIN s
ON  s.product_id = p.id
LEFT OUTER JOIN r
ON  r.product_id = p.id
;

Since I've guaranteed that each CTE includes at most one row per product,
this is safe from double-counting errors.

But is there a better way? My approach feels verbose
and harder to read. Also you have to type COALESCE a lot. :-)
Is there some different way of doing things I haven't thought of yet?

Also I wonder about the performance of merging all these subqueries together.
Would the final merging be any faster if I had an ORDER BY in each CTE?

It seems like this pattern comes up a lot;
what have others done about it?

Thanks,
Paul




Re: [GENERAL] Trim performance on 9.5

2016-11-18 Thread William Ivanski
I just ran EXPLAIN ANALYZE, please see the images attached. The field doesn't
have an index.

On Fri, Nov 18, 2016 at 12:16, vinny  wrote:

> On 2016-11-18 15:06, William Ivanski wrote:
> > Hi,
> >
> > I recently made major performance improvements to our routines by
> > simply removing calls to trim functions at specific bottlenecks.
> > Please see images attached for a simple example.
> >
> > I'm using PostgreSQL version 9.5.5-1.pgdg80+1 on Debian 8.6. Does anyone
> > know if it's a bug in the trim function? Thanks in advance.
> >
> > --
> >
> > William Ivanski
>
> Did you run EXPLAIN on these queries?
>
> I'm guessing that you have an index on the field, but not on
> TRIM(field),
> which would mean that the database is forced to seqscan to fetch every
> row value, trim it and then compare it.
>
-- 

William Ivanski



Re: [GENERAL] pgbench and scaling

2016-11-18 Thread Vick Khera
On Thu, Nov 17, 2016 at 8:08 PM, Rakesh Kumar
 wrote:
> I noticed that as I scale from 5 to 10 to 20 to 40, the TPS starts falling
> almost linearly:
>
> with 5, TPS was 639
> with 10, TPS was down to 490
> with 20, TPS was down to 280
> and so on.

Are the TPS numbers per pgbench? If so, then you're getting
10 x 490 = 4,900 TPS system-wide, or 20 x 280 = 5,600 TPS system-wide.

If those are total TPS numbers then you should check your disk
utilization and CPU utilization and see where your bottleneck is.




Re: [GENERAL] Streaming replication failover/failback

2016-11-18 Thread Jehan-Guillaume de Rorthais
On Thu, 17 Nov 2016 08:26:59 -0900
Israel Brewster  wrote:

> > On Nov 16, 2016, at 4:24 PM, Adrian Klaver 
> > wrote:
> > 
> > On 11/16/2016 04:51 PM, Israel Brewster wrote:  
> >> I've been playing around with streaming replication, and discovered that
> >> the following series of steps *appears* to work without complaint:
> >> 
> >> - Start with master on server A, slave on server B, replicating via
> >> streaming replication with replication slots.
> >> - Shut down master on A
> >> - Promote slave on B to master
> >> - Create recovery.conf on A pointing to B
> >> - Start (as slave) on A, streaming from B
> >> 
> >> After those steps, A comes up as a streaming replica of B, and works as
> >> expected. In my testing I can go back and forth between the two servers
> >> all day using the above steps.
> >> 
> >> My understanding from my initial research, however, is that this
> >> shouldn't be possible - I should need to perform a new basebackup from B
> >> to A after promoting B to master before I can restart A as a slave. Is
> >> the observed behavior then just a "lucky fluke" that I shouldn't rely  
> > 
> > You don't say how active the database is, but I'm going to say it is not
> > active enough for the WAL files on B to go out of scope for A in the time
> > it takes you to do the switch over.
> 
> Yeah, not very - this was just in testing, so essentially no activity. So
> between your response and the one from Jehan-Guillaume de Rorthais, what I'm
> hearing is that my information about the basebackup being needed was
> obsoleted with the patch he linked to, and as long as I do a clean shutdown
> of the master, and don't do too much activity on the *new* master before
> bringing the old master up as a slave (such that WAL files are lost)

Just set up WAL archiving to avoid this (and get PITR backups as a side effect).
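
A minimal postgresql.conf sketch of that, using the archive_command example from
the PostgreSQL docs (the path is just a placeholder; a production setup wants
something more robust):

archive_mode = on
archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
# wal_level is already sufficient when streaming replication is in use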




Re: [GENERAL] Trim performance on 9.5

2016-11-18 Thread vinny

On 2016-11-18 15:06, William Ivanski wrote:

Hi,

I recently made major performance improvements to our routines by
simply removing calls to trim functions at specific bottlenecks.
Please see images attached for a simple example.

I'm using PostgreSQL version 9.5.5-1.pgdg80+1 on Debian 8.6. Does anyone
know if it's a bug in the trim function? Thanks in advance.

--

William Ivanski


Did you run EXPLAIN on these queries?

I'm guessing that you have an index on the field, but not on 
TRIM(field),
which would mean that the database is forced to seqscan to fetch every 
row value, trim it and then compare it.
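
If that guess is right, an expression index would let such a comparison use an
index again. A sketch with made-up names (substitute your real table and column
for mytable/field):

CREATE INDEX mytable_trim_idx ON mytable (trim(field));

-- a query of this shape can then use the index:
SELECT * FROM mytable WHERE trim(field) = 'some value';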





[GENERAL] Trim performance on 9.5

2016-11-18 Thread William Ivanski
Hi,

I recently made major performance improvements to our routines by simply
removing calls to trim functions at specific bottlenecks. Please see
images attached for a simple example.

I'm using PostgreSQL version 9.5.5-1.pgdg80+1 on Debian 8.6. Does anyone know
if it's a bug in the trim function? Thanks in advance.

-- 

William Ivanski



Re: [GENERAL] How the Planner in PGStrom differs from PostgreSQL?

2016-11-18 Thread Kouhei Kaigai
> On Thu, Nov 17, 2016 at 7:09 PM, Mark Anns  wrote:
> >  Can you explain this statement: "check whether the scan qualifier can be
> > executable on GPU device"?
> >
> > What are the scan qualifiers?
> >
> > How do you determine whether they are device-executable or not?
> >
> > The cost estimates are entirely based on the number of rows and the type of scan.
> > Then it will be the same for both CPU and GPU. How can the decision be
> > made for the cheaper one when comparing CPU and GPU estimates?
> 
> There is a parameter (call it a "factor" if you will) called cpu_tuple_cost
> (section 19.7.2 Planner Cost Constants in the PostgreSQL documentation;
> link at the bottom), which "sets the planner's estimate of the cost of
> processing each row during a query"
> as the description on that page says.
>
> With that as the unit of cost of processing rows using the CPU and considering
> any differences in the processing capabilities between CPU and GPU, the
> optimizer code will cost the portion of the plan that will be processed by the
> GPU (typically a table scan or a join) as some fraction of the cost of
> executing the same portion of the plan using the traditional CPU processing.
> 
> (Kaigai-san will be better able to explain and correct if the above rough
> sketch is not exactly accurate)
>
Yes, that is the right introduction.

PG-Strom assumes the GPU can run the functions/operators within a scan qualifier
more effectively than the CPU, but has a higher startup cost (a constant) and
extra data copies over the PCI-E bus (another factor based on width x number of
rows). These factors make a difference in the cost of the individual scan/join
paths, and the planner then chooses the appropriate one.

Thanks,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei 



Re: [GENERAL] How the Planner in PGStrom differs from PostgreSQL?

2016-11-18 Thread Kouhei Kaigai
>  Can you explain this statement: "check whether the scan qualifier can be
> executable on GPU device"?
> 
> What are the scan qualifiers?
> 
SELECT * FROM my_table WHERE x > 20 AND memo LIKE '%abc%';

Here, the WHERE clause is the scan qualifier.

> How do you determine whether they are device-executable or not?
>
They are device-executable if all of the functions (and the functions behind
the operators) can be transformed into GPU source code. Please see codegen.c.

> The cost estimates are entirely based on the number of rows and the type of scan.
> Then it will be the same for both CPU and GPU. How can the decision be made
> for the cheaper one when comparing CPU and GPU estimates?
>
If a scan path has any scan qualifier, the cost to evaluate it depends on
the device type. PG-Strom assumes the GPU has a larger startup cost but
a lower cost per tuple. So, the GpuScan path tends to be chosen when the
number of rows is relatively large.
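
As a purely illustrative example (the numbers are made up, not PG-Strom's actual
cost constants): if the CPU path costs 0.01 per tuple with no startup overhead and
the GPU path costs 4000 startup plus 0.001 per tuple, then at 100,000 rows the CPU
path is cheaper (1,000 vs 4,100), while at 10,000,000 rows the GPU path is cheaper
(100,000 vs 14,000). The planner simply picks whichever estimated total is lower.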

Thanks,
--
The PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei 




[GENERAL] Feature request: separate logging

2016-11-18 Thread otheus uibk
A glaring weakness in Postgresql for production systems is that the
administrator has no way of controlling what types of logs go where. There
are at least two types of logs: errors and statement logs. (I could also
add: connection, syntax error, query duration, audit). It has become
increasingly important in the Linux world, with its over-reliance on
systemd, that we admins be able to distinguish between these log outputs.
Systemd wants to control all the daemons and services run on a Linux host.
To do this effectively, it needs to capture the stdout and stderr or
possibly parse the external logfile of the daemon it spawns. The benefits
of systemd's journald subsystem include being able to automatically
identify daemons that are failing to start properly. (The merits of systemd
are beyond the scope of this discussion; it is now the all-but-ubiquitous
standard of linux distributions, and has become nearly intractable).

The Postgresql startup process could greatly benefit from systemd. As it is
now, PGDG distributes postgresql with init scripts which really do very
little to check if postgresql bombed shortly after startup. I have improved
upon that script to do some smart detection of a failed startup, but the
result is having two different files to monitor. On the one hand, I want to
use postgresql's built-in file or CSV logging. On the other, I need to
check the output of the startup process. These logs have different formats,
especially for the smart DBA who wants to customize the log prefix with a
more usable timestamp and other fields. Logging to external files is nice
because postgresql rotates the logs for us automatically. (External log
rotation is problematic and risky because it requires a configuration
reload which may lead to undesirable side-effects. Right?)

One alternative is to capture everything into the local logging system, or
to send all to stdout/stderr and capture this with journald (or runit's
svlogd or something). But then you have the following problem: if I enable
statement-level logging or set log_min_duration_statement=0, the in-memory
journald will quickly be overrun with statement logs. The journald
subsystem will then become useless to the other daemons and subsystems.

One solution is proposed in these forums:
https://www.postgresql.org/message-id/flat/etPan.53397e77.643c9869.1a0%40palos#etPan.53397e77.643c9869.1a0@palos
While pgbadger has some nice features, it doesn't really solve the problem
of allowing postgresql to be used with systemd+journald.

What I do today is to configure postgresql to write csvlogs. Stdout/stderr
are captured by journald. A custom perl script with the Text::CSV module
and tail -F semantics continuously processes the csvlog file, ignores
query, dml, and detail log lines,  and sends the rest via syslog() (which
journald then handles).
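
For reference, the csvlog side of that setup corresponds roughly to this
postgresql.conf fragment (the values are just the ones I would start from,
not a recommendation):

logging_collector = on
log_destination = 'csvlog'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_rotation_age = 1d
log_min_duration_statement = 0   # only if per-statement durations are wanted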

It's not the right way.

I would like to see postgresql have the ability to:

   1. Write to a csvlog with one set of selectors.
   2. Write to stdout/stderr with a different set of selectors (no statements, no
autovacuum, etc.) using a purely line-oriented output that

   2.1. has the kind of detail contained in the CSV.  Currently, the
log-prefix option does not offer some of the information provided in the
CSV logs. Really, the CSV log should simply be an implementation of the
log-prefix.
   2.2. collapses multi-line queries into one line (newlines and tabs
are escaped with backslashes or the x1B character).

Finally, if these changes can be implemented, is it impossible to backport
them to prior versions, say 9.1 and up? If I wrote a patch, under what
conditions would the patch be accepted for inclusion in official releases
of older versions?


-- 
Otheus
otheus.u...@gmail.com
otheus.shell...@uibk.ac.at


Re: [GENERAL] Full text search tsv column aproach vs concat confusion

2016-11-18 Thread Artur Zakirov

Hi,

On 16.11.2016 11:54, cen wrote:

Hi

I am seeking some clarification in regard to full text search across
multiple tables and what the best approach is. The documentation talks about
two approaches when it comes to building a document: on-the-fly concat
of columns and a dedicated tsv column approach. Let's say I want to
perform a search for "txt1 & txt2 & txt3" on columns table1.col1,
table1.col2 and table2.col1. I see the following solutions:

1. Concat all three into a document and perform a FTS.

SELECT * FROM (
    SELECT to_tsvector(table1.col1) || to_tsvector(table1.col2) || to_tsvector(table2.col1) AS document
    FROM table1 LEFT JOIN table2 ON table1.table2_id = table2.id
) subquery
WHERE subquery.document @@ to_tsquery(unaccent(?));

2. Create a tsv column in each table, concat the tsv columns and perform
FTS on that.

SELECT * FROM table1 LEFT JOIN table2 ON table1.table2_id = table2.id
WHERE table1.tsv || table2.tsv @@ to_tsquery(unaccent(?));

3. Have a tsv column only in table1 and insert table2.col1 into the tsv
via triggers. Works but seems very hacky.

It seems to me that option #2 is fast and easy to implement but I am
not sure what the concat of tsvs really means from an index usage and
performance standpoint. Option #1 is the most flexible and I'd use that
all the time if it was not THAT much slower than the tsv column approach.
The documentation on tsv columns states: "Another advantage is that
searches will be faster, since it will not be necessary to redo the
to_tsvector calls to verify index matches."

The question is, how much faster are tsv columns really? Are there any
benchmarks about this? If the performance difference is negligible I'd
advocate that using tsv columns is a waste of time and space in most
general cases. But since there is no information on how much faster they
are, it's hard to decide.



I don't have any such benchmarks. But if you have a real database, you can
test your solutions against it, since it depends on your task and what you
need.


By the way, I suppose it is better to use the COALESCE() function if your
columns can contain NULL values:


SELECT * FROM (
    SELECT to_tsvector(coalesce(table1.col1,'')) ||
           to_tsvector(coalesce(table1.col2,'')) ||
           to_tsvector(coalesce(table2.col1,'')) AS document
    FROM table1
    LEFT JOIN table2 ON table1.table2_id = table2.id
) subquery WHERE subquery.document @@ to_tsquery(unaccent(?));

And specifying a text search configuration makes queries a little bit 
faster:


... to_tsvector('english', coalesce(table1.col1,'')) ...
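
And if you do settle on the dedicated tsv column approach (#2), a rough sketch
of keeping such a column indexed and maintained for table1's own columns (table
and column names taken from your example; a cross-table tsv as in option #3
would still need a hand-written trigger):

ALTER TABLE table1 ADD COLUMN tsv tsvector;
UPDATE table1 SET tsv = to_tsvector('english', coalesce(col1,'') || ' ' || coalesce(col2,''));
CREATE INDEX table1_tsv_idx ON table1 USING GIN (tsv);
CREATE TRIGGER table1_tsv_update
    BEFORE INSERT OR UPDATE ON table1
    FOR EACH ROW EXECUTE PROCEDURE
    tsvector_update_trigger(tsv, 'pg_catalog.english', col1, col2);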

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

