Re: [GENERAL] query not scaling

2017-10-31 Thread Rob Sargent

On 10/31/2017 03:12 AM, Laurenz Albe wrote:

Rob Sargent wrote:

I think your biggest problem is the join condition
 on m.basepos between s.startbase and s.endbase

That forces a nested loop join, which cannot be performed efficiently.

Agree! 800,000 * 4,000 = 3,200,000,000.  It's just that I thought I had
corralled that problem with indexing, but apparently not.  I was hoping
some kind soul might point out a missing index or similar correction. I
have completely reworked the process, but am not sure yet if it's correct.
(The slow answer is correct, once it comes in.)

You can create indexes that are useful for this query:

CREATE INDEX ON sui.segment (chrom, markerset_id);
CREATE INDEX ON sui.probandset (people_id);

But that probably won't make a big difference, because the sequential
scans take only a small fraction of your query time.

A little less than half of the query time is spent in the nested loop
join, and a little more than half of the time is spent doing the
GROUP BY.

Perhaps the biggest improvement you can easily make would be to
get rid of "numeric" for the computation.  I suspect that this is
where a lot of time is spent, since the hash aggregate is over
less than 15000 rows.

Unless you really need the precision of "numeric", try

CREATE OR REPLACE FUNCTION pv(l bigint, e bigint, g bigint, o int)
RETURNS double precision LANGUAGE sql AS
$$SELECT (g+e+o)::double precision / (l+e+g+o)::double precision$$;

Yours,
Laurenz Albe
In practice markersets are always aligned with one chromosome so I would 
not expect this to have an effect.  There's no constraint on this 
however, and there can be more than one markerset per chromosome.  I 
have played with indexing on segment.markerset_id.


In all the data sets used in the examples (runtimes, explains, etc.) there
has been a single people_id across the existing segment data.  Down
the road this of course will not be the case, and I can see the value of
an index on probandset.people_id eventually; I can certainly add it now
for a test.  I'm currently writing a probandset loader, hoping to get a
test case for the problem with gin indexing mentioned up-thread.


I think I'm most surprised at the notion that the arithmetic is the
problem, and will happily test your suggestion to force floating-point
values.  The value can get small (10^-12 on a good day!) but we don't
need many digits of precision.
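
A quick way to test that in isolation might be something like this
sketch, run against generate_series rather than the real tables:

EXPLAIN ANALYZE
SELECT sum(1.0 * (i + 2 + 0) / (100 + 2 + i + 0))              -- numeric
FROM generate_series(1, 1000000) AS t(i);

EXPLAIN ANALYZE
SELECT sum((i + 2 + 0)::double precision / (100 + 2 + i + 0))  -- float
FROM generate_series(1, 1000000) AS t(i);

If numeric really is the cost, the first form should be noticeably slower.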


Thanks





Re: [GENERAL] query not scaling

2017-10-30 Thread Laurenz Albe
On Thu, 2017-10-26 at 19:07 -0600, Rob Sargent wrote:
>    ->  Nested Loop  (cost=3799.40..44686205.23 rows=1361304413 width=40) (actual time=55.443..89684.451 rows=75577302 loops=1)
>
>      ->  Hash Join  (cost=3798.98..43611.56 rows=823591 width=32) (actual time=55.393..1397.509 rows=823591 loops=1)
>
>      ->  Index Scan using marker_chrom_basepos_idx on base.marker m  (cost=0.42..37.67 rows=1653 width=20) (actual time=0.010..0.075 rows=92 loops=823591)
>            Index Cond: ((m.chrom = 22) AND (m.basepos >= s.startbase) AND (m.basepos <= s.endbase))

I think your biggest problem is the join condition
   on m.basepos between s.startbase and s.endbase

That forces a nested loop join, which cannot be performed efficiently.
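
One thing that can give the containment test index support is a GiST
index over the base span (a sketch, untested; it assumes startbase and
endbase are bigint, and it needs the btree_gist extension so the uuid
equality can share the index):

CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE INDEX segment_span_gist ON segment
    USING gist (markerset_id, int8range(startbase, endbase, '[]'));

-- with the join condition rewritten to match the indexed expression:
--   ... AND s.markerset_id = suppliedMarkersetId
--       AND int8range(s.startbase, s.endbase, '[]') @> m.basepos::bigint

The join is still a nested loop, but each marker can then probe the
GiST index instead of filtering all candidate segments.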

Yours,
Laurenz Albe




Re: [GENERAL] query not scaling

2017-10-27 Thread Merlin Moncure
On Thu, Oct 26, 2017 at 10:01 AM, Tom Lane  wrote:
> Laurenz Albe  writes:
>> Also, to have PostgreSQL inline the function, which would be good
>> for performance, it should be declared IMMUTABLE.
>
> Actually, if you hope to have a SQL function be inlined, it's better
> not to decorate it at all --- not with IMMUTABLE, and not with STRICT
> either.  Both of those restrict the parser's ability to inline unless
> it can prove the contained expression is equally immutable/strict.
> With the default attributes of volatile/not strict, there's nothing
> to prove.

This is extremely obnoxious.  Is it possible to raise a warning on
function creation?

> (In any case, it's usually easy enough to tell from EXPLAIN output
> whether inlining has happened.)

No it isn't.  The explain syntax is arcane and inlining as a general
concept is only very indirectly expressed.  I really think we ought to
do better here; I was not able to find any treatment of inlining given
in the 'Performance Tips' or the 'Functions and Operators' section, or
anywhere really (except the wiki).  This is really a disservice to the
users, I think.

merlin




Re: [GENERAL] query not scaling

2017-10-26 Thread Rob Sargent


On 10/26/2017 09:01 AM, Tom Lane wrote:

Laurenz Albe  writes:

Also, to have PostgreSQL inline the function, which would be good
for performance, it should be declared IMMUTABLE.

Actually, if you hope to have a SQL function be inlined, it's better
not to decorate it at all --- not with IMMUTABLE, and not with STRICT
either.  Both of those restrict the parser's ability to inline unless
it can prove the contained expression is equally immutable/strict.
With the default attributes of volatile/not strict, there's nothing
to prove.

(In any case, it's usually easy enough to tell from EXPLAIN output
whether inlining has happened.)

regards, tom lane


In another instance of the same schema, in the same database as the original
slow execution, I've loaded 823591 segments (though in this case all of
them are on one chromosome, one markerset) and 65K proband sets, using
the same marker table as the slow(est) query.  In the fastest run, there
are only 46K segments for the given markerset.



QUERY PLAN
--
 HashAggregate  (cost=82122076.59..8215.35 rows=14876 width=48) (actual time=208203.091..208210.348 rows=14645 loops=1)
   Output: m.id, min(((1.0 * (((s.events_greater + s.events_equal) + 0))::numeric) / ((((s.events_less + s.events_equal) + s.events_greater) + 0))::numeric))
   Group Key: m.id
   Buffers: shared hit=43209090
   ->  Nested Loop  (cost=3799.40..44686205.23 rows=1361304413 width=40) (actual time=55.443..89684.451 rows=75577302 loops=1)
         Output: m.id, s.events_greater, s.events_equal, s.events_less
         Buffers: shared hit=43209090
         ->  Hash Join  (cost=3798.98..43611.56 rows=823591 width=32) (actual time=55.393..1397.509 rows=823591 loops=1)
               Output: s.events_greater, s.events_equal, s.events_less, s.startbase, s.endbase
               Inner Unique: true
               Hash Cond: (s.probandset_id = p.id)
               Buffers: shared hit=19222
               ->  Seq Scan on sui.segment s  (cost=0.00..29414.86 rows=823591 width=48) (actual time=0.017..669.915 rows=823591 loops=1)
                     Output: s.id, s.chrom, s.markerset_id, s.probandset_id, s.startbase, s.endbase, s.firstmarker, s.lastmarker, s.events_less, s.events_equal, s.events_greater
                     Filter: ((s.chrom = 22) AND (s.markerset_id = 'edf95066-24d2-4ca1-bad6-aa850cc82fef'::uuid))
                     Buffers: shared hit=17061
               ->  Hash  (cost=2979.99..2979.99 rows=65519 width=16) (actual time=55.272..55.272 rows=65519 loops=1)
                     Output: p.id
                     Buckets: 65536  Batches: 1  Memory Usage: 3584kB
                     Buffers: shared hit=2161
                     ->  Seq Scan on sui.probandset p  (cost=0.00..2979.99 rows=65519 width=16) (actual time=0.007..33.582 rows=65519 loops=1)
                           Output: p.id
                           Filter: (p.people_id = '9b2308b1-9659-4a2c-91ae-8f95cd0a90b3'::uuid)
                           Buffers: shared hit=2161
         ->  Index Scan using marker_chrom_basepos_idx on base.marker m  (cost=0.42..37.67 rows=1653 width=20) (actual time=0.010..0.075 rows=92 loops=823591)
               Output: m.id, m.name, m.chrom, m.basepos, m.alleles
               Index Cond: ((m.chrom = 22) AND (m.basepos >= s.startbase) AND (m.basepos <= s.endbase))
               Buffers: shared hit=43189868
 Planning time: 0.764 ms
 Execution time: 208214.816 ms
(30 rows)



Re: [GENERAL] query not scaling

2017-10-26 Thread Rob Sargent



On 10/26/2017 09:01 AM, Tom Lane wrote:

Laurenz Albe  writes:

Also, to have PostgreSQL inline the function, which would be good
for performance, it should be declared IMMUTABLE.

Actually, if you hope to have a SQL function be inlined, it's better
not to decorate it at all --- not with IMMUTABLE, and not with STRICT
either.  Both of those restrict the parser's ability to inline unless
it can prove the contained expression is equally immutable/strict.
With the default attributes of volatile/not strict, there's nothing
to prove.

(In any case, it's usually easy enough to tell from EXPLAIN output
whether inlining has happened.)

regards, tom lane

As to the explain analyze,

   could not receive data from server: Connection timed out
   Time: 7877340.565 ms

for the second time.  I had presumed at first that this had occurred 
during a network burp.  I'll try running it directly on the pg host.
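
Another option might be auto_explain, so the plan survives even if the
client connection drops again (a sketch; auto_explain is a contrib
module and LOAD requires superuser):

LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;   -- log a plan for every statement
SET auto_explain.log_analyze = true;     -- include actual times and row counts
SET auto_explain.log_buffers = true;     -- include buffer usage

-- Then run the query; the plan lands in the server log whether or not
-- the client survives to receive the result.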




Re: [GENERAL] query not scaling

2017-10-26 Thread Tom Lane
Laurenz Albe  writes:
> Also, to have PostgreSQL inline the function, which would be good
> for performance, it should be declared IMMUTABLE.

Actually, if you hope to have a SQL function be inlined, it's better
not to decorate it at all --- not with IMMUTABLE, and not with STRICT
either.  Both of those restrict the parser's ability to inline unless
it can prove the contained expression is equally immutable/strict.
With the default attributes of volatile/not strict, there's nothing
to prove.

(In any case, it's usually easy enough to tell from EXPLAIN output
whether inlining has happened.)
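
For example (a sketch, using the pv() function from this thread):

EXPLAIN (VERBOSE)
SELECT pv(s.events_less, s.events_equal, s.events_greater, 0)
FROM segment s;

-- If the function was inlined, the plan's Output line shows the expanded
-- arithmetic, roughly:
--   Output: (1.0 * ((s.events_greater + s.events_equal) + 0)) / ...
-- If it was not, the Output line shows the call itself:
--   Output: pv(s.events_less, s.events_equal, s.events_greater, 0)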

regards, tom lane




Re: [GENERAL] query not scaling

2017-10-26 Thread Rob Sargent


> On Oct 26, 2017, at 1:02 AM, Laurenz Albe  wrote:
> 
> Rob Sargent wrote:
>> I have a query I cannot tame and I'm wondering if there's an alternative
>> to the "between" clause I'm using.  Perhaps a custom type could do
>> better?  I've tried the "<@" operator and that changes the query plan
>> significantly but the execution cost/time is not improved.
>> 
>> Any suggestion or pointers much appreciated.
> 
> You really need EXPLAIN (ANALYZE, BUFFERS) output to be able to
> understand what is going on.
> 
> A couple of simple things to check:
> 
> - Have all tables been ANALYZED beforehand?
> - Are all optimizer database parameters identical?
> 
> Also, to have PostgreSQL inline the function, which would be good
> for performance, it should be declared IMMUTABLE.
> 
> Yours,
> Laurenz Albe
The explain analyze was (maybe is) still running, but without buffers.
Thought of the immutable bit; will be doing that test.
All tables vacuumed and analyzed with each structural change.
Thanks




Re: [GENERAL] query not scaling

2017-10-26 Thread Laurenz Albe
Rob Sargent wrote:
> I have a query I cannot tame and I'm wondering if there's an alternative
> to the "between" clause I'm using.  Perhaps a custom type could do
> better?  I've tried the "<@" operator and that changes the query plan
> significantly but the execution cost/time is not improved.
> 
> Any suggestion or pointers much appreciated.

You really need EXPLAIN (ANALYZE, BUFFERS) output to be able to
understand what is going on.

A couple of simple things to check:

- Have all tables been ANALYZED beforehand?
- Are all optimizer database parameters identical?
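
Both are easy to check from the system catalogs (a sketch; adjust table
and parameter names to taste):

SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('marker', 'segment', 'probandset');

SELECT name, setting
FROM pg_settings
WHERE name IN ('work_mem', 'random_page_cost', 'effective_cache_size');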

Also, to have PostgreSQL inline the function, which would be good
for performance, it should be declared IMMUTABLE.

Yours,
Laurenz Albe




[GENERAL] query not scaling

2017-10-25 Thread Rob Sargent

I have a query I cannot tame and I'm wondering if there's an alternative
to the "between" clause I'm using.  Perhaps a custom type could do
better?  I've tried the "<@" operator and that changes the query plan
significantly but the execution cost/time is not improved.

Any suggestion or pointers much appreciated.

Environment: a virtual CentOS Linux release 7.4.1708 (Core) host, 4
cores (2.3GHz), 8G RAM, and postgres 10.0 (beta3) with shared_buffers =
1GB and work_mem = 2GB.

Domain (TL/DR): A "segment" is defined by a particular set of people
(probandset.id) plus a subset of markers (markerset.id, startmarker,
endmarker).  I need the minimum p-value for each marker in the set,
across all segments matching the set and a specific set of people.  A
given segment says "I cover all the markers from startbase to endbase",
and each marker has a specific base position (relative to a chromosome).
I'm after the smallest p-value for each marker, taken across the set of
segments which include that marker (hence the 'between' clause).

Context: I have the query in a function, so the ids of all the
players are available to the following sql:

select m.id as mkrid
     , min(pv(s.events_less, s.events_equal, s.events_greater, 0)) as optval
from marker m join segment s on m.basepos between s.startbase and s.endbase
     and m.chrom = 1
     and s.chrom = 1
     and s.markerset_id = suppliedMarkersetId
join probandset r on s.probandset_id = r.id
     and r.people_id = suppliedPeopleId
group by m.id

where the pv function is

create or replace function pv(l bigint, e bigint, g bigint, o int)
returns numeric
as
$$
select 1.0*(g+e+o)/(l+e+g+o);
$$
language sql
;


I have the identical schema in two databases (same pg instance) and the
table definitions involved are below.  In one schema there are 1.7M
records in segment and in the other there are 40M rows.  The marker
tables are much closer in size, with 600K and 900K rows respectively.  The
third table, probandset, has 60 and 600 rows respectively.  On average
there are 0.8M and 1.8M segments per markerset_id.

The explains: fast (12sec), then slow (hours).  The part which sticks
out to me is where the "between" gets used.  (I'm betting that
probandset is too small to matter.)  The slower explain plan is very
similar to what I saw originally in the now "fast" data set, and the
current indexing strategy comes largely from that performance work.

It looks like I'm getting a Cartesian product between the number of markers
in a set and the number of segments found: ten zeros at least (81963
markers * 49228 segments is already roughly 4 * 10^9 pairs to test).
  QUERY PLAN
--
 HashAggregate  (cost=291472.27..292040.58 rows=56831 width=48)
   Group Key: m.id
   ->  Nested Loop  (cost=3752.33..167295.52 rows=4515518 width=40)
         ->  Nested Loop  (cost=3751.90..17906.25 rows=715 width=32)
               ->  Seq Scan on probandset r  (cost=0.00..2.77 rows=4 width=16)
                     Filter: (people_id = '4e3b9829-43a8-4f84-9df6-f120dc5b1a7e'::uuid)
               ->  Bitmap Heap Scan on segment s  (cost=3751.90..4473.96 rows=191 width=48)
                     Recheck Cond: ((probandset_id = r.id) AND (chrom = 1) AND (markerset_id = '61a7e5cb-b81d-42e4-9e07-6bd9c2fbe6d1'::uuid))
                     ->  BitmapAnd  (cost=3751.90..3751.90 rows=191 width=0)
                           ->  Bitmap Index Scan on useg  (cost=0.00..72.61 rows=2418 width=0)
                                 Index Cond: ((probandset_id = r.id) AND (chrom = 1))
                           ->  Bitmap Index Scan on segment_markerset_id_idx  (cost=0.00..3676.23 rows=140240 width=0)
                                 Index Cond: (markerset_id = '61a7e5cb-b81d-42e4-9e07-6bd9c2fbe6d1'::uuid)
         ->  Index Scan using marker_chrom_basepos_idx on marker m  (cost=0.42..145.79 rows=6315 width=20)
               Index Cond: ((chrom = 1) AND (basepos >= s.startbase) AND (basepos <= s.endbase))
(15 rows)


 QUERY PLAN
-
 HashAggregate  (cost=83131331.81..83132151.44 rows=81963 width=48)
   Group Key: m.id
   ->  Nested Loop  (cost=1907.38..70802659.35 rows=448315362 width=40)
         Join Filter: ((m.basepos >= s.startbase) AND (m.basepos <= s.endbase))
         ->  Bitmap Heap Scan on marker m  (cost=1883.64..11009.18 rows=81963 width=20)
               Recheck Cond: (chrom = 1)
               ->  Bitmap Index Scan on marker_chrom_basepos_idx  (cost=0.00..1863.15 rows=81963 width=0)
                     Index Cond: (chrom = 1)
         ->  Materialize  (cost=23.74..181468.38 rows=49228 width=32)