Re: [GENERAL] query not scaling
On 10/31/2017 03:12 AM, Laurenz Albe wrote:
> Rob Sargent wrote:
>>> I think your biggest problem is the join condition on
>>>    m.basepos between s.startbase and s.endbase
>>> That forces a nested loop join, which cannot be performed efficiently.
>>
>> Agree! 800,000 * 4,000 = 3,200,000,000. It's just that I thought I had
>> corralled that problem with indexing, but apparently not. I was hoping some
>> kind soul might point out a missing index or similar correction. I have
>> completely reworked the process, but I'm not sure yet if it's correct.
>> (The slow answer is correct, once it comes in.)
>
> You can create indexes that are useful for this query:
>
>    ON sui.segment(chrom, markerset_id)
>    ON sui.probandset(people_id)
>
> But that probably won't make a big difference, because the sequential scans
> take only a small fraction of your query time. A little less than half of the
> query time is spent in the nested loop join, and a little more than half is
> spent doing the GROUP BY.
>
> Perhaps the biggest improvement you can easily make would be to get rid of
> "numeric" for the computation. I suspect that this is where a lot of time is
> spent, since the hash aggregate is over fewer than 15000 rows. Unless you
> really need the precision of "numeric", try
>
>    CREATE OR REPLACE FUNCTION pv(l bigint, e bigint, g bigint, o int)
>       RETURNS double precision LANGUAGE sql AS
>       $$SELECT (g+e+o)::double precision / (l+e+g+o)::double precision$$;
>
> Yours,
> Laurenz Albe

In practice markersets are always aligned with one chromosome, so I would not expect this to have an effect. There's no constraint on this, however, and there can be more than one markerset per chromosome. I have played with indexing on segment.markerset_id. In all the data sets used in the examples (runtimes, explains, etc.) there has been a single people_id across the existing segment data. Down the road this will of course not be the case, and I can see the value of an index on probandset.people_id eventually. I can certainly add it now for a test.
I'm currently writing a probandset loader, hoping to get a test case for the problem with gin indexing mentioned up-thread. I think I'm most surprised at the notion that the arithmetic is the problem, and will happily test your suggestion to force floating point values. The value can get small (10^-12 on a good day!) but we don't need many digits of precision.

Thanks

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
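For reference, the indexes Laurenz suggests above could be created with DDL along these lines (a sketch only: the index names are invented, the table and column names come from the thread):

```sql
-- Suggested covering indexes for the segment scan and the probandset filter.
-- Index names are hypothetical.
CREATE INDEX segment_chrom_markerset_idx ON sui.segment (chrom, markerset_id);
CREATE INDEX probandset_people_idx ON sui.probandset (people_id);
```

As Laurenz notes, these target the sequential scans only, which are a small fraction of the total runtime here.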
Re: [GENERAL] query not scaling
On Thu, 2017-10-26 at 19:07 -0600, Rob Sargent wrote:
> -> Nested Loop (cost=3799.40..44686205.23 rows=1361304413 width=40)
>    (actual time=55.443..89684.451 rows=75577302 loops=1)
>    -> Hash Join (cost=3798.98..43611.56 rows=823591 width=32)
>       (actual time=55.393..1397.509 rows=823591 loops=1)
>    -> Index Scan using marker_chrom_basepos_idx on base.marker m
>       (cost=0.42..37.67 rows=1653 width=20)
>       (actual time=0.010..0.075 rows=92 loops=823591)
>       Index Cond: ((m.chrom = 22) AND (m.basepos >= s.startbase) AND (m.basepos <= s.endbase))

I think your biggest problem is the join condition on

   m.basepos between s.startbase and s.endbase

That forces a nested loop join, which cannot be performed efficiently.

Yours,
Laurenz Albe
Re: [GENERAL] query not scaling
On Thu, Oct 26, 2017 at 10:01 AM, Tom Lane wrote:
> Laurenz Albe writes:
>> Also, to have PostgreSQL inline the function, which would be good
>> for performance, it should be declared IMMUTABLE.
>
> Actually, if you hope to have a SQL function be inlined, it's better
> not to decorate it at all --- not with IMMUTABLE, and not with STRICT
> either. Both of those restrict the parser's ability to inline unless
> it can prove the contained expression is equally immutable/strict.
> With the default attributes of volatile/not strict, there's nothing
> to prove.

This is extremely obnoxious. Is it possible to raise a warning on function creation?

> (In any case, it's usually easy enough to tell from EXPLAIN output
> whether inlining has happened.)

No, it isn't. The EXPLAIN syntax is arcane, and inlining as a general concept is only very indirectly expressed. I really think we ought to do better here; I was not able to find any treatment of inlining in the 'Performance Tips' or the 'Functions and Operators' sections, or anywhere really (except the wiki). This is really a disservice to the users, I think.

merlin
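For what it's worth, the EXPLAIN check Tom alludes to can be illustrated with a toy function (hypothetical example, not from this thread):

```sql
-- A trivially inlinable SQL function.
CREATE FUNCTION add_one(i int) RETURNS int
LANGUAGE sql AS $$ SELECT i + 1 $$;

EXPLAIN (VERBOSE) SELECT add_one(x) FROM generate_series(1, 3) AS t(x);
-- If the function was inlined, the plan's Output line shows the expanded
-- expression, e.g. "(x + 1)"; if inlining failed, it shows the function
-- call itself, "add_one(x)".
```

The signal is indirect, as merlin says: you have to know to look at the Output line of EXPLAIN VERBOSE and recognize whether the function name still appears there.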
Re: [GENERAL] query not scaling
On 10/26/2017 09:01 AM, Tom Lane wrote:
> Laurenz Albe writes:
>> Also, to have PostgreSQL inline the function, which would be good
>> for performance, it should be declared IMMUTABLE.
>
> Actually, if you hope to have a SQL function be inlined, it's better
> not to decorate it at all --- not with IMMUTABLE, and not with STRICT
> either. Both of those restrict the parser's ability to inline unless
> it can prove the contained expression is equally immutable/strict.
> With the default attributes of volatile/not strict, there's nothing
> to prove.
>
> (In any case, it's usually easy enough to tell from EXPLAIN output
> whether inlining has happened.)
>
> regards, tom lane

In another instance of the same schema, in the same database as the original slow execution, I've loaded 823591 segments (though in this case all of them are on one chromosome, one markerset) and 65K probandsets, using the same marker table as the slow(est) query. In the fastest run, there are only 46K segments for the given markerset.

                                QUERY PLAN
--------------------------------------------------------------------------------
 HashAggregate  (cost=82122076.59..8215.35 rows=14876 width=48) (actual time=208203.091..208210.348 rows=14645 loops=1)
   Output: m.id, min(((1.0 * ((s.events_greater + s.events_equal) + 0))::numeric / (((s.events_less + s.events_equal) + s.events_greater) + 0)::numeric))
   Group Key: m.id
   Buffers: shared hit=43209090
   ->  Nested Loop  (cost=3799.40..44686205.23 rows=1361304413 width=40) (actual time=55.443..89684.451 rows=75577302 loops=1)
         Output: m.id, s.events_greater, s.events_equal, s.events_less
         Buffers: shared hit=43209090
         ->  Hash Join  (cost=3798.98..43611.56 rows=823591 width=32) (actual time=55.393..1397.509 rows=823591 loops=1)
               Output: s.events_greater, s.events_equal, s.events_less, s.startbase, s.endbase
               Inner Unique: true
               Hash Cond: (s.probandset_id = p.id)
               Buffers: shared hit=19222
               ->  Seq Scan on sui.segment s  (cost=0.00..29414.86 rows=823591 width=48) (actual time=0.017..669.915 rows=823591 loops=1)
                     Output: s.id, s.chrom, s.markerset_id, s.probandset_id, s.startbase, s.endbase, s.firstmarker, s.lastmarker, s.events_less, s.events_equal, s.events_greater
                     Filter: ((s.chrom = 22) AND (s.markerset_id = 'edf95066-24d2-4ca1-bad6-aa850cc82fef'::uuid))
                     Buffers: shared hit=17061
               ->  Hash  (cost=2979.99..2979.99 rows=65519 width=16) (actual time=55.272..55.272 rows=65519 loops=1)
                     Output: p.id
                     Buckets: 65536  Batches: 1  Memory Usage: 3584kB
                     Buffers: shared hit=2161
                     ->  Seq Scan on sui.probandset p  (cost=0.00..2979.99 rows=65519 width=16) (actual time=0.007..33.582 rows=65519 loops=1)
                           Output: p.id
                           Filter: (p.people_id = '9b2308b1-9659-4a2c-91ae-8f95cd0a90b3'::uuid)
                           Buffers: shared hit=2161
         ->  Index Scan using marker_chrom_basepos_idx on base.marker m  (cost=0.42..37.67 rows=1653 width=20) (actual time=0.010..0.075 rows=92 loops=823591)
               Output: m.id, m.name, m.chrom, m.basepos, m.alleles
               Index Cond: ((m.chrom = 22) AND (m.basepos >= s.startbase) AND (m.basepos <= s.endbase))
               Buffers: shared hit=43189868
 Planning time: 0.764 ms
 Execution time: 208214.816 ms
(30 rows)
Re: [GENERAL] query not scaling
On 10/26/2017 09:01 AM, Tom Lane wrote:
> Laurenz Albe writes:
>> Also, to have PostgreSQL inline the function, which would be good
>> for performance, it should be declared IMMUTABLE.
>
> Actually, if you hope to have a SQL function be inlined, it's better
> not to decorate it at all --- not with IMMUTABLE, and not with STRICT
> either. Both of those restrict the parser's ability to inline unless
> it can prove the contained expression is equally immutable/strict.
> With the default attributes of volatile/not strict, there's nothing
> to prove.
>
> (In any case, it's usually easy enough to tell from EXPLAIN output
> whether inlining has happened.)
>
> regards, tom lane

As to the explain analyze:

   could not receive data from server: Connection timed out
   Time: 7877340.565 ms

for the second time. I had presumed at first that this had occurred during a network burp. I'll try running it directly on the pg host.
Re: [GENERAL] query not scaling
Laurenz Albe writes:
> Also, to have PostgreSQL inline the function, which would be good
> for performance, it should be declared IMMUTABLE.

Actually, if you hope to have a SQL function be inlined, it's better not to decorate it at all --- not with IMMUTABLE, and not with STRICT either. Both of those restrict the parser's ability to inline unless it can prove the contained expression is equally immutable/strict. With the default attributes of volatile/not strict, there's nothing to prove.

(In any case, it's usually easy enough to tell from EXPLAIN output whether inlining has happened.)

regards, tom lane
Re: [GENERAL] query not scaling
> On Oct 26, 2017, at 1:02 AM, Laurenz Albe wrote:
>
> Rob Sargent wrote:
>> I have a query I cannot tame and I'm wondering if there's an alternative
>> to the "between" clause I'm using. Perhaps a custom type could do
>> better? I've tried the "<@" operator and that changes the query plan
>> significantly, but the execution cost/time is not improved.
>>
>> Any suggestions or pointers much appreciated.
>
> You really need EXPLAIN (ANALYZE, BUFFERS) output to be able to
> understand what is going on.
>
> A couple of simple things to check:
>
> - Have all tables been ANALYZED beforehand?
> - Are all optimizer database parameters identical?
>
> Also, to have PostgreSQL inline the function, which would be good
> for performance, it should be declared IMMUTABLE.
>
> Yours,
> Laurenz Albe

The explain analyze was (maybe is) still running, but without buffers. Thought of the immutable bit; will be doing that test. All tables vacuumed and analyzed with each structural change.

Thanks
Re: [GENERAL] query not scaling
Rob Sargent wrote:
> I have a query I cannot tame and I'm wondering if there's an alternative
> to the "between" clause I'm using. Perhaps a custom type could do
> better? I've tried the "<@" operator and that changes the query plan
> significantly, but the execution cost/time is not improved.
>
> Any suggestions or pointers much appreciated.

You really need EXPLAIN (ANALYZE, BUFFERS) output to be able to understand what is going on.

A couple of simple things to check:

- Have all tables been ANALYZED beforehand?
- Are all optimizer database parameters identical?

Also, to have PostgreSQL inline the function, which would be good for performance, it should be declared IMMUTABLE.

Yours,
Laurenz Albe
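Gathering the requested output is just a matter of prefixing the query; a sketch based on the query shown in the original post below (identifiers such as suppliedMarkersetId are placeholders from that post):

```sql
-- Sketch: capture timing plus buffer hit/read counts for the problem query.
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT m.id AS mkrid,
       min(pv(s.events_less, s.events_equal, s.events_greater, 0)) AS optval
FROM marker m
JOIN segment s ON m.basepos BETWEEN s.startbase AND s.endbase
     AND m.chrom = 1 AND s.chrom = 1
     AND s.markerset_id = suppliedMarkersetId
JOIN probandset r ON s.probandset_id = r.id
     AND r.people_id = suppliedPeopleId
GROUP BY m.id;
```

The BUFFERS option is what distinguishes "slow because of I/O" from "slow because of CPU": a plan that is entirely "shared hit" is CPU-bound, which matters for the numeric-arithmetic discussion later in this thread.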
[GENERAL] query not scaling
I have a query I cannot tame and I'm wondering if there's an alternative to the "between" clause I'm using. Perhaps a custom type could do better? I've tried the "<@" operator and that changes the query plan significantly, but the execution cost/time is not improved.

Any suggestions or pointers much appreciated.

Environment: a virtual CentOS Linux release 7.4.1708 (Core) machine, 4 cores (2.3GHz), 8G RAM, and postgres 10.0 (beta3); shared_buffers = 1GB, work_mem = 2GB.

Domain (TL;DR): A "segment" is defined by a particular set of people (probandset.id) plus a subset of markers (markerset.id, startmarker, endmarker). I need the minimum p-value for each marker in the set across all segments matching the set and a specific set of people. So a given segment says "I cover all the markers from startbase to endbase", and each marker has a specific base position (relative to a chromosome). I'm after the smallest p-value for each marker across the set of segments which include that marker (from the 'between' clause).

Context: I have the query in a function, so the ids of all the players are available to the following sql:

   select m.id as mkrid,
          min(pv(s.events_less, s.events_equal, s.events_greater, 0)) as optval
   from marker m
   join segment s on m.basepos between s.startbase and s.endbase
        and m.chrom = 1
        and s.chrom = 1
        and s.markerset_id = suppliedMarkersetId
   join probandset r on s.probandset_id = r.id
        and r.people_id = suppliedPeopleId
   group by m.id

where the pv function is

   create or replace function pv(l bigint, e bigint, g bigint, o int)
   returns numeric as
   $$ select 1.0*(g+e+o)/(l+e+g+o); $$
   language sql;

I have the identical schema in two databases (same pg instance) and the table definitions involved are below. In one schema there are 1.7M records in segment and in the other there are 40M rows. The marker tables are much more similar, with 600K and 900K rows respectively. The third table, probandset, has 60 and 600 rows respectively.
On average there are 0.8M and 1.8M segments per markerset_id. The explains: fast (12 sec), then slow (hours). The part which sticks out to me is where the "between" gets used. (I'm betting that probandset is too small to matter.) The slower explain plan is very similar to what I saw originally in the now "fast" data set, and the current indexing strategy comes largely from that performance work. It looks like I'm getting a Cartesian product between the number of markers in a set and the number of segments found: ten zeros at least.

                                QUERY PLAN
--------------------------------------------------------------------------------
 HashAggregate  (cost=291472.27..292040.58 rows=56831 width=48)
   Group Key: m.id
   ->  Nested Loop  (cost=3752.33..167295.52 rows=4515518 width=40)
         ->  Nested Loop  (cost=3751.90..17906.25 rows=715 width=32)
               ->  Seq Scan on probandset r  (cost=0.00..2.77 rows=4 width=16)
                     Filter: (people_id = '4e3b9829-43a8-4f84-9df6-f120dc5b1a7e'::uuid)
               ->  Bitmap Heap Scan on segment s  (cost=3751.90..4473.96 rows=191 width=48)
                     Recheck Cond: ((probandset_id = r.id) AND (chrom = 1) AND (markerset_id = '61a7e5cb-b81d-42e4-9e07-6bd9c2fbe6d1'::uuid))
                     ->  BitmapAnd  (cost=3751.90..3751.90 rows=191 width=0)
                           ->  Bitmap Index Scan on useg  (cost=0.00..72.61 rows=2418 width=0)
                                 Index Cond: ((probandset_id = r.id) AND (chrom = 1))
                           ->  Bitmap Index Scan on segment_markerset_id_idx  (cost=0.00..3676.23 rows=140240 width=0)
                                 Index Cond: (markerset_id = '61a7e5cb-b81d-42e4-9e07-6bd9c2fbe6d1'::uuid)
         ->  Index Scan using marker_chrom_basepos_idx on marker m  (cost=0.42..145.79 rows=6315 width=20)
               Index Cond: ((chrom = 1) AND (basepos >= s.startbase) AND (basepos <= s.endbase))
(15 rows)

                                QUERY PLAN
--------------------------------------------------------------------------------
 HashAggregate  (cost=83131331.81..83132151.44 rows=81963 width=48)
   Group Key: m.id
   ->  Nested Loop  (cost=1907.38..70802659.35 rows=448315362 width=40)
         Join Filter: ((m.basepos >= s.startbase) AND (m.basepos <= s.endbase))
         ->  Bitmap Heap Scan on marker m  (cost=1883.64..11009.18 rows=81963 width=20)
               Recheck Cond: (chrom = 1)
               ->  Bitmap Index Scan on marker_chrom_basepos_idx  (cost=0.00..1863.15 rows=81963 width=0)
                     Index Cond: (chrom = 1)
         ->  Materialize  (cost=23.74..181468.38 rows=49228 width=32)
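One alternative worth trying for this kind of between-join (a sketch only, not something tested in this thread: the baserange column and index names are invented, and the range type is assumed to fit the base positions):

```sql
-- Hypothetical: store the covered interval as a range column and index it
-- with GiST, so that containment tests can use the index instead of the
-- nested loop re-scanning markers for every segment.
ALTER TABLE segment ADD COLUMN baserange int8range;
UPDATE segment SET baserange = int8range(startbase, endbase, '[]');
CREATE INDEX segment_baserange_gist ON segment USING gist (baserange);

-- The join condition would then read (note the indexable @> form,
-- equivalent to m.basepos <@ s.baserange):
--   join segment s on s.baserange @> m.basepos
--        and s.chrom = 1 and s.markerset_id = suppliedMarkersetId
```

Whether the planner actually prefers the GiST path over the existing btree-driven nested loop would need to be verified with EXPLAIN (ANALYZE, BUFFERS) on the real data; the original post notes that simply switching the operator did not help, so the index and possibly btree_gist (to fold chrom/markerset_id into the same index) would be the interesting part.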