Re: [GENERAL] Perfomance of IN-clause with many elements and possible solutions

2017-07-31 Thread Dmitry Lazurkin
On 31.07.2017 19:42, Jeff Janes wrote:
> I think it is simply because no one has gotten around to implementing
> it that way.  When you can just write it as a values list instead, the
> incentive to make the regular in-list work better is not all that strong.
>
> Cheers,
>
> Jeff

I see from explain that IN-clause uses just array with function ANY. I
think for efficient implementation of this task I should implement new
datatype "hashset". Am I wrong?


Re: [GENERAL] Planner statistics usage for composite type

2017-07-20 Thread Dmitry Lazurkin
On 07/19/2017 06:37 PM, Tom Lane wrote:
> Dmitry Lazurkin <dila...@gmail.com> writes:
>> I am trying to find workaround for cross-column statistics.
>> ...
>> Worn estimate. Planner doesn't use statistics. In code I see usage of
>> function scalargtsel which returns default selectivity because
>> ROW('tag1', 0.9)::tag_sim is not Const.
> regression=# EXPLAIN ANALYZE SELECT * FROM tags WHERE ROW(t, s)::tag_sim >= 
> '(tag1,0.9)'::tag_sim AND
> ROW(t, s)::tag_sim <= '(tag1,1.0)'::tag_sim;
>   QUERY PLAN  
> 
> --
>  Index Scan using tags_composite on tags  (cost=0.29..8.30 rows=1 width=13) 
> (actual time=0.051..0.051 rows=0 loops=1)
>Index Cond: ((ROW(t, s)::tag_sim >= '(tag1,0.9)'::tag_sim) AND (ROW(t, 
> s)::tag_sim <= '(tag1,1)'::tag_sim))
>  Planning time: 0.230 ms
>  Execution time: 0.110 ms
> (4 rows)  

Thank you. (:



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Perfomance of IN-clause with many elements and possible solutions

2017-07-25 Thread Dmitry Lazurkin

On 25.07.2017 05:50, Jeff Janes wrote:
It isn't either-or.  It is the processing of millions of rows over the 
large in-list which is taking the time. Processing an in-list as a 
hash table would be great, but no one has gotten around to it 
implementing it yet.  Maybe Dmitry will be the one to do that.


Thanks. Yes, I want. But... Is this task too complex for novice?


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Perfomance of IN-clause with many elements and possible solutions

2017-07-24 Thread Dmitry Lazurkin

On 25.07.2017 00:17, PT wrote:

The IN clause is not what's taking all the time. It's the processing of
millions of rows that's taking all the time.


IN (...) - 17 sec
IN (VALUES ...) - 4 sec
So performance issue is with IN-clause.


Perhaps you should better describe what it is you really want to accomplish.
Regardless of what it is, if it involves processing many millions of rows,
you're probably going to need to do some sort of materialization.


I try to find better solutions for IN-task.


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Perfomance of IN-clause with many elements and possible solutions

2017-07-24 Thread Dmitry Lazurkin

On 25.07.2017 00:31, David G. Johnston wrote:


Basically you want to write something like:

SELECT *
FROM ids
JOIN ( :values_clause ) vc (vid) ON (vc.vid = ids.id )​

or

WITH vc AS (SELECT vid FROM  ORDER BY ... LIMIT )
SELECT *
FROM ids
JOIN vc ON (vid = ids.id )



This query uses JOIN plan node as IN (VALUES ...).

And I have one question. I don't understand why IN-VALUES doesn't use 
Semi-Join? PostgreSQL has Hash Semi-Join...  For which task the database 
has node of this type?




Re: [GENERAL] Perfomance of IN-clause with many elements and possible solutions

2017-07-24 Thread Dmitry Lazurkin

On 25.07.2017 01:25, David G. Johnston wrote:
On Mon, Jul 24, 2017 at 3:22 PM, Dmitry Lazurkin <dila...@gmail.com 
<mailto:dila...@gmail.com>>wrote:


ALTER TABLE ids ALTER COLUMN id SET NOT NULL;
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM ids WHERE id IN
:values_clause;

 Aggregate  (cost=245006.46..245006.47 rows=1 width=8) (actual
time=3824.095..3824.095 rows=1 loops=1)
   Buffers: shared hit=44248
   ->  Hash Join  (cost=7.50..235006.42 rows=419 width=0)
(actual time=1.108..3327.112 rows=3998646 loops=1)
   ...


​You haven't constrained the outer relation (i.e., :values_clause) to 
be non-null which is what I believe is required for the semi-join 
algorithm to be considered.​


David J.


CREATE TABLE second_ids (i bigint);
INSERT INTO second_ids :values_clause;

EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM ids WHERE id IN (select 
i from second_ids);


 Aggregate  (cost=225004.36..225004.37 rows=1 width=8) (actual 
time=3826.641..3826.641 rows=1 loops=1)

   Buffers: shared hit=44249
   ->  Hash Semi Join  (cost=5.50..215004.32 rows=419 width=0) 
(actual time=0.352..3338.601 rows=3998646 loops=1)

 Hash Cond: (ids.id = second_ids.i)
 Buffers: shared hit=44249
 ->  Seq Scan on ids  (cost=0.00..144248.48 rows=1048 
width=8) (actual time=0.040..1069.006 rows=1000 loops=1)

   Buffers: shared hit=44248
 ->  Hash  (cost=3.00..3.00 rows=200 width=8) (actual 
time=0.288..0.288 rows=200 loops=1)

   Buckets: 1024  Batches: 1  Memory Usage: 16kB
   Buffers: shared hit=1
   ->  Seq Scan on second_ids  (cost=0.00..3.00 rows=200 
width=8) (actual time=0.024..0.115 rows=200 loops=1)

 Buffers: shared hit=1
 Planning time: 0.413 ms
 Execution time: 3826.752 ms

Hash Semi-Join without NOT NULL constraint on second table.


Re: [GENERAL] Perfomance of IN-clause with many elements and possible solutions

2017-07-24 Thread Dmitry Lazurkin

On 25.07.2017 01:15, David G. Johnston wrote:
On Mon, Jul 24, 2017 at 3:12 PM, Dmitry Lazurkin <dila...@gmail.com 
<mailto:dila...@gmail.com>>wrote:


And I have one question. I don't understand why IN-VALUES doesn't
use Semi-Join? PostgreSQL has Hash Semi-Join...  For which task
the database has node of this type?


​Semi-Join is canonically written as:

SELECT *
FROM tbl
WHERE EXISTS (SELECT 1 FROM tbl2 WHERE tbl.id <http://tbl.id> = 
tbl2.id <http://tbl2.id>)


The main difference between IN and EXISTS is NULL semantics.

David J.



ALTER TABLE ids ALTER COLUMN id SET NOT NULL;
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM ids WHERE id IN 
:values_clause;


 Aggregate  (cost=245006.46..245006.47 rows=1 width=8) (actual 
time=3824.095..3824.095 rows=1 loops=1)

   Buffers: shared hit=44248
   ->  Hash Join  (cost=7.50..235006.42 rows=419 width=0) (actual 
time=1.108..3327.112 rows=3998646 loops=1)

   ...

Hmmm. No Semi-Join.


PostgreSQL can use Semi-Join for IN too.



Re: [GENERAL] Perfomance of IN-clause with many elements and possible solutions

2017-07-26 Thread Dmitry Lazurkin
On 23.07.2017 14:35, dilaz03 . wrote:
> - IN-VALUES clause adds new node to plan. Has additional node big
> overhead? How about filter by two or more IN-VALUES clause?
>

Hmmm. This works.

-- Full table can fit in memory
show shared_buffers;
 shared_buffers

4GB


show work_mem;
 work_mem
--
 16MB


SET max_parallel_workers_per_gather TO 0;
SET max_parallel_workers TO 0;

-- 10 000 000 events of 30 types from 500 sources
CREATE TABLE events AS
SELECT trunc(random() * 500)::bigint AS source_id, md5(trunc(random() *
30)::text) AS type
FROM generate_series(1, 1000);

-- Prepare all clauses
SELECT ('(' || string_agg(source_id::text, ',') || ')') AS
source_id_in_clause
FROM (SELECT source_id FROM events GROUP BY source_id ORDER BY source_id
LIMIT 200) AS s \gset

SELECT ('(' || string_agg(( || type || ), ',') || ')') AS
type_in_clause
FROM (SELECT type FROM events GROUP BY type ORDER BY type LIMIT 100) AS
s \gset

SELECT ('(VALUES ' || string_agg('(' || source_id::text || ')', ',') ||
')') AS source_id_values_clause
FROM (SELECT source_id FROM events GROUP BY source_id ORDER BY source_id
LIMIT 200) AS s \gset

SELECT ('(VALUES ' || string_agg('(''' || type::text || ''')', ',') ||
')') AS type_values_clause
FROM (SELECT type FROM events GROUP BY type ORDER BY type LIMIT 100) AS
s \gset

-- Run queries
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM events WHERE source_id
IN :source_id_in_clause AND type IN :type_in_clause;
 Execution time: 21314.277 ms

EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM events WHERE source_id
IN :source_id_values_clause AND type IN :type_in_clause;
 Execution time: 9421.592 ms

EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM events WHERE source_id
IN :source_id_in_clause AND type IN :type_values_clause;
 Execution time: 17598.467 ms

EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM events WHERE source_id
IN :source_id_values_clause AND type IN :type_values_clause;
 Execution time: 5589.925 ms





-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Perfomance of IN-clause with many elements and possible solutions

2017-07-24 Thread Dmitry Lazurkin
On 07/24/2017 01:40 AM, PT wrote:
> In this example you count approximately 40,000,000 values, which is
> about 40% of the table. 
4 000 000 (:

> If you really need these queries to be faster, I would suggest
> materializing the data, i.e. create a table like:
>
> CREATE TABLE id_counts (
>  id BIGINT PRIMARY KEY,
>  num BIGINT
> )
>
> Then use a trigger or similar technique to keep id_counts in sync
> with the id table. You can then run queries of the form:
>
> SELECT sum(num) FROM id_counts WHERE id IN :values:
>
> which I would wager houseboats will be significantly faster.
I use count only for example because it uses seqscan. I want optimize
IN-clause ;-).

Thanks.



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Perfomance of IN-clause with many elements and possible solutions

2017-07-24 Thread Dmitry Lazurkin
On 07/24/2017 01:40 AM, PT wrote:
> In this example you count approximately 40,000,000 values, which is
> about 40% of the table. 
4 000 000 (:

> If you really need these queries to be faster, I would suggest
> materializing the data, i.e. create a table like:
>
> CREATE TABLE id_counts (
>  id BIGINT PRIMARY KEY,
>  num BIGINT
> )
>
> Then use a trigger or similar technique to keep id_counts in sync
> with the id table. You can then run queries of the form:
>
> SELECT sum(num) FROM id_counts WHERE id IN :values:
>
> which I would wager houseboats will be significantly faster.
I use count only for example because it uses seqscan. I want optimize
IN-clause ;-).

Thanks.



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Planner statistics usage for composite type

2017-07-19 Thread Dmitry Lazurkin
Hello.

I am trying to find workaround for cross-column statistics. For example,
I have tags with similarity:
select version();
  version
---
 PostgreSQL 10beta2 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609, 64-bit

CREATE TABLE tags(
   t varchar(30) NOT NULL,
   s double precision
);
INSERT INTO tags (SELECT 'tag1'::text as t, 0.7 as s from
generate_series(0, 1));

I think i can create index for cross-column statistics:
CREATE TYPE tag_sim AS (
   t varchar(30),
   s double precision
);
CREATE INDEX tags_composite ON tags USING btree ((ROW(t, s)::tag_sim));
ANALYZE tags;
SELECT * FROM pg_stats WHERE tablename = 'tags_composite';
-[ RECORD 1 ]--+---
schemaname | public
tablename  | tags_composite
attname| row
inherited  | f
null_frac  | 0
avg_width  | 40
n_distinct | 1
most_common_vals   | {"(tag1,0.7)"}
most_common_freqs  | {1}
histogram_bounds   | (null)
correlation| 1
most_common_elems  | (null)
most_common_elem_freqs | (null)
elem_count_histogram   | (null)

OK, I have statistics. Search:
EXPLAIN SELECT * FROM tags WHERE ROW(t, s)::tag_sim >= ROW('tag1',
0.9)::tag_sim AND
ROW(t, s)::tag_sim <= ROW('tag1', 1.0)::tag_sim;
  QUERY PLAN
---
 Index Scan using tags_composite on tags  (cost=0.29..9.29 rows=50 width=13)
   Index Cond: ((ROW(t, s)::tag_sim >= ROW('tag1', 0.9)) AND (ROW(t,
s)::tag_sim <= ROW('tag1', 1.0)))
(2 rows)

Worn estimate. Planner doesn't use statistics. In code I see usage of
function scalargtsel which returns default selectivity because
ROW('tag1', 0.9)::tag_sim is not Const.

May be someone known how to fix this issue?

Thanks.



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Perfomance of IN-clause with many elements and possible solutions

2017-08-01 Thread Dmitry Lazurkin
On 08/01/2017 07:13 PM, Jeff Janes wrote:
> I think that HashSet is a Java-specific term.  It is just a hash table
> in which there is no data to store, just the key itself (and probably
> a cash of the hashcode of that key), correct? 

Yes. And in Java HashSet implemented on top of HashMap (:

> I think a more general solution would be to get the planner and
> executor to run the in-list query using the Hash Join, the same way it
> runs the in-VALUES one.

Have additional plan nodes big overhead?

> I was impressed at how well the JSON and hstore worked, you might want
> to look at how they do it.  It is must be using an internal hash table
> of some sort.

JSONB and HSTORE keep sorted pairs and use binary search.



[GENERAL] Change location of function/type installed from C-extension

2017-08-24 Thread Dmitry Lazurkin
Hello.

I have database with installed pg_trgm extension with module path
'/usr/lib/pg_trgm' (yes, this is mistake without $libdir (: ). Now I
want upgrade postgresql to new major version. I keep old version 9.3 in
/opt/postgresql/9.3 and new version 9.6 in /usr. Old version $libdir -
/opt/postgresql/9.3/lib, new version $libdir - /usr/lib/postgresql. So
now I am in trap because old version try to load shared libarry from
'/usr/lib/pg_trgm'. So I want change '/usr/lib/pg_trgm' to
'$libdir/pg_trgm' for old database. And I have only one ugly solution:
replace '/usr/lib/pg_trgm' with '$libdir//pg_trgm' in table files
(/usr/lib - 8 chars, $libdir/ - 8 chars).

May be someone has better solution?

Thanks.




-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Change location of function/type installed from C-extension

2017-08-26 Thread Dmitry Lazurkin
I try investigate where PotsgreSQL keeps path of load libraries in catalog.

select version();
   version
--
PostgreSQL 9.3.4 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu
5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609, 64-bit


select oid, datname from pg_database where datname = 'user';

  oid  | datname
---+-
 16384 | user


$ grep --text --null-data '$libdir/pg_trgm' data/test/base/16384/*

data/test/base/16384/11829:set_limit!$libdir/pg_trgm ...


select relname, relfilenode from pg_class where relfilenode = 11829;
 relname | relfilenode
-+-
(0 rows)


Hmmm. Where is table with filenode 11829?

Thanks.



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Change location of function/type installed from C-extension

2017-08-26 Thread Dmitry Lazurkin
On 26.08.2017 15:10, Dmitry Lazurkin wrote:
> I try investigate where PotsgreSQL keeps path of load libraries in catalog.
>
> select version();
>version
> --
> PostgreSQL 9.3.4 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu
> 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609, 64-bit
>
>
> select oid, datname from pg_database where datname = 'user';
>
>   oid  | datname
> ---+-
>  16384 | user
>
>
> $ grep --text --null-data '$libdir/pg_trgm' data/test/base/16384/*
>
> data/test/base/16384/11829:set_limit!$libdir/pg_trgm ...
>
>
> select relname, relfilenode from pg_class where relfilenode = 11829;
>  relname | relfilenode
> -+-
> (0 rows)
>
>
> Hmmm. Where is table with filenode 11829?
>
> Thanks.
>

This is pg_proc.

inst/test/bin/oid2name -d user -f 11829
>From database "user":
  Filenode  Table Name
--
 11829 pg_proc



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Change location of function/type installed from C-extension

2017-08-26 Thread Dmitry Lazurkin
On 26.08.2017 18:24, Tom Lane wrote:
> You need to use pg_relation_filenode():
>
> regression=# select relname, pg_relation_filenode(oid) from pg_class where 
> relname like 'pg_proc%';
> relname | pg_relation_filenode 
> +--
>  pg_proc_oid_index  |12662
>  pg_proc|12657
>  pg_proc_proname_args_nsp_index |12663
> (3 rows)
>
>   regards, tom lane

Thanks. Can I update "pg_proc.probin" without any problems?


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Change location of function/type installed from C-extension

2017-08-26 Thread Dmitry Lazurkin
On 26.08.2017 22:05, Tom Lane wrote:
> Dmitry Lazurkin <dila...@gmail.com> writes:
>> Thanks. Can I update "pg_proc.probin" without any problems?
> Should work.  I'd experiment in a scratch database before doing
> it in production, but I can't think of a problem offhand.
>
>   regards, tom lane

Thank you. That's working. I will try to implement in production. Update
with sql is better than replace in binary file (: .


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general