Hello everyone,
This is my first post on this list. I tried to look for possible solutions in
the archives, as well as on Google, but I could not find an explanation for such
a specific situation.
I am facing a performance problem related to the Postgres Tsearch2 full-text
search (FTS) mechanism.
Here is my query:
select participant.participant_id
from participant participant
join person person
  on person.person_participant_id = participant.participant_id
left join registration registration
  on registration.registration_registered_participant_id = participant.participant_id
left join enrollment enrollment
  on registration.registration_enrollment_id = enrollment.enrollment_id
join registration_configuration registration_configuration
  on enrollment.enrollment_configuration_id = registration_configuration.configuration_id
left join event_context context
  on context.context_id = registration_configuration.configuration_context_id
where participant.participant_type = 'PERSON'
  and participant_status = 'ACTIVE'
  and context.context_code in ('GB2TST2010A')
  and registration_configuration.configuration_type in ('VISITOR')
  and registration_configuration.configuration_id is not null
  and participant.participant_tsv || person.person_tsv @@
      to_tsquery('simple', to_tsquerystring('Abigail'))
limit 100
As you can see, I concatenate two vectors and check the result against a
tsquery.
Both vectors are indexed with GIN and updated with respective triggers in the
following way:
ALTER TABLE person ALTER COLUMN person_tsv SET STORAGE EXTENDED;

CREATE INDEX person_ft_index ON person USING gin(person_tsv);

CREATE OR REPLACE FUNCTION update_person_tsv() RETURNS trigger AS $$
BEGIN
  NEW.person_tsv := to_tsvector('simple', create_tsv(
    ARRAY[NEW.person_first_name, NEW.person_last_name, NEW.person_middle_name]));
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER person_tsv_update BEFORE INSERT OR UPDATE ON person
  FOR EACH ROW EXECUTE PROCEDURE update_person_tsv();

ALTER TABLE participant ALTER COLUMN participant_tsv SET STORAGE EXTENDED;

CREATE INDEX participant_ft_index ON participant USING gin(participant_tsv);

CREATE OR REPLACE FUNCTION update_participant_tsv() RETURNS trigger AS $$
BEGIN
  NEW.participant_tsv := to_tsvector('simple', create_tsv(
    ARRAY[NEW.participant_login, NEW.participant_email]));
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER participant_tsv_update BEFORE INSERT OR UPDATE ON participant
  FOR EACH ROW EXECUTE PROCEDURE update_participant_tsv();
The database is quite big: it has almost one million participant records. The
above query takes almost 67 seconds to execute and fetch 100 rows, which is
unacceptable for us.
My assumption is that when the vectors are concatenated, the individual index
on each vector is not used. Here is the execution plan, taken after the first
execution of the query:
"Limit (cost=46063.13..93586.79 rows=100 width=4) (actual
time=4963.620..39703.645 rows=100 loops=1)"
" -> Nested Loop (cost=46063.13..493736.04 rows=942 width=4) (actual
time=4963.617..39703.349 rows=100 loops=1)"
" Join Filter: (registration_configuration.configuration_id =
enrollment.enrollment_configuration_id)"
" -> Nested Loop (cost=46063.13..493662.96 rows=3769 width=8) (actual
time=4963.517..39701.557 rows=159 loops=1)"
" -> Nested Loop (cost=46063.13..466987.33 rows=3769 width=8)
(actual time=4963.498..39698.542 rows=159 loops=1)"
" -> Hash Join (cost=46063.13..430280.76 rows=4984
width=8) (actual time=4963.464..39692.676 rows=216 loops=1)"
" Hash Cond: (participant.participant_id =
person.person_participant_id)"
" Join Filter: ((participant.participant_tsv ||
person.person_tsv) @@ to_tsquery('simple'::regconfig,
to_tsquerystring('Abigail'::text)))"
" -> Seq Scan on participant (cost=0.00..84680.85
rows=996741 width=42) (actual time=0.012..3132.944 rows=1007151 loops=1)"
" Filter: (((participant_type)::text =
'PERSON'::text) AND ((participant_status)::text = 'ACTIVE'::text))"
" -> Hash (cost=25495.39..25495.39 rows=1012539
width=38) (actual time=3145.628..3145.628 rows=1007151 loops=1)"
" Buckets: 2048 Batches: 128 Memory Usage:
556kB"
" -> Seq Scan on person (cost=0.00..25495.39
rows=1012539 width=38) (actual time=0.062..1582.990 rows=1007151 loops=1)"
" -> Index Scan using
idx_registration_registered_participant_id on registration (cost=0.00..7.35
rows=1 width=8) (actual time=0.018..0.019 rows=1 loops=216)"
" Index Cond:
(registration.registration_registered_participant_id =
person.person_participant_id)"
" -> Index Scan using enrollment_pkey on enrollment
(cost=0.00..7.07 rows=1 width=8) (actual time=0.011..0.013 rows=1 loops=159)"
" Index Cond: (enrollment.enrollment_id =
registration.registration_enrollment_id)"
" -> Materialize (cost=0.00..16.55 rows=1 width=4) (actual
time=0.002..0.005 rows=2 loops=159)"
" -> Nested Loop (cost=0.00..16.55 rows=1 width=4) (actual
time=0.056..0.077 rows=2 loops=1)"
" Join Filter:
(registration_configuration.configuration_context_id = context.context_id)"
" -> Index Scan using idx_configuration_type on
registration_configuration (cost=0.00..8.27 rows=1 width=8) (actual
time=0.018..0.022 rows=3 loops=1)"
" Index Cond: ((configuration_type)::text =
'VISITOR'::text)"
" Filter: (configuration_id IS NOT NULL)"
" -> Index Scan using idx_event_context_code on
event_context context (cost=0.00..8.27 rows=1 width=4) (actual
time=0.008..0.010 rows=1 loops=3)"
" Index Cond: ((context.context_code)::text =
'GB2TST2010A'::text)"
"Total runtime: 39775.578 ms"
The assumption seems to be correct: no indexes on the vectors are used, and
sequential scans are done instead:
Join Filter: ((participant.participant_tsv || person.person_tsv) @@
to_tsquery('simple'::regconfig, to_tsquerystring('Abigail'::text)))"
" -> Seq Scan on participant (cost=0.00..84680.85
rows=996741 width=42) (actual time=0.012..3132.944 rows=1007151 loops=1)"
" Filter: (((participant_type)::text =
'PERSON'::text) AND ((participant_status)::text = 'ACTIVE'::text))"
" -> Hash (cost=25495.39..25495.39 rows=1012539
width=38) (actual time=3145.628..3145.628 rows=1007151 loops=1)"
" Buckets: 2048 Batches: 128 Memory Usage:
556kB"
" -> Seq Scan on person (cost=0.00..25495.39
rows=1012539 width=38) (actual time=0.062..1582.990 rows=1007151 loops=1)"
After I removed one of the vectors from the query and used only a single
vector
...
and person.person_tsv @@ to_tsquery('simple', to_tsquerystring('Abigail'))
...
then execution was much faster: about 5 seconds.
The plan afterwards:
"Limit (cost=41.14..8145.82 rows=100 width=4) (actual time=3.776..13.454
rows=100 loops=1)"
" -> Nested Loop (cost=41.14..21923.77 rows=270 width=4) (actual
time=3.773..13.248 rows=100 loops=1)"
" -> Nested Loop (cost=41.14..19730.17 rows=270 width=8) (actual
time=3.760..11.971 rows=100 loops=1)"
" Join Filter: (registration_configuration.configuration_id =
enrollment.enrollment_configuration_id)"
" -> Nested Loop (cost=0.00..16.55 rows=1 width=4) (actual
time=0.051..0.051 rows=1 loops=1)"
" Join Filter:
(registration_configuration.configuration_context_id = context.context_id)"
" -> Index Scan using idx_configuration_type on
registration_configuration (cost=0.00..8.27 rows=1 width=8) (actual
time=0.020..0.022 rows=2 loops=1)"
" Index Cond: ((configuration_type)::text =
'VISITOR'::text)"
" Filter: (configuration_id IS NOT NULL)"
" -> Index Scan using idx_event_context_code on
event_context context (cost=0.00..8.27 rows=1 width=4) (actual
time=0.008..0.009 rows=1 loops=2)"
" Index Cond: ((context.context_code)::text =
'GB2TST2010A'::text)"
" -> Nested Loop (cost=41.14..19700.12 rows=1080 width=12)
(actual time=3.578..11.431 rows=269 loops=1)"
" -> Nested Loop (cost=41.14..12056.27 rows=1080 width=12)
(actual time=3.568..8.203 rows=269 loops=1)"
" -> Bitmap Heap Scan on person (cost=41.14..3687.07
rows=1080 width=4) (actual time=3.553..4.401 rows=346 loops=1)"
" Recheck Cond: (person_tsv @@
to_tsquery('simple'::regconfig, to_tsquerystring('Abigail'::text)))"
" -> Bitmap Index Scan on person_ft_index
(cost=0.00..40.87 rows=1080 width=0) (actual time=3.353..3.353 rows=1060
loops=1)"
" Index Cond: (person_tsv @@
to_tsquery('simple'::regconfig, to_tsquerystring('Abigail'::text)))"
" -> Index Scan using
idx_registration_registered_participant_id on registration (cost=0.00..7.74
rows=1 width=8) (actual time=0.006..0.007 rows=1 loops=346)"
" Index Cond:
(registration.registration_registered_participant_id =
person.person_participant_id)"
" -> Index Scan using enrollment_pkey on enrollment
(cost=0.00..7.07 rows=1 width=8) (actual time=0.006..0.007 rows=1 loops=269)"
" Index Cond: (enrollment.enrollment_id =
registration.registration_enrollment_id)"
" -> Index Scan using participant_pkey on participant (cost=0.00..8.11
rows=1 width=4) (actual time=0.007..0.009 rows=1 loops=100)"
" Index Cond: (participant.participant_id =
person.person_participant_id)"
" Filter: (((participant.participant_type)::text = 'PERSON'::text)
AND ((participant.participant_status)::text = 'ACTIVE'::text))"
"Total runtime: 13.858 ms"
Now the index on the vector was used:
"Recheck Cond: (person_tsv @@ to_tsquery('simple'::regconfig,
to_tsquerystring('Abigail'::text)))"
" -> Bitmap Index Scan on person_ft_index
(cost=0.00..40.87 rows=1080 width=0) (actual time=3.353..3.353 rows=1060
loops=1)"
" Index Cond: (person_tsv @@
to_tsquery('simple'::regconfig, to_tsquerystring('Abigail'::text)))"
So there is apparently a problem with vector concatenation: the indexes are
not used in that case. I also tried to use the vectors separately, combining
the individual vector @@ tsquery checks with OR, but it did not help much
(performance was better, but still over 20 seconds):
...
(participant.participant_tsv @@
to_tsquery('simple',to_tsquerystring('Abigail'))) OR (person.person_tsv @@
to_tsquery('simple',to_tsquerystring('Abigail')))
...
Is there a way to make this work with better performance? Or is it necessary
to create a single vector that contains data from multiple tables and then
index that? This would be rather problematic for us, because we use multiple
complex queries with a variable number of selected columns. I know that
another solution might be a UNION of multiple queries, each of which uses a
single vector, but that solution is inconvenient too.
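For reference, this is roughly the UNION variant I mean. It is only a sketch
(the joins and filters are abbreviated, and I have not tuned it); the point is
that each branch filters on a single tsvector, so each one can use its own GIN
index:

```sql
-- Sketch: one branch per tsvector column, so person_ft_index and
-- participant_ft_index can each be used; UNION removes duplicate ids.
select participant.participant_id
from participant participant
join person person on person.person_participant_id = participant.participant_id
-- ... same remaining joins and filters as in the original query ...
where person.person_tsv @@ to_tsquery('simple', to_tsquerystring('Abigail'))
union
select participant.participant_id
from participant participant
join person person on person.person_participant_id = participant.participant_id
-- ... same remaining joins and filters as in the original query ...
where participant.participant_tsv @@ to_tsquery('simple', to_tsquerystring('Abigail'))
limit 100;
```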
Greetings
Jan
--
Sent via pgsql-performance mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance