Hello everyone,
This is my first post on this list. I tried to look for possible solutions in
the archives, as well as on Google, but I could not find an explanation for such
a specific situation.
I am facing a performance problem related to the Postgres Tsearch2 full-text
search (FTS) mechanism.
Here is my query:
select participant.participant_id
from participant participant
join person person
  on person.person_participant_id = participant.participant_id
left join registration registration
  on registration.registration_registered_participant_id = participant.participant_id
left join enrollment enrollment
  on registration.registration_enrollment_id = enrollment.enrollment_id
join registration_configuration registration_configuration
  on enrollment.enrollment_configuration_id = registration_configuration.configuration_id
left join event_context context
  on context.context_id = registration_configuration.configuration_context_id
where participant.participant_type = 'PERSON'
  and participant_status = 'ACTIVE'
  and context.context_code in ('GB2TST2010A')
  and registration_configuration.configuration_type in ('VISITOR')
  and registration_configuration.configuration_id is not null
  and participant.participant_tsv || person.person_tsv @@
      to_tsquery('simple', to_tsquerystring('Abigail'))
limit 100
As you can see, I concatenate two vectors and check the result against a
tsquery.
Both vectors are indexed with GIN and updated with respective triggers in the
following way:
ALTER TABLE person ALTER COLUMN person_tsv SET STORAGE EXTENDED;

CREATE INDEX person_ft_index ON person USING gin(person_tsv);

CREATE OR REPLACE FUNCTION update_person_tsv() RETURNS trigger AS $$
BEGIN
  NEW.person_tsv := to_tsvector('simple', create_tsv(
    ARRAY[NEW.person_first_name, NEW.person_last_name, NEW.person_middle_name]));
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER person_tsv_update BEFORE INSERT OR UPDATE ON person
  FOR EACH ROW EXECUTE PROCEDURE update_person_tsv();

ALTER TABLE participant ALTER COLUMN participant_tsv SET STORAGE EXTENDED;

CREATE INDEX participant_ft_index ON participant USING gin(participant_tsv);

CREATE OR REPLACE FUNCTION update_participant_tsv() RETURNS trigger AS $$
BEGIN
  NEW.participant_tsv := to_tsvector('simple', create_tsv(
    ARRAY[NEW.participant_login, NEW.participant_email]));
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER participant_tsv_update BEFORE INSERT OR UPDATE ON participant
  FOR EACH ROW EXECUTE PROCEDURE update_participant_tsv();
The database is quite big: it has almost one million participant records. The
above query takes almost 67 seconds to execute and fetch 100 rows, which is
unacceptable for us.
My assumption is that when the vectors are concatenated, the individual index
on each vector is not used. Here is the execution plan, taken after the first
execution of the query:
"Limit (cost=46063.13..93586.79 rows=100 width=4) (actual
time=4963.620..39703.645 rows=100 loops=1)"
" -> Nested Loop (cost=46063.13..493736.04 rows=942 width=4) (actual
time=4963.617..39703.349 rows=100 loops=1)"
" Join Filter: (registration_configuration.configuration_id =
enrollment.enrollment_configuration_id)"
" -> Nested Loop (cost=46063.13..493662.96 rows=3769 width=8) (actual
time=4963.517..39701.557 rows=159 loops=1)"
" -> Nested Loop (cost=46063.13..466987.33 rows=3769 width=8)
(actual time=4963.498..39698.542 rows=159 loops=1)"
" -> Hash Join (cost=46063.13..430280.76 rows=4984
width=8) (actual time=4963.464..39692.676 rows=216 loops=1)"
" Hash Cond: (participant.participant_id =
person.person_participant_id)"
" Join Filter: ((participant.participant_tsv ||
person.person_tsv) @@ to_tsquery('simple'::regconfig,
to_tsquerystring('Abigail'::text)))"
" -> Seq Scan on participant (cost=0.00..84680.85
rows=996741 width=42) (actual time=0.012..3132.944 rows=1007151 loops=1)"
" Filter: (((participant_type)::text =
'PERSON'::text) AND ((participant_status)::text = 'ACTIVE'::text))"
" -> Hash (cost=25495.39..25495.39 rows=1012539
width=38) (actual time=3145.628..3145.628 rows=1007151 loops=1)"
" Buckets: 2048 Batches: 128 Memory Usage:
556kB"
" -> Seq Scan on person (cost=0.00..25495.39
rows=1012539 width=38) (actual time=0.062..1582.990 rows=1007151 loops=1)"
" -> Index Scan using
idx_registration_registered_participant_id on registration (cost=0.00..7.35
rows=1 width=8) (actual time=0.018..0.019 rows=1 loops=216)"
" Index Cond:
(registration.registration_registered_participant_id =
person.person_participant_id)"
" -> Index Scan using enrollment_pkey on enrollment
(cost=0.00..7.07 rows=1 width=8) (actual time=0.011..0.013 rows=1 loops=159)"
" Index Cond: (enrollment.enrollment_id =
registration.registration_enrollment_id)"
" -> Materialize (cost=0.00..16.55 rows=1 width=4) (actual
time=0.002..0.005 rows=2 loops=159)"
" -> Nested Loop (cost=0.00..16.55 rows=1 width=4) (actual
time=0.056..0.077 rows=2 loops=1)"
" Join Filter:
(registration_configuration.configuration_context_id = context.context_id)"
" -> Index Scan using idx_configuration_type on
registration_configuration (cost=0.00..8.27 rows=1 width=8) (actual
time=0.018..0.022 rows=3 loops=1)"
" Index Cond: ((configuration_type)::text =
'VISITOR'::text)"
" Filter: (configuration_id IS NOT NULL)"
" -> Index Scan using idx_event_context_code on
event_context context (cost=0.00..8.27 rows=1 width=4) (actual
time=0.008..0.010 rows=1 loops=3)"
" Index Cond: ((context.context_code)::text =
'GB2TST2010A'::text)"
"Total runtime: 39775.578 ms"
The assumption seems to be correct: no indexes on the vectors are used, and
sequential scans are done instead:
Join Filter: ((participant.participant_tsv || person.person_tsv) @@
to_tsquery('simple'::regconfig, to_tsquerystring('Abigail'::text)))"
" -> Seq Scan on participant (cost=0.00..84680.85
rows=996741 width=42) (actual time=0.012..3132.944 rows=1007151 loops=1)"
" Filter: (((participant_type)::text =
'PERSON'::text) AND ((participant_status)::text = 'ACTIVE'::text))"
" -> Hash (cost=25495.39..25495.39 rows=1012539
width=38) (actual time=3145.628..3145.628 rows=1007151 loops=1)"
" Buckets: 2048 Batches: 128 Memory Usage:
556kB"
" -> Seq Scan on person (cost=0.00..25495.39
rows=1012539 width=38) (actual time=0.062..1582.990 rows=1007151 loops=1)"
After I removed one of the vectors from the query and used only a single
vector
...
and person.person_tsv @@ to_tsquery('simple', to_tsquerystring('Abigail'))
...
then execution was much faster: about 5 seconds.
The plan afterwards:
"Limit (cost=41.14..8145.82 rows=100 width=4) (actual time=3.776..13.454
rows=100 loops=1)"
" -> Nested Loop (cost=41.14..21923.77 rows=270 width=4) (actual
time=3.773..13.248 rows=100 loops=1)"
" -> Nested Loop (cost=41.14..19730.17 rows=270 width=8) (actual
time=3.760..11.971 rows=100 loops=1)"
" Join Filter: (registration_configuration.configuration_id =
enrollment.enrollment_configuration_id)"
" -> Nested Loop (cost=0.00..16.55 rows=1 width=4) (actual
time=0.051..0.051 rows=1 loops=1)"
" Join Filter:
(registration_configuration.configuration_context_id = context.context_id)"
" -> Index Scan using idx_configuration_type on
registration_configuration (cost=0.00..8.27 rows=1 width=8) (actual
time=0.020..0.022 rows=2 loops=1)"
" Index Cond: ((configuration_type)::text =
'VISITOR'::text)"
" Filter: (configuration_id IS NOT NULL)"
" -> Index Scan using idx_event_context_code on
event_context context (cost=0.00..8.27 rows=1 width=4) (actual
time=0.008..0.009 rows=1 loops=2)"
" Index Cond: ((context.context_code)::text =
'GB2TST2010A'::text)"
" -> Nested Loop (cost=41.14..19700.12 rows=1080 width=12)
(actual time=3.578..11.431 rows=269 loops=1)"
" -> Nested Loop (cost=41.14..12056.27 rows=1080 width=12)
(actual time=3.568..8.203 rows=269 loops=1)"
" -> Bitmap Heap Scan on person (cost=41.14..3687.07
rows=1080 width=4) (actual time=3.553..4.401 rows=346 loops=1)"
" Recheck Cond: (person_tsv @@
to_tsquery('simple'::regconfig, to_tsquerystring('Abigail'::text)))"
" -> Bitmap Index Scan on person_ft_index
(cost=0.00..40.87 rows=1080 width=0) (actual time=3.353..3.353 rows=1060
loops=1)"
" Index Cond: (person_tsv @@
to_tsquery('simple'::regconfig, to_tsquerystring('Abigail'::text)))"
" -> Index Scan using
idx_registration_registered_participant_id on registration (cost=0.00..7.74
rows=1 width=8) (actual time=0.006..0.007 rows=1 loops=346)"
" Index Cond:
(registration.registration_registered_participant_id =
person.person_participant_id)"
" -> Index Scan using enrollment_pkey on enrollment
(cost=0.00..7.07 rows=1 width=8) (actual time=0.006..0.007 rows=1 loops=269)"
" Index Cond: (enrollment.enrollment_id =
registration.registration_enrollment_id)"
" -> Index Scan using participant_pkey on participant (cost=0.00..8.11
rows=1 width=4) (actual time=0.007..0.009 rows=1 loops=100)"
" Index Cond: (participant.participant_id =
person.person_participant_id)"
" Filter: (((participant.participant_type)::text = 'PERSON'::text)
AND ((participant.participant_status)::text = 'ACTIVE'::text))"
"Total runtime: 13.858 ms"
Now the index on the vector was used:
"Recheck Cond: (person_tsv @@ to_tsquery('simple'::regconfig,
to_tsquerystring('Abigail'::text)))"
" -> Bitmap Index Scan on person_ft_index
(cost=0.00..40.87 rows=1080 width=0) (actual time=3.353..3.353 rows=1060
loops=1)"
" Index Cond: (person_tsv @@
to_tsquery('simple'::regconfig, to_tsquerystring('Abigail'::text)))"
So there is apparently a problem with vector concatenation: the indexes are
not used in that case. I also tried to use the vectors separately, combining
the individual vector @@ tsquery checks with OR, but it did not help much
(performance was better, but still over 20 seconds):
...
(participant.participant_tsv @@
to_tsquery('simple',to_tsquerystring('Abigail'))) OR (person.person_tsv @@
to_tsquery('simple',to_tsquerystring('Abigail')))
...
Is there a way to make this work with better performance? Or is it necessary
to create a single vector that contains data from multiple tables and then
index that? This would be rather problematic for us, because we use multiple
complex queries with a variable number of selected columns. I know that
another solution might be a UNION of multiple queries, each of which uses a
single vector, but that solution is inconvenient too.
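For reference, this is roughly the UNION variant I mean. It is only a sketch
(the joins and filters are abbreviated, and I have not tuned it); the point is
that each branch filters on a single tsvector, so each one can use its own GIN
index:

```sql
-- Sketch: one branch per tsvector column, so person_ft_index and
-- participant_ft_index can each be used; UNION removes duplicate ids.
select participant.participant_id
from participant participant
join person person on person.person_participant_id = participant.participant_id
-- ... same remaining joins and filters as in the original query ...
where person.person_tsv @@ to_tsquery('simple', to_tsquerystring('Abigail'))
union
select participant.participant_id
from participant participant
join person person on person.person_participant_id = participant.participant_id
-- ... same remaining joins and filters as in the original query ...
where participant.participant_tsv @@ to_tsquery('simple', to_tsquerystring('Abigail'))
limit 100;
```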
Greetings
Jan
--
Sent via pgsql-performance mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance