On 2/22/07, Teodor Sigaev <[EMAIL PROTECTED]> wrote:
>  From a previous discussion with Teodor, it would be better to store an
> int in the index instead of a text (it takes less space and is
> faster). I couldn't find any example so if anyone has an advice to fix
> that, it's welcome (mostly how to pack the trigram into an int instead
> of a text).

Something like that: [snip]

I worked on that this morning before I got your email. You'll find
attached a second version of the patch using int4 to store the
trigrams (it's not exactly what you gave me but it should work I
think).

I didn't see any improvement in terms of size of the index (14 MB for
642 738 rows in the index in both cases) or speed.
Our dictionary table contains 78367 words and its size is 3 MB.

Did I miss something?

If there's no benefit to use an int4, I propose we keep the text
version because it will be better if we add UTF-8 support someday.

our query. So, our consistent function should say FALSE when indexed value is
not similar to query exactly and TRUE in opposite case. Let lquery is a length
of query and ncommon is a number of common trigrams (look at cnt_sml()
function), and consistent function should be:

Yes, that's what I did in my patch if I understand you correctly but
it's not very selective IMHO.

Now consistent function doesn't guarantee exact result, so we should mark '%'
operator in CREATE OPERATOR CLASS with RECHECK option.

The attached patch adds a RECHECK too. It seems to work correctly but
the RECHECK COND costs a lot of time :/.

At the moment, I have the following results:
GIST:
-------
test=# EXPLAIN ANALYZE SELECT word, similarity('aub', word) AS sml
FROM lieu_mots_gist WHERE word % 'aub' ORDER BY sml DESC;
                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Sort  (cost=204.40..204.59 rows=78 width=11) (actual
time=98.028..98.132 rows=50 loops=1)
  Sort Key: similarity('aub'::text, word)
  ->  Bitmap Heap Scan on lieu_mots_gist  (cost=4.88..201.95 rows=78
width=11) (actual time=97.575..97.861 rows=50 loops=1)
        Recheck Cond: (word % 'aub'::text)
        ->  Bitmap Index Scan on idx_word_gist  (cost=0.00..4.86
rows=78 width=0) (actual time=97.539..97.539 rows=50 loops=1)
              Index Cond: (word % 'aub'::text)
Total runtime: 98.303 ms
(7 rows)

test=# EXPLAIN ANALYZE SELECT word, similarity('auberge', word) AS sml
FROM lieu_mots_gist WHERE word % 'auberge' ORDER BY sml DESC;
                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Sort  (cost=204.40..204.59 rows=78 width=11) (actual
time=135.128..135.196 rows=41 loops=1)
  Sort Key: similarity('auberge'::text, word)
  ->  Bitmap Heap Scan on lieu_mots_gist  (cost=4.88..201.95 rows=78
width=11) (actual time=134.829..135.016 rows=41 loops=1)
        Recheck Cond: (word % 'auberge'::text)
        ->  Bitmap Index Scan on idx_word_gist  (cost=0.00..4.86
rows=78 width=0) (actual time=134.797..134.797 rows=41 loops=1)
              Index Cond: (word % 'auberge'::text)
Total runtime: 135.335 ms
(7 rows)

With GIN:
------------

test=# EXPLAIN ANALYZE SELECT word, similarity('aub', word) AS sml
FROM lieu_mots_gin WHERE word % 'aub' ORDER BY sml DESC;
                                                          QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Sort  (cost=208.45..208.64 rows=78 width=11) (actual
time=60.409..60.513 rows=50 loops=1)
  Sort Key: similarity('aub'::text, word)
  ->  Bitmap Heap Scan on lieu_mots_gin  (cost=8.93..205.99 rows=78
width=11) (actual time=10.279..60.228 rows=50 loops=1)
        Filter: (word % 'aub'::text)
        ->  Bitmap Index Scan on idx_word_gin  (cost=0.00..8.91
rows=78 width=0) (actual time=10.069..10.069 rows=6676 loops=1)
              Index Cond: (word % 'aub'::text)
Total runtime: 60.688 ms
(7 rows)

test=# EXPLAIN ANALYZE SELECT word, similarity('auberge', word) AS sml
FROM lieu_mots_gin WHERE word % 'auberge' ORDER BY sml DESC;
                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Sort  (cost=208.45..208.64 rows=78 width=11) (actual
time=50.293..50.381 rows=41 loops=1)
  Sort Key: similarity('auberge'::text, word)
  ->  Bitmap Heap Scan on lieu_mots_gin  (cost=8.93..205.99 rows=78
width=11) (actual time=21.006..50.122 rows=41 loops=1)
        Filter: (word % 'auberge'::text)
        ->  Bitmap Index Scan on idx_word_gin  (cost=0.00..8.91
rows=78 width=0) (actual time=20.839..20.839 rows=928 loops=1)
              Index Cond: (word % 'auberge'::text)
Total runtime: 50.534 ms
(7 rows)

--
Guillaume
Index: Makefile
===================================================================
RCS file: /projects/cvsroot/pgsql/contrib/pg_trgm/Makefile,v
retrieving revision 1.6
diff -c -r1.6 Makefile
*** Makefile    9 Feb 2007 17:24:33 -0000       1.6
--- Makefile    22 Feb 2007 13:57:24 -0000
***************
*** 1,7 ****
  # $PostgreSQL: pgsql/contrib/pg_trgm/Makefile,v 1.6 2007/02/09 17:24:33 
petere Exp $
  
  MODULE_big = pg_trgm
! OBJS = trgm_op.o trgm_gist.o 
  
  DATA_built = pg_trgm.sql
  DATA = uninstall_pg_trgm.sql
--- 1,7 ----
  # $PostgreSQL: pgsql/contrib/pg_trgm/Makefile,v 1.6 2007/02/09 17:24:33 
petere Exp $
  
  MODULE_big = pg_trgm
! OBJS = trgm_op.o trgm_gist.o trgm_gin.o
  
  DATA_built = pg_trgm.sql
  DATA = uninstall_pg_trgm.sql
Index: uninstall_pg_trgm.sql
===================================================================
RCS file: /projects/cvsroot/pgsql/contrib/pg_trgm/uninstall_pg_trgm.sql,v
retrieving revision 1.2
diff -c -r1.2 uninstall_pg_trgm.sql
*** uninstall_pg_trgm.sql       13 Mar 2006 18:04:58 -0000      1.2
--- uninstall_pg_trgm.sql       22 Feb 2007 13:57:24 -0000
***************
*** 20,25 ****
--- 20,33 ----
  
  DROP TYPE gtrgm CASCADE;
  
+ DROP OPERATOR CLASS gin_trgm_ops USING gin;
+ 
+ DROP FUNCTION gin_extract_trgm(text, internal);
+ 
+ DROP FUNCTION gin_extract_trgm(text, internal, internal);
+ 
+ DROP FUNCTION gin_trgm_consistent(internal, internal, text);
+ 
  DROP OPERATOR % (text, text);
  
  DROP FUNCTION similarity_op(text,text);
Index: pg_trgm.sql.in
===================================================================
RCS file: /projects/cvsroot/pgsql/contrib/pg_trgm/pg_trgm.sql.in,v
retrieving revision 1.2
diff -c -r1.2 pg_trgm.sql.in
*** pg_trgm.sql.in      27 Feb 2006 16:09:48 -0000      1.2
--- pg_trgm.sql.in      22 Feb 2007 13:57:24 -0000
***************
*** 36,42 ****
          JOIN = contjoinsel
  );
  
! --gist key
  CREATE FUNCTION gtrgm_in(cstring)
  RETURNS gtrgm
  AS 'MODULE_PATHNAME'
--- 36,42 ----
          JOIN = contjoinsel
  );
  
! -- gist key
  CREATE FUNCTION gtrgm_in(cstring)
  RETURNS gtrgm
  AS 'MODULE_PATHNAME'
***************
*** 53,59 ****
          OUTPUT = gtrgm_out
  );
  
! -- support functions
  CREATE FUNCTION gtrgm_consistent(gtrgm,internal,int4)
  RETURNS bool
  AS 'MODULE_PATHNAME'
--- 53,59 ----
          OUTPUT = gtrgm_out
  );
  
! -- support functions for gist
  CREATE FUNCTION gtrgm_consistent(gtrgm,internal,int4)
  RETURNS bool
  AS 'MODULE_PATHNAME'
***************
*** 89,95 ****
  AS 'MODULE_PATHNAME'
  LANGUAGE C;
  
! -- create the operator class
  CREATE OPERATOR CLASS gist_trgm_ops
  FOR TYPE text USING gist
  AS
--- 89,95 ----
  AS 'MODULE_PATHNAME'
  LANGUAGE C;
  
! -- create the operator class for gist
  CREATE OPERATOR CLASS gist_trgm_ops
  FOR TYPE text USING gist
  AS
***************
*** 103,107 ****
--- 103,133 ----
          FUNCTION        7       gtrgm_same (gtrgm, gtrgm, internal),
          STORAGE         gtrgm;
  
+ -- support functions for gin
+ CREATE FUNCTION gin_extract_trgm(text, internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C;
+ 
+ CREATE FUNCTION gin_extract_trgm(text, internal, internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C;
+ 
+ CREATE FUNCTION gin_trgm_consistent(internal, internal, text)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C;
+ 
+ -- create the operator class for gin
+ CREATE OPERATOR CLASS gin_trgm_ops
+ FOR TYPE text USING gin
+ AS
+         OPERATOR        1       % (text, text) RECHECK,
+         FUNCTION        1       btint4cmp (int4, int4),
+         FUNCTION        2       gin_extract_trgm (text, internal),
+         FUNCTION        3       gin_extract_trgm (text, internal, internal),
+         FUNCTION        4       gin_trgm_consistent (internal, internal, 
text),
+         STORAGE         int4;
  
  COMMIT;
Index: trgm.h
===================================================================
RCS file: /projects/cvsroot/pgsql/contrib/pg_trgm/trgm.h,v
retrieving revision 1.5
diff -c -r1.5 trgm.h
*** trgm.h      14 Jul 2006 05:28:27 -0000      1.5
--- trgm.h      22 Feb 2007 13:57:24 -0000
***************
*** 28,33 ****
--- 28,34 ----
        *(((char*)(a))+2) = *(((char*)(b))+2);  \
  } while(0);
  
+ #define TRGMINT(a) ( 
(*(((char*)(a))+2)<<16)+(*(((char*)(a))+1)<<8)+*(((char*)(a))+0) )
  
  typedef struct
  {
Index: contrib/pg_trgm/trgm_gin.c
===================================================================
RCS file: contrib/pg_trgm/trgm_gin.c
diff -N contrib/pg_trgm/trgm_gin.c
*** /dev/null   1 Jan 1970 00:00:00 -0000
--- contrib/pg_trgm/trgm_gin.c  1 Jan 1970 00:00:00 -0000
***************
*** 0 ****
--- 1,77 ----
+ #include "trgm.h"
+ 
+ #include "access/gin.h"
+ #include "access/itup.h"
+ #include "access/tuptoaster.h"
+ #include "storage/bufpage.h"
+ #include "utils/array.h"
+ #include "utils/builtins.h"
+ 
+ PG_FUNCTION_INFO_V1(gin_extract_trgm);
+ Datum         gin_extract_trgm(PG_FUNCTION_ARGS);
+ 
+ PG_FUNCTION_INFO_V1(gin_trgm_consistent);
+ Datum         gin_trgm_consistent(PG_FUNCTION_ARGS);
+ 
+ Datum
+ gin_extract_trgm(PG_FUNCTION_ARGS)
+ {
+       text            *val = (text *) PG_GETARG_TEXT_P(0);
+       int32           *nentries = (int32 *) PG_GETARG_POINTER(1);
+       Datum           *entries = NULL;
+       TRGM            *trg;
+       int4            trglen;
+       
+       *nentries = 0;
+       
+       trg = generate_trgm(VARDATA(val), VARSIZE(val) - VARHDRSZ);
+       trglen = ARRNELEM(trg);
+       
+       if (trglen > 0)
+       {
+               trgm    *ptr;
+               int4    i = 0,
+                               item;
+               
+               *nentries = (int32) trglen;
+               entries = (Datum *) palloc(sizeof(Datum) * trglen);
+ 
+               ptr = GETARR(trg);
+               while (ptr - GETARR(trg) < ARRNELEM(trg))
+               {
+                       item = TRGMINT(ptr);
+                       entries[i++] = Int32GetDatum(item);
+                       
+                       ptr++;
+               }
+       }
+ 
+       PG_RETURN_POINTER(entries);
+ }
+ 
+ Datum
+ gin_trgm_consistent(PG_FUNCTION_ARGS)
+ {
+       bool            *check = (bool *) PG_GETARG_POINTER(0);
+       text            *query = (text *) PG_GETARG_TEXT_P(2);
+       bool            res = FALSE;
+       TRGM            *trg;
+       int4            i,
+                               trglen,
+                               ntrue = 0;
+       
+       trg = generate_trgm(VARDATA(query), VARSIZE(query) - VARHDRSZ);
+       trglen = ARRNELEM(trg);
+       
+       for (i = 0; i < trglen; i++)
+               if (check[i])
+                       ntrue ++;
+ 
+ #ifdef DIVUNION
+       res = (trglen == ntrue) ? true : ((((((float4) ntrue) / ((float4) 
(trglen - ntrue)))) >= trgm_limit) ? true : false);
+ #else
+       res = (trglen == 0) ? false : ((((((float4) ntrue) / ((float4) 
trglen))) >= trgm_limit) ? true : false);
+ #endif
+ 
+       PG_RETURN_BOOL(res);
+ }
---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Reply via email to