[PATCHES] Fix for stop words in thesaurus file

Bruce Momjian Thu, 08 Nov 2007 18:32:49 -0800

Tom Lane wrote:
> Bruce Momjian <[EMAIL PROTECTED]> writes:
> > Tom Lane wrote:
> >> One possible real solution would be to tweak the dictionary APIs so
> >> that the dictionaries can find out whether this is the first load during
> >> a session, or a reload, and emit notices only in the first case.
> 
> > Yea, that would work too.  Or just throw an error for a stop word in the
> > file and then you never get a reload (use "*" instead).
> 
> Hm, that's a thought --- it'd be a way to solve the problem without an
> API change for dictionaries, which is something to avoid at this late
> stage of the 8.3 cycle.  Come to think of it, does the ts_cache stuff
> work properly when an error is thrown in dictionary load (ie, is the
> cache entry left in a sane state)?


I have developed the attached patch, which uses "?" to mark stop words in
the thesaurus file.  ("*" already has a meaning in the file, as in "*sn".)
I updated the docs to use "?", which makes the documentation clearer too.

The patch also re-enables testing of stop words in the thesaurus file.

FYI, a literal stop word in the thesaurus file no longer produces a
NOTICE; it now throws an error that says to use "?" instead.
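
As a rough illustration of the rule only, here is a standalone sketch of
how a sample word ends up being treated at load time.  It is not code
from the patch: the enum, is_stop_word() and classify_sample_word() are
made-up names, and the fixed stop word list stands in for whatever the
real subdictionary would report.  It also leaves out the separate error
for words the subdictionary does not recognize at all.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical outcome codes; nothing like this exists in dict_thesaurus.c. */
typedef enum
{
	SAMPLE_STOP_MARKER,			/* "?": position where any stop word may appear */
	SAMPLE_STOP_WORD_ERROR,		/* literal stop word: rejected with an error */
	SAMPLE_LEXEME				/* ordinary word, handed to the subdictionary */
} SampleKind;

/* Stand-in for the subdictionary's stop word test (an assumption). */
static int
is_stop_word(const char *word)
{
	static const char *const stop_words[] = {"a", "the", "is"};
	size_t		i;

	for (i = 0; i < sizeof(stop_words) / sizeof(stop_words[0]); i++)
	{
		if (strcmp(word, stop_words[i]) == 0)
			return 1;
	}
	return 0;
}

/* Decide how one sample word from a thesaurus rule is handled. */
static SampleKind
classify_sample_word(const char *word)
{
	assert(word != NULL);

	/* The marker is checked first, before the stop word lookup. */
	if (strcmp(word, "?") == 0)
		return SAMPLE_STOP_MARKER;

	/* A literal stop word used to get a NOTICE; now it is an error. */
	if (is_stop_word(word))
		return SAMPLE_STOP_WORD_ERROR;

	return SAMPLE_LEXEME;
}
```

With that stand-in list, "booking ? tickets" classifies as lexeme, marker,
lexeme, while "booking the tickets" hits the error case on "the".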

-- 
  Bruce Momjian  <[EMAIL PROTECTED]>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Index: doc/src/sgml/textsearch.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/textsearch.sgml,v
retrieving revision 1.30
diff -c -c -r1.30 textsearch.sgml
*** doc/src/sgml/textsearch.sgml	5 Nov 2007 15:55:53 -0000	1.30
--- doc/src/sgml/textsearch.sgml	9 Nov 2007 02:26:17 -0000
***************
*** 2258,2277 ****
     </para>
  
     <para>
!     Stop words recognized by the subdictionary are replaced by a <quote>stop
!     word placeholder</quote> to record their position. To illustrate this,
!     consider these phrases:
  
  <programlisting>
! a one the two : swsw
! the one a two : swsw2
  </programlisting>
  
!     Assuming that <literal>a</> and <literal>the</> are stop words according
!     to the subdictionary, these two phrases are identical to the thesaurus:
!     they both look like <replaceable>stopword</> <literal>one</>
!     <replaceable>stopword</> <literal>two</>.  Input matching this pattern
!     will be replaced by <literal>swsw2</>, according to the tie-breaking rule.
     </para>
  
     <para>
--- 2258,2274 ----
     </para>
  
     <para>
!     Specific stop words recognized by the subdictionary cannot be
!     specified;  instead use <literal>?</> to mark the location where any
!     stop word can appear.  For example, assuming that <literal>a</> and
!     <literal>the</> are stop words according to the subdictionary:
  
  <programlisting>
! ? one ? two : swsw
  </programlisting>
  
!     matches <literal>a one the two</> and <literal>the one a two</>;
!     both would be replaced by <literal>swsw</>.
     </para>
  
     <para>
Index: src/backend/tsearch/dict_thesaurus.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/tsearch/dict_thesaurus.c,v
retrieving revision 1.5
diff -c -c -r1.5 dict_thesaurus.c
*** src/backend/tsearch/dict_thesaurus.c	9 Nov 2007 01:32:22 -0000	1.5
--- src/backend/tsearch/dict_thesaurus.c	9 Nov 2007 02:26:17 -0000
***************
*** 412,458 ****
  	{
  		TSLexeme   *ptr;
  
! 		ptr = (TSLexeme *) DatumGetPointer(FunctionCall4(&(d->subdict->lexize),
! 									   PointerGetDatum(d->subdict->dictData),
! 										  PointerGetDatum(d->wrds[i].lexeme),
! 									Int32GetDatum(strlen(d->wrds[i].lexeme)),
! 													 PointerGetDatum(NULL)));
! 
! 		if (!ptr)
! 			elog(ERROR, "thesaurus word-sample \"%s\" isn't recognized by subdictionary (rule %d)",
! 				 d->wrds[i].lexeme, d->wrds[i].entries->idsubst + 1);
! 		else if (!(ptr->lexeme))
! 		{
! 			elog(NOTICE, "thesaurus word-sample \"%s\" is recognized as stop-word, assign any stop-word (rule %d)",
! 				 d->wrds[i].lexeme, d->wrds[i].entries->idsubst + 1);
! 
  			newwrds = addCompiledLexeme(newwrds, &nnw, &tnm, NULL, d->wrds[i].entries, 0);
- 		}
  		else
  		{
! 			while (ptr->lexeme)
  			{
! 				TSLexeme   *remptr = ptr + 1;
! 				int			tnvar = 1;
! 				int			curvar = ptr->nvariant;
! 
! 				/* compute n words in one variant */
! 				while (remptr->lexeme)
  				{
! 					if (remptr->nvariant != (remptr - 1)->nvariant)
! 						break;
! 					tnvar++;
! 					remptr++;
! 				}
! 
! 				remptr = ptr;
! 				while (remptr->lexeme && remptr->nvariant == curvar)
! 				{
! 					newwrds = addCompiledLexeme(newwrds, &nnw, &tnm, remptr, d->wrds[i].entries, tnvar);
! 					remptr++;
  				}
- 
- 				ptr = remptr;
  			}
  		}
  
--- 412,459 ----
  	{
  		TSLexeme   *ptr;
  
! 		if (strcmp(d->wrds[i].lexeme, "?") == 0)	/* Is stop word marker? */
  			newwrds = addCompiledLexeme(newwrds, &nnw, &tnm, NULL, d->wrds[i].entries, 0);
  		else
  		{
! 			ptr = (TSLexeme *) DatumGetPointer(FunctionCall4(&(d->subdict->lexize),
! 										   PointerGetDatum(d->subdict->dictData),
! 											  PointerGetDatum(d->wrds[i].lexeme),
! 										Int32GetDatum(strlen(d->wrds[i].lexeme)),
! 														 PointerGetDatum(NULL)));
! 	
! 			if (!ptr)
! 				elog(ERROR, "thesaurus word-sample \"%s\" isn't recognized by subdictionary (rule %d)",
! 					 d->wrds[i].lexeme, d->wrds[i].entries->idsubst + 1);
! 			else if (!(ptr->lexeme))
! 				elog(ERROR, "thesaurus word-sample \"%s\" is recognized as stop-word, use \"?\" for stop words instead (rule %d)",
! 					 d->wrds[i].lexeme, d->wrds[i].entries->idsubst + 1);
! 			else
  			{
! 				while (ptr->lexeme)
  				{
! 					TSLexeme   *remptr = ptr + 1;
! 					int			tnvar = 1;
! 					int			curvar = ptr->nvariant;
! 	
! 					/* compute n words in one variant */
! 					while (remptr->lexeme)
! 					{
! 						if (remptr->nvariant != (remptr - 1)->nvariant)
! 							break;
! 						tnvar++;
! 						remptr++;
! 					}
! 	
! 					remptr = ptr;
! 					while (remptr->lexeme && remptr->nvariant == curvar)
! 					{
! 						newwrds = addCompiledLexeme(newwrds, &nnw, &tnm, remptr, d->wrds[i].entries, tnvar);
! 						remptr++;
! 					}
! 	
! 					ptr = remptr;
  				}
  			}
  		}
  
Index: src/backend/tsearch/thesaurus_sample.ths
===================================================================
RCS file: /cvsroot/pgsql/src/backend/tsearch/thesaurus_sample.ths,v
retrieving revision 1.2
diff -c -c -r1.2 thesaurus_sample.ths
*** src/backend/tsearch/thesaurus_sample.ths	23 Sep 2007 15:58:58 -0000	1.2
--- src/backend/tsearch/thesaurus_sample.ths	9 Nov 2007 02:26:17 -0000
***************
*** 14,17 ****
  supernovae stars : *sn
  supernovae : *sn
  booking tickets : order invitation cards
! # booking the tickets : order invitation Cards
--- 14,18 ----
  supernovae stars : *sn
  supernovae : *sn
  booking tickets : order invitation cards
! booking ? tickets : order invitation Cards
! 
Index: src/test/regress/expected/tsdicts.out
===================================================================
RCS file: /cvsroot/pgsql/src/test/regress/expected/tsdicts.out,v
retrieving revision 1.3
diff -c -c -r1.3 tsdicts.out
*** src/test/regress/expected/tsdicts.out	23 Oct 2007 20:46:12 -0000	1.3
--- src/test/regress/expected/tsdicts.out	9 Nov 2007 02:26:20 -0000
***************
*** 311,318 ****
  (1 row)
  
  SELECT to_tsvector('thesaurus_tst', 'Booking tickets is looking like a booking a tickets');
!                              to_tsvector                             
! ---------------------------------------------------------------------
!  'book':8 'card':3 'like':6 'look':5 'invit':2 'order':1 'ticket':10
  (1 row)
  
--- 311,318 ----
  (1 row)
  
  SELECT to_tsvector('thesaurus_tst', 'Booking tickets is looking like a booking a tickets');
!                       to_tsvector                      
! -------------------------------------------------------
!  'card':3,10 'like':6 'look':5 'invit':2,9 'order':1,8
  (1 row)
