Re: [HACKERS] TS: Limited cover density ranking

2012-01-27 Thread Sushant Sinha
The rank counts 1/coversize, so bigger covers contribute little to the
rank anyway. What is the need for the patch?
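
For reference, the 1/coversize behavior is easy to demonstrate (a
sketch using the simple config to avoid stopword effects; the exact
values depend on the normalization options):

select ts_rank_cd(to_tsvector('simple', 'x y'), to_tsquery('simple', 'x & y'));
select ts_rank_cd(to_tsvector('simple', 'x a a a y'), to_tsquery('simple', 'x & y'));

The first query returns the larger rank because its only cover spans
two positions instead of five.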

-Sushant.

On Fri, 2012-01-27 at 18:06 +0200, karave...@mail.bg wrote:
> Hello, 
> 
> I have developed a variation of cover density ranking functions that
> counts only covers that are lesser than a specified limit. It is
> useful for finding combinations of terms that appear nearby one
> another. Here is an example of usage: 
> 
> -- normal cover density ranking : not changed 
> luben=> select ts_rank_cd(to_tsvector('a b c d e g h i j k'),
> to_tsquery('a&d')); 
> ts_rank_cd 
> ------------
> 0.033 
> (1 row) 
> 
> -- limited to 2 
> luben=> select ts_rank_cd(2, to_tsvector('a b c d e g h i j k'),
> to_tsquery('a&d')); 
> ts_rank_cd 
> ------------
> 0 
> (1 row) 
> 
> luben=> select ts_rank_cd(2, to_tsvector('a b c d e g h i j k a d'),
> to_tsquery('a&d')); 
> ts_rank_cd 
> ------------
> 0.1 
> (1 row) 
> 
> -- limited to 3 
> luben=> select ts_rank_cd(3, to_tsvector('a b c d e g h i j k'),
> to_tsquery('a&d')); 
> ts_rank_cd 
> ------------
> 0.033 
> (1 row) 
> 
> luben=> select ts_rank_cd(3, to_tsvector('a b c d e g h i j k a d'),
> to_tsquery('a&d')); 
> ts_rank_cd 
> ------------
> 0.13 
> (1 row) 
> 
> Find attached a patch against 9.1.2 sources. I preferred to make a
> patch, not a separate extension, because it is only a 1-statement change
> in the calc_rank_cd function. If I had to make an extension, a lot of code
> would be duplicated between backend/utils/adt/tsrank.c and the
> extension. 
> 
> I have some questions: 
> 
> 1. Is it interesting to develop it further (documentation, cleanup,
> etc) for inclusion in one of the next versions? If this is the case,
> there are some further questions: 
> 
> - should I overload ts_rank_cd (as in the examples above and the patch) or
> should I define a new set of functions, for example ts_rank_lcd? 
> - should I define these new SQL-level functions in core, or
> should I go only with this 2-line change in calc_rank_cd() and define
> the new functions in an extension? If we prefer the latter, could I
> overload core functions with functions defined in extensions? 
> - and finally, there is always the possibility to duplicate the code
> and make an independent extension. 
> 
> 2. If I run the patched version on a cluster that was initialized with
> an unpatched server, is there a way to register the new functions in the
> system catalog without reinitializing the cluster? 
> 
> Best regards 
> luben 
> 
> -- 
> Luben Karavelov





[HACKERS] bug in ts_rank_cd

2010-12-21 Thread Sushant Sinha
There is a bug in ts_rank_cd. It does not correctly give rank when the
query lexeme is the first one in the tsvector.

Example:

select ts_rank_cd(to_tsvector('english', 'abc sdd'),
plainto_tsquery('english', 'abc'));   
 ts_rank_cd 
------------
  0

select ts_rank_cd(to_tsvector('english', 'bcg abc sdd'),
plainto_tsquery('english', 'abc'));
 ts_rank_cd 
------------
0.1

The problem is that the cover-finding algorithm ignores the lexeme at
the 0th position. I have attached a patch that fixes it. After the
patch the result is correct.

select ts_rank_cd(to_tsvector('english', 'abc sdd'), plainto_tsquery(
'english', 'abc'));
 ts_rank_cd 
------------
0.1

--- postgresql-9.0.0/src/backend/utils/adt/tsrank.c	2010-01-02 22:27:55.0 +0530
+++ postgres-9.0.0-tsrankbugfix/src/backend/utils/adt/tsrank.c	2010-12-21 18:39:57.0 +0530
@@ -551,7 +551,7 @@
 	memset(qr->operandexist, 0, sizeof(bool) * qr->query->size);
 
 	ext->p = 0x7fffffff;
-	ext->q = 0;
+	ext->q = -1;
 	ptr = doc + ext->pos;
 
 	/* find upper bound of cover from current position, move up */



[HACKERS] bug in ts_rank_cd

2010-12-21 Thread Sushant Sinha
MY PREV EMAIL HAD A PROBLEM. Please reply to this one.
======================================================

There is a bug in ts_rank_cd. It does not correctly give rank when the
query lexeme is the first one in the tsvector.

Example:

select ts_rank_cd(to_tsvector('english', 'abc sdd'),
plainto_tsquery('english', 'abc'));   
 ts_rank_cd 
------------
  0

select ts_rank_cd(to_tsvector('english', 'bcg abc sdd'),
plainto_tsquery('english', 'abc'));
 ts_rank_cd 
------------
0.1

The problem is that the cover-finding algorithm ignores the lexeme at
the 0th position. I have attached a patch that fixes it. After the
patch the result is correct.

select ts_rank_cd(to_tsvector('english', 'abc sdd'), plainto_tsquery(
'english', 'abc'));
 ts_rank_cd 
------------
0.1

--- postgresql-9.0.0/src/backend/utils/adt/tsrank.c	2010-01-02 22:27:55.0 +0530
+++ postgres-9.0.0-tsrankbugfix/src/backend/utils/adt/tsrank.c	2010-12-21 18:39:57.0 +0530
@@ -551,7 +551,7 @@
 	memset(qr->operandexist, 0, sizeof(bool) * qr->query->size);
 
 	ext->p = 0x7fffffff;
-	ext->q = 0;
+	ext->q = -1;
 	ptr = doc + ext->pos;
 
 	/* find upper bound of cover from current position, move up */



Re: [HACKERS] bug in ts_rank_cd

2010-12-22 Thread Sushant Sinha
Sorry for the false alarm. I was not running vanilla Postgres, and that
is why I was seeing the problem. I should have checked with the vanilla
build first.

-Sushant

On Tue, 2010-12-21 at 23:03 -0500, Tom Lane wrote:
> Sushant Sinha  writes:
> > There is a bug in ts_rank_cd. It does not correctly give rank when the
> > query lexeme is the first one in the tsvector.
> 
> Hmm ... I cannot reproduce the behavior you're complaining of.
> You say
> 
> > select ts_rank_cd(to_tsvector('english', 'abc sdd'),
> > plainto_tsquery('english', 'abc'));   
> >  ts_rank_cd 
> > ------------
> >   0
> 
> but I get
> 
> regression=# select ts_rank_cd(to_tsvector('english', 'abc sdd'),
> regression(# plainto_tsquery('english', 'abc'));   
>  ts_rank_cd 
> 
> 0.1
> (1 row)
> 
> > The problem is that the Cover finding algorithm ignores the lexeme at
> > the 0th position,
> 
> As far as I can tell, there is no "0th position" --- tsvector counts
> positions from one.  The only way to see pos == 0 in the input to
> Cover() is if the tsvector has been stripped of position information.
> ts_rank_cd is documented to return 0 in that situation.  Your patch
> would have the effect of causing it to return some nonzero, but quite
> bogus, ranking.
> 
>   regards, tom lane
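
For the archives: the zero-rank behavior Tom describes is easy to
reproduce with strip(), which discards the position information (a
sketch):

select ts_rank_cd(strip(to_tsvector('english', 'abc sdd')),
plainto_tsquery('english', 'abc'));
 ts_rank_cd 
------------
          0
(1 row)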





Re: [HACKERS] english parser in text search: support for multiple words in the same position

2010-12-22 Thread Sushant Sinha
Just a reminder that this patch is discussing how to break URL, email,
and similar tokens into their components.

On Mon, Oct 4, 2010 at 3:54 AM, Tom Lane  wrote:

> [ sorry for not responding on this sooner, it's been hectic the last
>  couple weeks ]
>
> Sushant Sinha  writes:
> >> I looked at this patch a bit.  I'm fairly unhappy that it seems to be
> >> inventing a brand new mechanism to do something the ts parser can
> >> already do.  Why didn't you code the url-part mechanism using the
> >> existing support for compound words?
>
> > I am not familiar with compound word implementation and so I am not sure
> > how to split a url with compound word support. I looked into the
> > documentation for compound words and that does not say much about how to
> > identify components of a token.
>
> IIRC, the way that that works is associated with pushing a sub-state
> of the state machine in order to scan each compound-word part.  I don't
> have the details in my head anymore, though I recall having traced
> through it in the past.  Look at the state machine actions that are
> associated with producing the compound word tokens and sub-tokens.
>

I did look around for compound word support in postgres. In particular, I
read the documentation and code in tsearch/spell.c that seems to implement
the compound word support.

So in my understanding the way it works is:

1. Specify a dictionary of words in which each word will have applicable
prefix/suffix flags
2. Specify a flag file that provides prefix/suffix operations on those flags
3. flag z indicates that a word in the dictionary can participate in
compound word splitting
4. When a token matches words specified in the dictionary (after applying
affix/suffix operations), the matching words are emitted as sub-words of the
token (i.e., compound word)

If my above understanding is correct, then I think it will not be possible
to implement url/email splitting using the compound word support.

The main reason is that the compound word support requires the
"PRE-DETERMINED" dictionary of words. So to split a url/email we will need
to provide a list of *all possible* host names and user names. I do not
think that is a possibility.
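
For illustration, compound-word support is configured through an Ispell
dictionary built from files that enumerate the permissible words (a
sketch; the dictionary and affix file names here are hypothetical):

CREATE TEXT SEARCH DICTIONARY url_compound (
    TEMPLATE = ispell,
    DictFile = url_words,    -- hypothetical tsearch_data/url_words.dict
    AffFile  = url_affixes   -- hypothetical tsearch_data/url_affixes.affix
);

There is no such finite word list for hostnames and usernames, which is
the limitation described above.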

Please correct me if I have misunderstood something. 

-Sushant.


Re: [HACKERS] english parser in text search: support for multiple words in the same position

2011-01-06 Thread Sushant Sinha
I do not know if this mail got lost in between or no one noticed it!

On Thu, 2010-12-23 at 11:05 +0530, Sushant Sinha wrote:
> Just a reminder that this patch is discussing how to break URL, email,
> and similar tokens into their components.
> 
> On Mon, Oct 4, 2010 at 3:54 AM, Tom Lane  wrote:
> [ sorry for not responding on this sooner, it's been hectic the last
>   couple weeks ]
> 
> Sushant Sinha  writes:
> >> I looked at this patch a bit.  I'm fairly unhappy that it seems to be
> >> inventing a brand new mechanism to do something the ts parser can
> >> already do.  Why didn't you code the url-part mechanism using the
> >> existing support for compound words?
> 
> > I am not familiar with compound word implementation and so I am not sure
> > how to split a url with compound word support. I looked into the
> > documentation for compound words and that does not say much about how to
> > identify components of a token.
> 
> IIRC, the way that that works is associated with pushing a sub-state
> of the state machine in order to scan each compound-word part.  I don't
> have the details in my head anymore, though I recall having traced
> through it in the past.  Look at the state machine actions that are
> associated with producing the compound word tokens and sub-tokens.

I did look around for compound word support in postgres. In particular,
I read the documentation and code in tsearch/spell.c that seems to
implement the compound word support. 

So in my understanding the way it works is:

1. Specify a dictionary of words in which each word will have applicable
prefix/suffix flags

2. Specify a flag file that provides prefix/suffix operations on those
flags

3. flag z indicates that a word in the dictionary can participate in
compound word splitting

4. When a token matches words specified in the dictionary (after
applying affix/suffix operations), the matching words are emitted as
sub-words of the token (i.e., compound word)

If my above understanding is correct, then I think it will not be
possible to implement url/email splitting using the compound word
support.

The main reason is that the compound word support requires the
"PRE-DETERMINED" dictionary of words. So to split a url/email we will
need to provide a list of *all possible* host names and user names. I do
not think that is a possibility.

Please correct me if I have misunderstood something.

-Sushant. 





Re: [HACKERS] text search: restricting the number of parsed words in headline generation

2012-08-15 Thread Sushant Sinha
I will do the profiling and present the results.

On Wed, 2012-08-15 at 12:45 -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > Is this a TODO?
> 
> AFAIR nothing's been done about the speed issue, so yes.  I didn't
> like the idea of creating a user-visible knob when the speed issue
> might be fixable with internal algorithm improvements, but we never
> followed up on this in either fashion.
> 
>   regards, tom lane
> 
> > -------------------------------------------------------------------------
> 
> > On Tue, Aug 23, 2011 at 10:31:42PM -0400, Tom Lane wrote:
> >> Sushant Sinha  writes:
> >>> Doesn't this force the headline to be taken from the first N words of
> >>> the document, independent of where the match was?  That seems rather
> >>> unworkable, or at least unhelpful.
> >> 
> >>> In headline generation function, we don't have any index or knowledge of
> >>> where the match is. We discover the matches by first tokenizing and then
> >>> comparing the matches with the query tokens. So it is hard to do
> >>> anything better than first N words.
> >> 
> >> After looking at the code in wparser_def.c a bit more, I wonder whether
> >> this patch is doing what you think it is.  Did you do any profiling to
> >> confirm that tokenization is where the cost is?  Because it looks to me
> >> like the match searching in hlCover() is at least O(N^2) in the number
> >> of tokens in the document, which means it's probably the dominant cost
> >> for any long document.  I suspect that your patch helps not so much
> >> because it saves tokenization costs as because it bounds the amount of
> >> effort spent in hlCover().
> >> 
> >> I haven't tried to do anything about this, but I wonder whether it
> >> wouldn't be possible to eliminate the quadratic blowup by saving more
> >> state across the repeated calls to hlCover().  At the very least, it
> >> shouldn't be necessary to find the last query-token occurrence in the
> >> document from scratch on each and every call.
> >> 
> >> Actually, this code seems probably flat-out wrong: won't every
> >> successful call of hlCover() on a given document return exactly the same
> >> q value (end position), namely the last token occurrence in the
> >> document?  How is that helpful?
> >> 
> >> regards, tom lane
> >> 






Re: [HACKERS] english parser in text search: support for multiple words in the same position

2010-08-31 Thread Sushant Sinha
I have attached a patch that emits parts of a host token, a url token,
an email token and a file token. Further, it makes sure that a
host/url/email/file token and the first part-token are at the same
position in tsvector.

The two major changes are:

1. Tokenization changes: The patch exploits the special handlers in the
text parser to reset the parser position to the start of a
host/url/email/file token when it finds one. Special handlers were
already used for extracting host and urlpath from a full url. So this is
more of an extension of the same idea.

2. Position changes: We do not advance position when we encounter a
host/url/email/file token. As a result the first part of that token
aligns with the token itself.

Attachments:

tokens_output.txt: sample queries and results with the patch
token_v1.patch: patch against CVS HEAD

Currently, the patch outputs parts of the tokens as normal tokens like
WORD, NUMWORD, etc. Tom argued earlier that this will break
backward compatibility and so they should be output as parts of the
respective tokens. If there is agreement on what Tom says, then the
current patch can be modified to output subtokens as parts. However,
before I complicate the patch with that, I wanted to get feedback on any
other major problems with the patch.

-Sushant.

On Mon, 2010-08-02 at 10:20 -0400, Tom Lane wrote:
> Sushant Sinha  writes:
> >> This would needlessly increase the number of tokens. Instead you'd 
> >> better make it work like compound word support, having just "wikipedia" 
> >> and "org" as tokens.
> 
> > The current text parser already returns url and url_path. That already
> > increases the number of unique tokens. I am only asking for adding of
> > normal english words as well so that if someone types only "wikipedia"
> > he gets a match. 
> 
> The suggestion to make it work like compound words is still a good one,
> ie given wikipedia.org you'd get back
> 
>   hostwikipedia.org
>   host-part   wikipedia
>   host-part   org
> 
> not just the "host" token as at present.
> 
> Then the user could decide whether he needed to index hostname
> components or not, by choosing whether to forward hostname-part
> tokens to a dictionary or just discard them.
> 
> If you submit a patch that tries to force the issue by classifying
> hostname parts as plain words, it'll probably get rejected out of
> hand on backwards-compatibility grounds.
> 
>   regards, tom lane

1. FILEPATH

testdb=# SELECT ts_debug('/stuff/index.html');
                                     ts_debug
-----------------------------------------------------------------------------------
 (file,"File or path name",/stuff/index.html,{simple},simple,{/stuff/index.html})
 (blank,"Space symbols",/,{},,)
 (asciiword,"Word, all ASCII",stuff,{english_stem},english_stem,{stuff})
 (blank,"Space symbols",/,{},,)
 (asciiword,"Word, all ASCII",index,{english_stem},english_stem,{index})
 (blank,"Space symbols",.,{},,)
 (asciiword,"Word, all ASCII",html,{english_stem},english_stem,{html})


SELECT to_tsvector('english', '/stuff/index.html');
to_tsvector 
-----------------------------------------------------
 '/stuff/index.html':0 'html':2 'index':1 'stuff':0
(1 row)

2. URL

testdb=# SELECT ts_debug('http://example.com/stuff/index.html');
                                       ts_debug
--------------------------------------------------------------------------------------
 (protocol,"Protocol head",http://,{},,)
 (url,URL,example.com/stuff/index.html,{simple},simple,{example.com/stuff/index.html})
 (host,Host,example.com,{simple},simple,{example.com})
 (asciiword,"Word, all ASCII",example,{english_stem},english_stem,{exampl})
 (blank,"Space symbols",.,{},,)
 (asciiword,"Word, all ASCII",com,{english_stem},english_stem,{com})
 (url_path,"URL path",/stuff/index.html,{simple},simple,{/stuff/index.html})
 (blank,"Space symbols",/,{},,)
 (asciiword,"Word, all ASCII",stuff,{english_stem},english_stem,{stuff})
 (blank,"Space symbols",/,{},,)
 (asciiword,"Word, all ASCII",index,{english_stem},english_stem,{index})
 (blank,"Space symbols",.,{},,)
 (asciiword,"Word, all ASCII",html,{english_stem},english_stem,{html})
(13 rows)

testdb=# SELECT to_tsvector('english', 'http://example.com/stuff/index.html');
  

Re: [HACKERS] english parser in text search: support for multiple words in the same position

2010-09-04 Thread Sushant Sinha
Updating the patch: it now emits parttoken and registers it with the
snowball config.

-Sushant.

On Fri, 2010-09-03 at 09:44 -0400, Robert Haas wrote:
> On Wed, Sep 1, 2010 at 2:42 AM, Sushant Sinha  wrote:
> > I have attached a patch that emits parts of a host token, a url token,
> > an email token and a file token. Further, it makes sure that a
> > host/url/email/file token and the first part-token are at the same
> > position in tsvector.
> 
> You should probably add this patch here:
> 
> https://commitfest.postgresql.org/action/commitfest_view/open
> 

Index: src/backend/snowball/snowball.sql.in
===
RCS file: /projects/cvsroot/pgsql/src/backend/snowball/snowball.sql.in,v
retrieving revision 1.6
diff -u -r1.6 snowball.sql.in
--- src/backend/snowball/snowball.sql.in	27 Oct 2007 16:01:08 -	1.6
+++ src/backend/snowball/snowball.sql.in	4 Sep 2010 02:59:10 -
@@ -22,6 +22,6 @@
 	WITH _ASCDICTNAME_;
 
 ALTER TEXT SEARCH CONFIGURATION _CFGNAME_ ADD MAPPING
-FOR word, hword_part, hword
+FOR word, hword_part, hword, parttoken
 	WITH _NONASCDICTNAME_;
 
Index: src/backend/tsearch/ts_parse.c
===
RCS file: /projects/cvsroot/pgsql/src/backend/tsearch/ts_parse.c,v
retrieving revision 1.17
diff -u -r1.17 ts_parse.c
--- src/backend/tsearch/ts_parse.c	26 Feb 2010 02:01:05 -	1.17
+++ src/backend/tsearch/ts_parse.c	4 Sep 2010 02:59:11 -
@@ -19,7 +19,7 @@
 #include "tsearch/ts_utils.h"
 
 #define IGNORE_LONGLEXEME	1
-
+#define COMPLEX_TOKEN(x) ( x == 4 || x == 5 || x == 6 || x == 17 || x == 18 || x == 19 )	/* EMAIL, URL_T, HOST, HWORD, URLPATH, FILEPATH */
 /*
  * Lexize subsystem
  */
@@ -407,8 +407,6 @@
 		{
 			TSLexeme   *ptr = norms;
 
-			prs->pos++;			/* set pos */
-
 			while (ptr->lexeme)
 			{
 if (prs->curwords == prs->lenwords)
@@ -429,6 +427,10 @@
 prs->curwords++;
 			}
 			pfree(norms);
+
+			if (!COMPLEX_TOKEN(type)) 
+prs->pos++;			/* set pos */
+
 		}
 	} while (type > 0);
 
Index: src/backend/tsearch/wparser_def.c
===
RCS file: /projects/cvsroot/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.33
diff -u -r1.33 wparser_def.c
--- src/backend/tsearch/wparser_def.c	19 Aug 2010 05:57:34 -	1.33
+++ src/backend/tsearch/wparser_def.c	4 Sep 2010 02:59:12 -
@@ -23,7 +23,7 @@
 
 
 /* Define me to enable tracing of parser behavior */
-/* #define WPARSER_TRACE */
+//#define WPARSER_TRACE 
 
 
 /* Output token categories */
@@ -51,8 +51,9 @@
 #define SIGNEDINT		21
 #define UNSIGNEDINT		22
 #define XMLENTITY		23
+#define PARTTOKEN		24
 
-#define LASTNUM			23
+#define LASTNUM			24
 
 static const char *const tok_alias[] = {
 	"",
@@ -78,7 +79,8 @@
 	"float",
 	"int",
 	"uint",
-	"entity"
+	"entity",
+	"parttoken"
 };
 
 static const char *const lex_descr[] = {
@@ -105,7 +107,8 @@
 	"Decimal notation",
 	"Signed integer",
 	"Unsigned integer",
-	"XML entity"
+	"XML entity",
+"Part of file/url/host/email"
 };
 
 
@@ -249,7 +252,8 @@
 	TParserPosition *state;
 	bool		ignore;
 	bool		wanthost;
-
+	int 		partstop;
+	TParserState	afterpart;
 	/* silly char */
 	char		c;
 
@@ -617,8 +621,41 @@
 	}
 	return 1;
 }
+static int
+p_ispartbingo(TParser *prs)
+{
+	int ret = 0;
+	if (prs->partstop > 0)
+	{
+		ret = 1;
+		if (prs->partstop <= prs->state->posbyte)	
+		{
+			prs->state->state = prs->afterpart;
+			prs->partstop = 0;
+		}
+		else
+			prs->state->state = TPS_Base;
+	}
+	return ret; 
+}
 
+static int
+p_ispart(TParser *prs)
+{
+	if (prs->partstop > 0)
+		return  1;
+	else
+		return 0;
+}
 
+static int
+p_ispartEOF(TParser *prs)
+{
+	if (p_ispart(prs) && p_isEOF(prs))
+ 		return 1;
+	else
+		return 0;
+}
 /* deliberately suppress unused-function complaints for the above */
 void		_make_compiler_happy(void);
 void
@@ -688,6 +725,21 @@
 }
 
 static void
+SpecialPart(TParser *prs)
+{
+	prs->partstop = prs->state->posbyte;
+	prs->state->posbyte -= prs->state->lenbytetoken;
+	prs->state->poschar -= prs->state->lenchartoken;
+	prs->afterpart = TPS_Base;
+}
+static void
+SpecialUrlPart(TParser *prs)
+{
+	SpecialPart(prs);
+	prs->afterpart = TPS_InURLPathStart;
+}
+
+static void
 SpecialVerVersion(TParser *prs)
 {
 	prs->state->posbyte -= prs->state->lenbytetoken;
@@ -1057,6 +1109,7 @@
 	{p_iseqC, '-', A_PUSH, TPS_InSignedIntFirst, 0, NULL},
 	{p_iseqC, '+', A_PUSH, TPS_InSignedIntFirst, 0, NULL},
 	{p_iseqC, '&', A_PUSH, TPS_InXMLEntityFirst, 0, NULL},
+	{p_ispart, 0, A_NEXT, TPS_InSpace, 0, NULL},
 	{p_iseqC, '~', A_PUSH, TPS_InFileTwiddle, 0, NULL},
 	{p_iseqC, '/'

Re: [HACKERS] english parser in text search: support for multiple words in the same position

2010-09-08 Thread Sushant Sinha
For headline generation to work properly, the email/file/url/host tokens
need to become skip tokens. Updating the patch with that change.

-Sushant.

On Sat, 2010-09-04 at 13:25 +0530, Sushant Sinha wrote:
> Updating the patch with emitting parttoken and registering it with
> snowball config.
> 
> -Sushant.
> 
> On Fri, 2010-09-03 at 09:44 -0400, Robert Haas wrote:
> > On Wed, Sep 1, 2010 at 2:42 AM, Sushant Sinha  wrote:
> > > I have attached a patch that emits parts of a host token, a url token,
> > > an email token and a file token. Further, it makes sure that a
> > > host/url/email/file token and the first part-token are at the same
> > > position in tsvector.
> > 
> > You should probably add this patch here:
> > 
> > https://commitfest.postgresql.org/action/commitfest_view/open
> > 
> 

Index: src/backend/snowball/snowball.sql.in
===
RCS file: /projects/cvsroot/pgsql/src/backend/snowball/snowball.sql.in,v
retrieving revision 1.6
diff -u -r1.6 snowball.sql.in
--- src/backend/snowball/snowball.sql.in	27 Oct 2007 16:01:08 -	1.6
+++ src/backend/snowball/snowball.sql.in	7 Sep 2010 01:46:55 -
@@ -22,6 +22,6 @@
 	WITH _ASCDICTNAME_;
 
 ALTER TEXT SEARCH CONFIGURATION _CFGNAME_ ADD MAPPING
-FOR word, hword_part, hword
+FOR word, hword_part, hword, parttoken
 	WITH _NONASCDICTNAME_;
 
Index: src/backend/tsearch/ts_parse.c
===
RCS file: /projects/cvsroot/pgsql/src/backend/tsearch/ts_parse.c,v
retrieving revision 1.17
diff -u -r1.17 ts_parse.c
--- src/backend/tsearch/ts_parse.c	26 Feb 2010 02:01:05 -	1.17
+++ src/backend/tsearch/ts_parse.c	7 Sep 2010 01:46:55 -
@@ -19,7 +19,7 @@
 #include "tsearch/ts_utils.h"
 
 #define IGNORE_LONGLEXEME	1
-
+#define COMPLEX_TOKEN(x) ( x == 4 || x == 5 || x == 6 || x == 17 || x == 18 || x == 19 )	/* EMAIL, URL_T, HOST, HWORD, URLPATH, FILEPATH */
 /*
  * Lexize subsystem
  */
@@ -407,8 +407,6 @@
 		{
 			TSLexeme   *ptr = norms;
 
-			prs->pos++;			/* set pos */
-
 			while (ptr->lexeme)
 			{
 if (prs->curwords == prs->lenwords)
@@ -429,6 +427,10 @@
 prs->curwords++;
 			}
 			pfree(norms);
+
+			if (!COMPLEX_TOKEN(type)) 
+prs->pos++;			/* set pos */
+
 		}
 	} while (type > 0);
 
Index: src/backend/tsearch/wparser_def.c
===
RCS file: /projects/cvsroot/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.33
diff -u -r1.33 wparser_def.c
--- src/backend/tsearch/wparser_def.c	19 Aug 2010 05:57:34 -	1.33
+++ src/backend/tsearch/wparser_def.c	7 Sep 2010 01:46:56 -
@@ -23,7 +23,7 @@
 
 
 /* Define me to enable tracing of parser behavior */
-/* #define WPARSER_TRACE */
+//#define WPARSER_TRACE 
 
 
 /* Output token categories */
@@ -51,8 +51,9 @@
 #define SIGNEDINT		21
 #define UNSIGNEDINT		22
 #define XMLENTITY		23
+#define PARTTOKEN		24
 
-#define LASTNUM			23
+#define LASTNUM			24
 
 static const char *const tok_alias[] = {
 	"",
@@ -78,7 +79,8 @@
 	"float",
 	"int",
 	"uint",
-	"entity"
+	"entity",
+	"parttoken"
 };
 
 static const char *const lex_descr[] = {
@@ -105,7 +107,8 @@
 	"Decimal notation",
 	"Signed integer",
 	"Unsigned integer",
-	"XML entity"
+	"XML entity",
+"Part of file/url/host/email"
 };
 
 
@@ -249,7 +252,8 @@
 	TParserPosition *state;
 	bool		ignore;
 	bool		wanthost;
-
+	int 		partstop;
+	TParserState	afterpart;
 	/* silly char */
 	char		c;
 
@@ -617,8 +621,41 @@
 	}
 	return 1;
 }
+static int
+p_ispartbingo(TParser *prs)
+{
+	int ret = 0;
+	if (prs->partstop > 0)
+	{
+		ret = 1;
+		if (prs->partstop <= prs->state->posbyte)	
+		{
+			prs->state->state = prs->afterpart;
+			prs->partstop = 0;
+		}
+		else
+			prs->state->state = TPS_Base;
+	}
+	return ret; 
+}
 
+static int
+p_ispart(TParser *prs)
+{
+	if (prs->partstop > 0)
+		return  1;
+	else
+		return 0;
+}
 
+static int
+p_ispartEOF(TParser *prs)
+{
+	if (p_ispart(prs) && p_isEOF(prs))
+ 		return 1;
+	else
+		return 0;
+}
 /* deliberately suppress unused-function complaints for the above */
 void		_make_compiler_happy(void);
 void
@@ -688,6 +725,21 @@
 }
 
 static void
+SpecialPart(TParser *prs)
+{
+	prs->partstop = prs->state->posbyte;
+	prs->state->posbyte -= prs->state->lenbytetoken;
+	prs->state->poschar -= prs->state->lenchartoken;
+	prs->afterpart = TPS_Base;
+}
+static void
+SpecialUrlPart(TParser *prs)
+{
+	SpecialPart(prs);
+	prs->afterpart = TPS_InURLPathStart;
+}
+
+static void
 SpecialVerVersion(TParser *prs)
 {
 	prs->state->posbyte -= prs->state->lenbytetoken;
@@ -1057,6 +1109,7 @@
 	{p_iseqC, '-', A_PUSH, TPS_InSign

Re: [HACKERS] english parser in text search: support for multiple words in the same position

2010-09-21 Thread Sushant Sinha
> I looked at this patch a bit.  I'm fairly unhappy that it seems to be
> inventing a brand new mechanism to do something the ts parser can
> already do.  Why didn't you code the url-part mechanism using the
> existing support for compound words? 

I am not familiar with compound word implementation and so I am not sure
how to split a url with compound word support. I looked into the
documentation for compound words and that does not say much about how to
identify components of a token. Does compound-word splitting work by
matching against a list of words? If yes, then we will not be able to use
that, as we do not know all the words that can appear in a url/host/email/file.

I think another approach can be to use the dict_regex dictionary
support. However, we would have to keep the regex in sync with what the
parser is doing. 

The current patch is not inventing any new mechanism. It uses the
special handler mechanism already present in the parser. For example,
when the current parser finds a URL it runs a special handler called
SpecialFURL, which resets the parser position to the start of the token
to find the hostname. After finding the host it moves to finding the path. So
you first get the URL and then the host and finally the path.

Similarly, we are resetting the parser to the start of the token on
finding a url to output url parts. Then before entering the state that
can lead to a url we output the url part. The state machine modification
is similar for other tokens like file/email/host.


> The changes made to parsetext()
> seem particularly scary: it's not clear at all that that's not breaking
> unrelated behaviors.  In fact, the changes in the regression test
> results suggest strongly to me that it *is* breaking things.  Why are
> there so many diffs in examples that include no URLs at all?
> 

I think some of the difference is coming from the fact that now pos
starts with 0 and it used to be 1 earlier. That is easily fixable
though. 

> An issue that's nearly as bad is the 100% lack of documentation,
> which makes the patch difficult to review because it's hard to tell
> what it intends to accomplish or whether it's met the intent.
> The patch is not committable without documentation anyway, but right
> now I'm not sure it's even usefully reviewable.

I did not provide any explanation as I could not find any place in the
code to put the documentation (it was just a modification of the state
machine). Should I do a separate write-up to explain the desired output
and the changes to achieve it?

> 
> In line with the lack of documentation, I would say that the choice of
> the name "parttoken" for the new token type is not helpful.  Part of
> what?  And none of the other token type names include the word "token",
> so that's not a good decision either.  Possibly "url_part" would be a
> suitable name.
> 

I can modify it to output url-part/host-part/email-part/file-part if
there is an agreement over the rest of the issues. So let me know if I
should go ahead with this.

-Sushant.




Re: [HACKERS] Configuring Text Search parser?

2010-09-21 Thread Sushant Sinha
Your changes are mostly fine: they will get you tokens with "_"
characters in them. However, it is not nice to mix your new token with
an existing token like NUMWORD. Give a new name to your new type of
token, probably UnderscoreWord. Then, on seeing "_", move to a state
that can identify the new token. If you finally recognize that token,
then output it.

In order to extract portions of the newly created token,  you can write
a special handler for the token that resets the parser position to the
start of the token to get parts of it. And then modify the state machine
to output the part-token before going into the state that can lead to
the token that was identified earlier.


Look at these changes to the text parser as well:

http://archives.postgresql.org/pgsql-hackers/2010-09/msg4.php

-Sushant.


On Mon, 2010-09-20 at 16:01 +0200, jes...@krogh.cc wrote:
> Hi.
> 
> I'm trying to migrate an application off an existing Full Text Search engine
> and onto PostgreSQL .. one of my main (remaining) headaches are the
> fact that PostgreSQL treats _ as a seperation charachter whereas the existing
> behaviour is to "not split". That means:
> 
> testdb=# select ts_debug('database_tag_number_999');
>ts_debug
> ------------------------------------------------------------------------------
>  (asciiword,"Word, all ASCII",database,{english_stem},english_stem,{databas})
>  (blank,"Space symbols",_,{},,)
>  (asciiword,"Word, all ASCII",tag,{english_stem},english_stem,{tag})
>  (blank,"Space symbols",_,{},,)
>  (asciiword,"Word, all ASCII",number,{english_stem},english_stem,{number})
>  (blank,"Space symbols",_,{},,)
>  (uint,"Unsigned integer",999,{simple},simple,{999})
> (7 rows)
> 
> Where the incoming data, by design contains a set of tags which includes _
> and are expected to be one "lexeme".
> 
> I've tried patching my way out of this using this patch.
> 
> $ diff -w -C 5 src/backend/tsearch/wparser_def.c.orig
> src/backend/tsearch/wparser_def.c
> *** src/backend/tsearch/wparser_def.c.orig	2010-09-20 15:58:37.06460 +0200
> --- src/backend/tsearch/wparser_def.c 2010-09-20 15:58:41.193335577 +0200
> ***
> *** 967,986 
> --- 967,988 
> 
>   static const TParserStateActionItem actionTPS_InNumWord[] = {
>   {p_isEOF, 0, A_BINGO, TPS_Base, NUMWORD, NULL},
>   {p_isalnum, 0, A_NEXT, TPS_InNumWord, 0, NULL},
>   {p_isspecial, 0, A_NEXT, TPS_InNumWord, 0, NULL},
> + {p_iseqC, '_', A_NEXT, TPS_InNumWord, 0, NULL},
>   {p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
>   {p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
>   {p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
>   {p_iseqC, '-', A_PUSH, TPS_InHyphenNumWordFirst, 0, NULL},
>   {NULL, 0, A_BINGO, TPS_Base, NUMWORD, NULL}
>   };
> 
>   static const TParserStateActionItem actionTPS_InAsciiWord[] = {
>   {p_isEOF, 0, A_BINGO, TPS_Base, ASCIIWORD, NULL},
>   {p_isasclet, 0, A_NEXT, TPS_Null, 0, NULL},
> + {p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
>   {p_iseqC, '.', A_PUSH, TPS_InHostFirstDomain, 0, NULL},
>   {p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
>   {p_iseqC, '-', A_PUSH, TPS_InHostFirstAN, 0, NULL},
>   {p_iseqC, '-', A_PUSH, TPS_InHyphenAsciiWordFirst, 0, NULL},
>   {p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
> ***
> *** 995,1004 
> --- 997,1007 
> 
>   static const TParserStateActionItem actionTPS_InWord[] = {
>   {p_isEOF, 0, A_BINGO, TPS_Base, WORD_T, NULL},
>   {p_isalpha, 0, A_NEXT, TPS_Null, 0, NULL},
>   {p_isspecial, 0, A_NEXT, TPS_Null, 0, NULL},
> + {p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
>   {p_isdigit, 0, A_NEXT, TPS_InNumWord, 0, NULL},
>   {p_iseqC, '-', A_PUSH, TPS_InHyphenWordFirst, 0, NULL},
>   {NULL, 0, A_BINGO, TPS_Base, WORD_T, NULL}
>   };
> 
> 
> 
> This will obviously break other people's applications, so my question
> would be: if this should be made configurable, how should it be done?
> As a side note: Xapian doesn't split on _, but Lucene does.
> 
> Thanks.
> 
> -- 
> Jesper
> 
> 





Re: [HACKERS] english parser in text search: support for multiple words in the same position

2010-09-28 Thread Sushant Sinha
Any updates on this?


On Tue, Sep 21, 2010 at 10:47 PM, Sushant Sinha wrote:

> > I looked at this patch a bit.  I'm fairly unhappy that it seems to be
> > inventing a brand new mechanism to do something the ts parser can
> > already do.  Why didn't you code the url-part mechanism using the
> > existing support for compound words?
>
> I am not familiar with compound word implementation and so I am not sure
> how to split a url with compound word support. I looked into the
> documentation for compound words and that does not say much about how to
> identify components of a token. Does compound-word splitting work by
> matching against a list of words? If yes, then we will not be able to use
> that, as we do not know all the words that can appear in a url/host/email/file.
>
> I think another approach can be to use the dict_regex dictionary
> support. However, we would have to keep the regex in sync with what the
> parser is doing.
>
> The current patch is not inventing any new mechanism. It uses the
> special handler mechanism already present in the parser. For example,
> when the current parser finds a URL it runs a special handler called
> SpecialFURL, which resets the parser position to the start of the token
> to find the hostname. After finding the host it moves to finding the path. So
> you first get the URL and then the host and finally the path.
>
> Similarly, we are resetting the parser to the start of the token on
> finding a url to output url parts. Then before entering the state that
> can lead to a url we output the url part. The state machine modification
> is similar for other tokens like file/email/host.
>
>
> > The changes made to parsetext()
> > seem particularly scary: it's not clear at all that that's not breaking
> > unrelated behaviors.  In fact, the changes in the regression test
> > results suggest strongly to me that it *is* breaking things.  Why are
> > there so many diffs in examples that include no URLs at all?
> >
>
> I think some of the difference is coming from the fact that now pos
> starts with 0 and it used to be 1 earlier. That is easily fixable
> though.
>
> > An issue that's nearly as bad is the 100% lack of documentation,
> > which makes the patch difficult to review because it's hard to tell
> > what it intends to accomplish or whether it's met the intent.
> > The patch is not committable without documentation anyway, but right
> > now I'm not sure it's even usefully reviewable.
>
> I did not provide any explanation as I could not find any place in the
> code to put the documentation (it was just a modification of the state
> machine). Should I do a separate write-up to explain the desired output
> and the changes to achieve it?
>
> >
> > In line with the lack of documentation, I would say that the choice of
> > the name "parttoken" for the new token type is not helpful.  Part of
> > what?  And none of the other token type names include the word "token",
> > so that's not a good decision either.  Possibly "url_part" would be a
> > suitable name.
> >
>
> I can modify it to output url-part/host-part/email-part/file-part if
> there is an agreement over the rest of the issues. So let me know if I
> should go ahead with this.
>
> -Sushant.
>
>


Re: [HACKERS] Re: [GENERAL] Text search parser's treatment of URLs and emails

2010-10-12 Thread Sushant Sinha

On Tue, 2010-10-12 at 19:31 -0400, Tom Lane wrote:
> This seems much of a piece with the existing proposal to allow
> individual "words" of a URL to be reported separately:
> https://commitfest.postgresql.org/action/patch_view?id=378
> 
> As I said in that thread, this could be done in a backwards-compatible
> way using the tsearch parser's existing ability to report multiple
> overlapping tokens out of the same piece of text.  But I'd like to see
> one unified proposal and patch for this and Sushant's patch, not
> independent hacks changing the behavior in the same area.
> 
>   regards, tom lane
What Tom has suggested will require me to look into a different piece of
code, so it will take some time before I can update the patch.

-Sushant.





[HACKERS] planner row-estimates for tsvector seems horribly wrong

2010-10-24 Thread Sushant Sinha
I am using a GIN index on a tsvector and doing basic searches. I see that
the planner's row estimate is horribly wrong. It returns a row estimate
of 4843 for all queries, whether they match zero rows, a medium number of
rows (88,000), or a large number of rows (726,000).

The table has roughly a million docs.

I saw a similar problem reported here, but thought it was fixed in 9.0,
which I am running. 

http://archives.postgresql.org/pgsql-hackers/2010-05/msg01389.php

Here is the version info and detailed planner output for all the three
queries:


select version();
                                              version
----------------------------------------------------------------------------------------------------
 PostgreSQL 9.0.0 on x86_64-unknown-linux-gnu, compiled by GCC gcc (Gentoo 4.3.4 p1.1, pie-10.1.5) 4.3.4, 64-bit


Case I: FOR A NON-MATCHING WORD
===

explain analyze select count(*) from  docmeta,
plainto_tsquery('english', 'dyfdfdf') as qdoc where  docvector @@ qdoc;
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=20322.17..20322.18 rows=1 width=0) (actual
time=0.058..0.058 rows=1 loops=1)
   ->  Nested Loop  (cost=5300.28..20310.06 rows=4843 width=0) (actual
time=0.055..0.055 rows=0 loops=1)
 ->  Function Scan on qdoc  (cost=0.00..0.01 rows=1 width=32)
(actual time=0.005..0.005 rows=1 loops=1)
 ->  Bitmap Heap Scan on docmeta  (cost=5300.28..20249.51
rows=4843 width=270) (actual time=0.046..0.046 rows=0 loops=1)
   Recheck Cond: (docmeta.docvector @@ qdoc.qdoc)
   ->  Bitmap Index Scan on doc_index  (cost=0.00..5299.07
rows=4843 width=0) (actual time=0.044..0.044 rows=0 loops=1)
 Index Cond: (docmeta.docvector @@ qdoc.qdoc)
 Total runtime: 0.092 ms

CASE II: FOR A MEDIUM-MATCHING WORD
===
 explain analyze select count(*) from  docmeta,
plainto_tsquery('english', 'quit') as qdoc where  docvector @@ qdoc;
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=20322.17..20322.18 rows=1 width=0) (actual
time=1222.856..1222.857 rows=1 loops=1)
   ->  Nested Loop  (cost=5300.28..20310.06 rows=4843 width=0) (actual
time=639.275..1212.460 rows=88545 loops=1)
 ->  Function Scan on qdoc  (cost=0.00..0.01 rows=1 width=32)
(actual time=0.006..0.007 rows=1 loops=1)
 ->  Bitmap Heap Scan on docmeta  (cost=5300.28..20249.51
rows=4843 width=270) (actual time=639.264..1196.542 rows=88545 loops=1)
   Recheck Cond: (docmeta.docvector @@ qdoc.qdoc)
   ->  Bitmap Index Scan on doc_index  (cost=0.00..5299.07
rows=4843 width=0) (actual time=621.877..621.877 rows=88545 loops=1)
 Index Cond: (docmeta.docvector @@ qdoc.qdoc)
 Total runtime: 1222.907 ms


Case II: FOR A HIGH-MATCHING WORD
=

explain analyze select count(*) from  docmeta,
plainto_tsquery('english', 'j') as qdoc where  docvector @@ qdoc;
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=20322.17..20322.18 rows=1 width=0) (actual
time=742.857..742.858 rows=1 loops=1)
   ->  Nested Loop  (cost=5300.28..20310.06 rows=4843 width=0) (actual
time=126.804..660.895 rows=726985 loops=1)
 ->  Function Scan on qdoc  (cost=0.00..0.01 rows=1 width=32)
(actual time=0.004..0.006 rows=1 loops=1)
 ->  Bitmap Heap Scan on docmeta  (cost=5300.28..20249.51
rows=4843 width=270) (actual time=126.795..530.422 rows=726985 loops=1)
   Recheck Cond: (docmeta.docvector @@ qdoc.qdoc)
   ->  Bitmap Index Scan on doc_index  (cost=0.00..5299.07
rows=4843 width=0) (actual time=113.742..113.742 rows=726985 loops=1)
 Index Cond: (docmeta.docvector @@ qdoc.qdoc)
 Total runtime: 742.906 ms

Thanks,
Sushant.




Re: [HACKERS] planner row-estimates for tsvector seems horribly wrong

2010-10-24 Thread Sushant Sinha
Thanks a ton, Jan! That works correctly. But many tsearch tutorials ask
for the tsquery to be placed in the FROM clause, and that can cause a bad
plan. Isn't it possible to return the correct number for a join with the
query as well?

-Sushant.

On Sun, 2010-10-24 at 15:07 +0200, Jan Urbański wrote:
> On 24/10/10 14:44, Sushant Sinha wrote:
> > I am using gin index on a tsvector and doing basic search. I see the
> > row-estimate of the planner to be horribly wrong. It is returning
> > row-estimate as 4843 for all queries whether it matches zero rows, a
> > medium number of rows (88,000) or a large number of rows (726,000).
> > 
> > The table has roughly a million docs.
> 
> > explain analyze select count(*) from  docmeta,
> > plainto_tsquery('english', 'dyfdfdf') as qdoc where  docvector @@ qdoc;
> 
> OK, forget my previous message. The problem is that you are doing a join
> using @@ as the operator for the join condition, so the planner uses the
> operator's join selectivity estimate. For @@ the tsmatchjoinsel function
> simply returns 0.005; 0.005 of the table's estimated row count is where
> the constant 4843 estimate comes from.
> 
> Try doing:
> 
> explain analyze select count(*) from docmeta where docvector @@
> plainto_tsquery('english', 'dyfdfdf');
> 
> It should help.
> 
> Cheers,
> Jan





[HACKERS] pg_trgm: unicode string not working

2011-06-12 Thread Sushant Sinha
I am using pg_trgm for spelling correction as prescribed in the
documentation. But I see that it does not work for a Unicode string. The
database was initialized with UTF-8 encoding and the C locale.

Here is the table:
 \d words
 Table "public.words"
 Column |  Type   | Modifiers 
--------+---------+-----------
 word   | text    | 
 ndoc   | integer | 
 nentry | integer | 
Indexes:
"words_idx" gin (word gin_trgm_ops)

Query: select word from words where word % 'कतद';

I get an error:

ERROR:  GIN indexes do not support whole-index scans


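The trigrams that pg_trgm extracts from a string can be inspected
directly with show_trgm (a sketch; comparing the Devanagari string
against a plain ASCII word shows whether any trigrams are being
produced at all):

select show_trgm('कतद');
select show_trgm('word');
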
Any idea what is wrong?

-Sushant.




Re: [HACKERS] tsearch Parser Hacking

2011-02-14 Thread Sushant Sinha
I agree that it would be a good idea to rewrite the entire thing. However,
in the meantime, I had sent a proposal earlier:

http://archives.postgresql.org/pgsql-hackers/2010-08/msg00019.php

And a patch later:

http://archives.postgresql.org/pgsql-hackers/2010-09/msg00476.php

Tom asked me to look into compound-word support, but I found it unusable
for this purpose. Here was my response:
http://archives.postgresql.org/pgsql-hackers/2011-01/msg00419.php

I have not received any response since then.

-Sushant.


On Tue, Feb 15, 2011 at 9:33 AM, David E. Wheeler wrote:

> On Feb 14, 2011, at 3:57 PM, Tom Lane wrote:
>
> > There is zero, none, nada, provision for modifying the behavior of the
> > default parser, other than by changing its compiled-in state transition
> > tables.
> >
> > It doesn't help any that said tables are baroquely designed and utterly
> > undocumented.
> >
> > IMO, sooner or later we need to trash that code and replace it with
> > something a bit more modification-friendly.
>
> I was afraid you'd say that. Thanks.
>
> David
>


[HACKERS] english parser in text search: support for multiple words in the same position

2010-08-01 Thread Sushant Sinha
Currently the english parser in text search does not support multiple
words in the same position. Consider a word "wikipedia.org". The text
search would return a single token "wikipedia.org". However, if someone
searches for "wikipedia org", there will not be a match. There are
two problems here:

1. We do not have separate tokens "wikipedia" and "org"
2. If we have the two tokens we should have them at adjacent position so
that a phrase search for "wikipedia org" should work.
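
The first problem is easy to see directly (a sketch; the english
configuration keeps the whole hostname as a single host token):

select to_tsvector('english', 'wikipedia.org') @@ to_tsquery('english', 'wikipedia');
 ?column? 
----------
 f
(1 row)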

 It will be nice to have the following tokenization and positioning for
"wikipedia.org"

position 0: WORD(wikipedia), URL(wikipedia.org)
position 1: WORD(org)

Take the example of "wikipedia.org/search?q=sushant"

Here is the TSVECTOR:

select to_tsvector('english', 'wikipedia.org/search?q=sushant');

to_tsvector 
----------------------------------------------------------------
'/search?q=sushant':3 'wikipedia.org':2
'wikipedia.org/search?q=sushant':1

And here are the tokens:

select ts_debug('english', 'wikipedia.org/search?q=sushant');

                                         ts_debug
--------------------------------------------------------------------------------------------
 (url,URL,wikipedia.org/search?q=sushant,{simple},simple,{wikipedia.org/search?q=sushant})
 (host,Host,wikipedia.org,{simple},simple,{wikipedia.org})
 (url_path,"URL
path",/search?q=sushant,{simple},simple,{/search?q=sushant})

The tokenization I would like to see is:

position 0: WORD(wikipedia), URL(wikipedia.org/search?q=sushant)
position 1: WORD(org)
position 2: WORD(search), URL_PATH(/search?q=sushant)
position 3: WORD(q), URL_QUERY(q=sushant)
position 4: WORD(sushant)

So what we need is to support multiple tokens at the same position, and
I need help in understanding how to realize this. Currently the position
assignment happens in make_tsvector by walking over the parsed lexemes.
The lexemes are obtained from prsd_nexttoken.

However, prsd_nexttoken only returns a single token. Would it be possible
to store some tokens and return them together? Or can we put a flag on
certain tokens saying that the position should not be increased?

-Sushant.





Re: [HACKERS] english parser in text search: support for multiple words in the same position

2010-08-02 Thread Sushant Sinha
> On 08/01/2010 08:04 PM, Sushant Sinha wrote:
> > 1. We do not have separate tokens "wikipedia" and "org"
> > 2. If we have the two tokens we should have them at adjacent position so
> > that a phrase search for "wikipedia org" should work.
> 
> This would needlessly increase the number of tokens. Instead you'd 
> better make it work like compound word support, having just "wikipedia" 
> and "org" as tokens.

The current text parser already returns url and url_path. That already
increases the number of unique tokens. I am only asking for the addition
of normal English words as well, so that if someone types only "wikipedia"
he gets a match. 

> 
> Searching for "wikipedia.org" or "wikipedia org" should then result in 
> the same search query with the two tokens: "wikipedia" and "org".

People have expressed the need to index urls/emails before, and the
text parser already does so. Reverting that would be a regression in
functionality. Further, a ranking function can take advantage of a
direct match on a token.

> > position 0: WORD(wikipedia), URL(wikipedia.org/search?q=sushant)
> 
> IMO the differentiation between WORDs and URLs is not something the text 
> search engine should have to take care a lot. Let it just do the 
> searching and make it do that well.

The Postgres english parser already emits urls as tokens. The only thing
I am asking for is improved tokenization and positioning.

> What does a token "wikipedia.org/search?q=sushant" buy you in terms of 
> text searching? Or even result highlighting? I wouldn't expect anybody 
> to want to search for a full URL, do you?

Such a need has been expressed in the past. And an exact token match can
result in better ranking: for example, a tf-idf ranking will score a
match on such a unique token significantly higher.

-Sushant.

> Regards
> 
> Markus Wanner





Re: [HACKERS] english parser in text search: support for multiple words in the same position

2010-08-02 Thread Sushant Sinha
On Mon, 2010-08-02 at 09:32 -0400, Robert Haas wrote:
> On Mon, Aug 2, 2010 at 9:12 AM, Sushant Sinha  wrote:
> > The current text parser already returns url and url_path. That already
> > increases the number of unique tokens. I am only asking for adding of
> > normal english words as well so that if someone types only "wikipedia"
> > he gets a match.
> [...]
> > Postgres english parser already emits urls as tokens. Only thing I am
> > asking is on improving the tokenization and positioning.
> 
> Can you write a patch to implement your idea?
> 

Yes, that's what I am planning to do. I just wanted to see if anyone can
help me estimate whether this is doable in the current parser or whether
I need to write a new one. If possible, some ideas on how to go about
implementing it?

-Sushant.




[HACKERS] PL/Python: No stack trace for an exception

2011-07-21 Thread Sushant Sinha
I am using plpythonu on Postgres 9.0.2. One of my Python functions was
throwing a TypeError exception. However, I only see the exception message
in the database, not the stack trace. It becomes difficult to debug in
Python if the stack trace is absent.

logdb=# select get_words(forminput) from fi;   
ERROR:  PL/Python: TypeError: an integer is required
CONTEXT:  PL/Python function "get_words"


And here is the error if I run that function on the same data in python:

Traceback (most recent call last):
  File "valid.py", line 215, in 
parse_query(result['forminput'])
  File "valid.py", line 132, in parse_query
dateobj = datestr_to_obj(columnHash[column])
  File "valid.py", line 37, in datestr_to_obj
dateobj = datetime.date(words[2], words[1], words[0])
TypeError: an integer is required


Is this a known problem, or does it need addressing?

Thanks,
Sushant.




Re: [HACKERS] PL/Python: No stack trace for an exception

2011-07-21 Thread Sushant Sinha

On Thu, 2011-07-21 at 15:31 +0200, Jan Urbański wrote:
> On 21/07/11 15:27, Sushant Sinha wrote:
> > I am using plpythonu on postgres 9.0.2. One of my python functions was
> > throwing a TypeError exception. However, I only see the exception in the
> > database and not the stack trace. It becomes difficult to debug if the
> > stack trace is absent in Python.
> > 
> > logdb=# select get_words(forminput) from fi;   
> > ERROR:  PL/Python: TypeError: an integer is required
> > CONTEXT:  PL/Python function "get_words"
> > 
> > And here is the error if I run that function on the same data in python:
> > 
> > [traceback]
> > 
> > Is this a known problem or this needs addressing?
> 
> Yes, traceback support in PL/Python has already been implemented and is
> a new feature that will be available in PostgreSQL 9.1.
> 
> Cheers,
> Jan

Thanks Jan! Just one more reason to try 9.1.





[HACKERS] text search: restricting the number of parsed words in headline generation

2011-08-23 Thread Sushant Sinha
Given a document and a query, the goal of headline generation is to
produce text excerpts in which the query appears. Currently the headline
generation in postgres follows the following steps:

1. Tokenize the documents and obtain the lexemes
2. Decide on lexemes that should be the part of the headline
3. Generate the headline

So the time taken by the headline generation is directly dependent on
the size of the document. The longer the document, the more time taken
to tokenize and more lexemes to operate on.

Most of the time is taken during the tokenization phase and for very big
documents, the headline generation is very expensive. 

Here is a simple patch that limits the number of words during the
tokenization phase and puts an upper-bound on the headline generation.
The headline function takes a parameter MaxParsedWords. If this
parameter is negative or not supplied, then the entire document is
tokenized and operated on (the current behavior). However, if the
supplied MaxParsedWords is a positive number, then the tokenization
stops after MaxParsedWords tokens have been obtained. The remaining
headline generation happens on the tokens obtained up to that point.

The current patch can be applied to 9.1rc1. It lacks changes to the
documentation and test cases. I will add them if you folks agree on the
functionality.
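
With the patch applied, the limit would be passed like any other headline
option (a sketch of the proposed interface, not stock PostgreSQL; the
table, column, and limit value here are hypothetical):

SELECT ts_headline('english', docbody,
                   to_tsquery('english', 'rank'),
                   'MaxParsedWords=10000, MaxWords=35, MinWords=15')
FROM docs;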

-Sushant.
diff -ru postgresql-9.1rc1/src/backend/tsearch/ts_parse.c postgresql-9.1rc1-dev/src/backend/tsearch/ts_parse.c
--- postgresql-9.1rc1/src/backend/tsearch/ts_parse.c	2011-08-19 02:53:13.0 +0530
+++ postgresql-9.1rc1-dev/src/backend/tsearch/ts_parse.c	2011-08-23 21:27:10.0 +0530
@@ -525,10 +525,11 @@
 }
 
 void
-hlparsetext(Oid cfgId, HeadlineParsedText *prs, TSQuery query, char *buf, int buflen)
+hlparsetext(Oid cfgId, HeadlineParsedText *prs, TSQuery query, char *buf, int buflen, int max_parsed_words)
 {
 	int			type,
-lenlemm;
+lenlemm,
+numparsed = 0;
 	char	   *lemm = NULL;
 	LexizeData	ldata;
 	TSLexeme   *norms;
@@ -580,8 +581,8 @@
 			else
 addHLParsedLex(prs, query, lexs, NULL);
 		} while (norms);
-
-	} while (type > 0);
+		numparsed += 1;
+	} while (type > 0 && (max_parsed_words < 0 || numparsed < max_parsed_words));
 
 	FunctionCall1(&(prsobj->prsend), PointerGetDatum(prsdata));
 }
--- postgresql-9.1rc1/src/backend/tsearch/wparser.c	2011-08-19 02:53:13.0 +0530
+++ postgresql-9.1rc1-dev/src/backend/tsearch/wparser.c	2011-08-23 21:30:12.0 +0530
@@ -304,6 +304,8 @@
 	text	   *out;
 	TSConfigCacheEntry *cfg;
 	TSParserCacheEntry *prsobj;
+	ListCell   *l;
+	int			max_parsed_words = -1;
 
 	cfg = lookup_ts_config_cache(PG_GETARG_OID(0));
 	prsobj = lookup_ts_parser_cache(cfg->prsId);
@@ -317,13 +319,21 @@
 	prs.lenwords = 32;
 	prs.words = (HeadlineWordEntry *) palloc(sizeof(HeadlineWordEntry) * prs.lenwords);
 
-	hlparsetext(cfg->cfgId, &prs, query, VARDATA(in), VARSIZE(in) - VARHDRSZ);
 
 	if (opt)
 		prsoptions = deserialize_deflist(PointerGetDatum(opt));
 	else
 		prsoptions = NIL;
 
+	foreach(l, prsoptions)
+	{
+		DefElem    *defel = (DefElem *) lfirst(l);
+		char	   *val = defGetString(defel);
+		if (pg_strcasecmp(defel->defname, "MaxParsedWords") == 0)
+			max_parsed_words = pg_atoi(val, sizeof(int32), 0);
+	}
+
+	hlparsetext(cfg->cfgId, &prs, query, VARDATA(in), VARSIZE(in) - VARHDRSZ, max_parsed_words);
 	FunctionCall3(&(prsobj->prsheadline),
   PointerGetDatum(&prs),
   PointerGetDatum(prsoptions),
diff -ru postgresql-9.1rc1/src/include/tsearch/ts_utils.h postgresql-9.1rc1-dev/src/include/tsearch/ts_utils.h
--- postgresql-9.1rc1/src/include/tsearch/ts_utils.h	2011-08-19 02:53:13.0 +0530
+++ postgresql-9.1rc1-dev/src/include/tsearch/ts_utils.h	2011-08-23 21:04:14.0 +0530
@@ -98,7 +98,7 @@
  */
 
 extern void hlparsetext(Oid cfgId, HeadlineParsedText *prs, TSQuery query,
-			char *buf, int4 buflen);
+			char *buf, int4 buflen, int max_parsed_words);
 extern text *generateHeadline(HeadlineParsedText *prs);
 
 /*

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] text search: restricting the number of parsed words in headline generation

2011-08-23 Thread Sushant Sinha

> > Here is a simple patch that limits the number of words during the
> > tokenization phase and puts an upper-bound on the headline generation.
> 
> Doesn't this force the headline to be taken from the first N words of
> the document, independent of where the match was?  That seems rather
> unworkable, or at least unhelpful.
> 
>   regards, tom lane

In the headline generation function, we don't have any index or knowledge
of where the match is. We discover the matches by first tokenizing the
document and then comparing the tokens with the query tokens. So it is
hard to do anything better than the first N words.


One option could be to start looking for a "good match" while
tokenizing and then stop once we have found one. Currently the
algorithms that decide a good match operate independently of the
tokenization, and there are two of them. So integrating them would not be
easy.

The patch is very helpful if you believe in the common-case assumption
that most of the time a good match is near the top of the document.
Typically a search application generates headlines for the top matches
of a query, i.e., those in which the query terms appear frequently. So
there should be at least one or two good text excerpt matches near the
top of the document.



-Sushant.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] text search: restricting the number of parsed words in headline generation

2011-08-23 Thread Sushant Sinha
>
> Actually, this code seems probably flat-out wrong: won't every
> successful call of hlCover() on a given document return exactly the same
> q value (end position), namely the last token occurrence in the
> document?  How is that helpful?
>
>regards, tom lane
>

There is a line that saves the computation state from the previous call,
and the search only starts from there:

int pos = *p;


[HACKERS] lexemes in prefix search going through dictionary modifications

2011-10-25 Thread Sushant Sinha
I am currently using the prefix search feature in text search. I find
that the prefix characters are treated the same as a normal lexeme and
passed through stemming and stopword dictionaries. This seems like a bug
to me. 

db=# select to_tsquery('english', 's:*');
NOTICE:  text-search query contains only stop words or doesn't contain
lexemes, ignored
 to_tsquery 

 
(1 row)

db=# select to_tsquery('simple', 's:*');
 to_tsquery 

 's':*
(1 row)


I also think that this is a mistake. It should only be highlighting "s".
db=# select ts_headline('sushant', to_tsquery('simple', 's:*'));
  ts_headline   

 <b>sushant</b>


Thanks,
Sushant.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] lexemes in prefix search going through dictionary modifications

2011-10-25 Thread Sushant Sinha
On Tue, 2011-10-25 at 18:05 +0200, Florian Pflug wrote:
> On Oct25, 2011, at 17:26 , Sushant Sinha wrote:
> > I am currently using the prefix search feature in text search. I find
> > that the prefix characters are treated the same as a normal lexeme and
> > passed through stemming and stopword dictionaries. This seems like a bug
> > to me.
> 
> Hm, I don't think so. If they don't pass through stopword dictionaries,
> then queries containing stopwords will fail to find any rows - which is
> probably not what one would expect.

I think what you are calling a feature is really a bug. I am fairly sure
that when someone says to_tsquery('english', 's:*') one is looking for
an entry that has a *non-stopword* word that starts with 's'. And
especially so in a text search configuration that eliminates stop words. 

Does it even make sense to stem, abbreviate, or apply synonyms to a few
letters? It will be so unpredictable.
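
For example, the stemmer rewrites the prefix string itself before the
prefix match is applied (a sketch; the exact lexeme depends on the
snowball dictionary in use):

select to_tsquery('english', 'skies:*');
-- expected: 'sky':*, which no longer looks anything like the prefix
-- the user typed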

-Sushant.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] lexemes in prefix search going through dictionary modifications

2011-10-25 Thread Sushant Sinha
On Tue, 2011-10-25 at 19:27 +0200, Florian Pflug wrote:

> Assume, for example, that the postgres mailing list archive search used
> tsearch (which I think it does, but I'm not sure). It'd then probably make
> sense to add "postgres" to the list of stopwords, because it's bound to 
> appear in nearly every mail. But wouldn't you want searched which include
> 'postgres*' to turn up empty? Quite certainly not.

That improves recall for the "postgres:*" query but certainly doesn't
help other queries like "post:*". More importantly, it hurts precision
for all queries like "a:*", "an:*", "and:*", "s:*", 't:*', "the:*", etc.
(When that is the only search term, it also hurts recall, as no row
matches an empty tsquery.) Since stopwords tend to be short, prefix
search on a few characters becomes meaningless. And I would argue that
is when prefix search matters most -- when you only know a few
characters.


-Sushant.






-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] a tsearch issue

2011-11-06 Thread Sushant Sinha
On Fri, 2011-11-04 at 11:22 +0100, Pavel Stehule wrote:
> Hello
> 
> I found a interesting issue when I checked a tsearch prefix searching.
> 
> We use a ispell based dictionary
> 
> CREATE TEXT SEARCH DICTIONARY cspell
>(template=ispell, dictfile = czech, afffile=czech, stopwords=czech);
> CREATE TEXT SEARCH CONFIGURATION cs (copy=english);
> ALTER TEXT SEARCH CONFIGURATION cs
>ALTER MAPPING FOR word, asciiword WITH cspell, simple;
> 
> Then I created a table
> 
> postgres=# create table n(a varchar);
> CREATE TABLE
> postgres=# insert into n values('Stěhule'),('Chromečka');
> INSERT 0 2
> postgres=# select * from n;
>  a
> ───
>  Stěhule
>  Chromečka
> (2 rows)
> 
> and I tested a prefix searching:
> 
> I found a following issue
> 
> postgres=# select * from n where to_tsvector('cs', a) @@
> to_tsquery('cs','Stě:*') ;
>  a
> ───
> (0 rows)

Most likely you are being hit by this problem:
http://archives.postgresql.org/pgsql-hackers/2011-10/msg01347.php

'Stě' may be a stopword in Czech.
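
A quick way to check (a sketch, using the cspell dictionary defined
above): ts_lexize returns an empty array for a stopword and NULL for an
unrecognized word.

select ts_lexize('cspell', 'Stě');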

> I expected one row. The problem is in transformation of word 'Stě'
> 
> postgres=# select * from ts_debug('cs','Stě:*') ;
> ─[ RECORD 1 ]┬──
> alias│ word
> description  │ Word, all letters
> token│ Stě
> dictionaries │ {cspell,simple}
> dictionary   │ cspell
> lexemes  │ {sto}
> ─[ RECORD 2 ]┼──
> alias│ blank
> description  │ Space symbols
> token│ :*
> dictionaries │ {}
> dictionary   │ [null]
> lexemes  │ [null]
> 

':*' is specific to to_tsquery; ts_debug just invokes the parser, so
looking at its output here is not the correct way to check this.

-Sushant.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] lexemes in prefix search going through dictionary modifications

2011-11-08 Thread Sushant Sinha
I think there is a need to let prefix search bypass the dictionaries.
If you folks think that there is some credibility to such a need then I
can think about implementing it. How about an operator like ":#" that
does this? The ":*" will continue to mean the same as it does
currently.
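
To make the proposal concrete (hypothetical syntax, it would not parse
today):

select to_tsquery('english', 's:#');
-- would yield 's':* with no stemming or stopword filtering, i.e. the
-- same result that to_tsquery('simple', 's:*') gives now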

-Sushant.

On Tue, 2011-10-25 at 23:45 +0530, Sushant Sinha wrote:
> On Tue, 2011-10-25 at 19:27 +0200, Florian Pflug wrote:
> 
> > Assume, for example, that the postgres mailing list archive search used
> > tsearch (which I think it does, but I'm not sure). It'd then probably make
> > sense to add "postgres" to the list of stopwords, because it's bound to 
> > appear in nearly every mail. But wouldn't you want searched which include
> > 'postgres*' to turn up empty? Quite certainly not.
> 
> That improves recall for "postgres:*" query and certainly doesn't help
> other queries like "post:*". But more importantly it affects precision
> for all queries like "a:*", "an:*", "and:*", "s:*", 't:*', "the:*", etc
> (When that is the only search it also affects recall as no row matches
> an empty tsquery). Since stopwords are smaller, it means prefix search
> for a few characters is meaningless. And I would argue that is when the
> prefix search is more important -- only when you know a few characters.
> 
> 
> -Sushant



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Postgres 9.1: Adding rows to table causing too much latency in other queries

2011-12-19 Thread Sushant Sinha
I recently upgraded my postgres server from 9.0 to 9.1.2 and I am
seeing a peculiar problem. I have a program that periodically adds rows
to a table using INSERT. Typically the number of rows added is just 1-2
thousand when the table already has 500K rows. Whenever the program is
adding rows, the performance of the search query on the same table is
very bad. The query uses the gin index and the tsearch ranking function
ts_rank_cd. 


This never happened earlier with postgres 9.0. Is there a known issue
with Postgres 9.1? Or how should I report this problem?

-Sushant.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Postgres 9.1: Adding rows to table causing too much latency in other queries

2011-12-19 Thread Sushant Sinha
On Mon, 2011-12-19 at 19:08 +0200, Marti Raudsepp wrote:
> Another thought -- have you read about the GIN "fast updates" feature?
> This existed in 9.0 too. Instead of updating the index directly, GIN
> appends all changes to a sequential list, which needs to be scanned in
> whole for read queries. The periodic autovacuum process has to merge
> these values back into the index.
> 
> Maybe the solution is to tune autovacuum to run more often on the
> table.
> 
> http://www.postgresql.org/docs/9.1/static/gin-implementation.html
> 
> Regards,
> Marti 

Probably this is the problem. Is running "vacuum analyze" under psql
the same as "autovacuum"?
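
For reference, two knobs that seem worth trying here (a sketch; the
index and table names are placeholders). A plain VACUUM, which is what
autovacuum runs, also merges the GIN pending list; ANALYZE only adds
statistics collection on top of that.

-- flush pending entries on every insert instead of batching them
alter index docs_body_gin_idx set (fastupdate = off);
-- or make autovacuum visit this table much more often
alter table docs set (autovacuum_vacuum_scale_factor = 0.01);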

-Sushant.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Postgres 9.1: Adding rows to table causing too much latency in other queries

2011-12-19 Thread Sushant Sinha
On Mon, 2011-12-19 at 12:41 -0300, Euler Taveira de Oliveira wrote:
> On 19-12-2011 12:30, Sushant Sinha wrote:
> > I recently upgraded my postgres server from 9.0 to 9.1.2 and I am
> > finding a peculiar problem.I have a program that periodically adds
> rows
> > to this table using INSERT. Typically the number of rows is just 1-2
> > thousand when the table already has 500K rows. Whenever the program
> is
> > adding rows, the performance of the search query on the same table
> is
> > very bad. The query uses the gin index and the tsearch ranking
> function
> > ts_rank_cd. 
> > 
> How bad is bad? It seems you are suffering from don't-fit-on-cache
> problem, no? 

The memory is 32GB and the entire database is just 22GB. Even "vmstat 1"
does not show any disk activity. 

I was not able to isolate the performance numbers since I have observed
this only on the production box, where the number of requests keeps
increasing as the box gets loaded. But a query that normally takes 1 sec
is taking more than 10 secs (I am not sure whether it got the same
number of CPU cycles). Is there a way to find that out?
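
One way to compare runs (a sketch; the table, column, and query are
placeholders) is EXPLAIN with buffer accounting, which helps tell cache
effects apart from CPU time:

explain (analyze, buffers)
select title from docs
where body_tsv @@ to_tsquery('english', 'black & hole')
order by ts_rank_cd(body_tsv, to_tsquery('english', 'black & hole')) desc
limit 10;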

-Sushant.




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] lexeme ordering in tsvector

2009-11-30 Thread Sushant Sinha
It seems like the ordering of lexemes in tsvector has changed from 8.3
to 8.4.

For example in 8.3.1,

postgres=# select to_tsvector('english', 'quit everytime');
  to_tsvector  
---
 'quit':1 'everytim':2

The lexemes are arranged by length and then by string comparison.

In postgres 8.4.1,

select to_tsvector('english', 'quit everytime');
  to_tsvector  
---
 'everytim':2 'quit':1

they are arranged by strncmp and then by length.

I looked in tsvector_op.c: in the function tsCompareString, first a
memcmp and then a length comparison is done.

Was this change in ordering deliberate?

Wouldn't length comparison be cheaper than memcmp?

-Sushant.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline

2008-07-14 Thread Sushant Sinha
Attached a new patch that:

1. fixes the previous bug
2. better handles the case when the cover size is greater than MaxWords.
Basically it divides a cover greater than MaxWords into fragments of
MaxWords, resizes each such fragment so that each end of the fragment
contains a query word, and then evaluates the best fragments based on the
number of query words in each fragment. In case of a tie it picks the
smaller fragment. This allows more query words to be shown with multiple
fragments in case a single cover is larger than MaxWords.

Resizing a fragment so that each end is a query word provides room
for stretching both sides of the fragment. This (hopefully) better
presents the context in which query words appear in the document. If a
cover is smaller than MaxWords then the cover is treated as a fragment.
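
A sketch of the intended usage with the patch applied (no output shown,
since it depends on this patch):

select ts_headline('1 2 3 4 5 6 7 8 9 10 1 2 3', '1 & 3'::tsquery,
                   'MaxFragments=2, MaxWords=4, MinWords=2');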

Let me know if you have any more suggestions or anything is not clear.

I have not yet added the regression tests. The regression test suite seemed
to be only ensuring that the function works. How many tests should I be
adding? Is there any other place that I need to add different test cases for
the function?

-Sushant.


Nice. But it will be good to resolve following issues:
> 1) Patch contains mistakes, I didn't investigate or carefully read it. Get
> http://www.sai.msu.su/~megera/postgres/fts/apod.dump.gz and
> load it in the db.
>
> Queries
> # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1')
> from apod where to_tsvector(body) @@ plainto_tsquery('black hole');
>
> and
>
> # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1')
> from apod;
>
> crash postgresql :(
>
> 2) pls, include in your patch documentation and regression tests.
>
>
>> Another change that I was thinking:
>>
>> Right now if cover size > max_words then I just cut the trailing words.
>> Instead I was thinking that we should split the cover into more
>> fragments such that each fragment contains a few query words. Then each
>> fragment will not contain all query words but will show more occurrences
>> of query words in the headline. I would  like to know what your opinion
>> on this is.
>>
>
> Agreed.
>
>
> --
> Teodor Sigaev   E-mail: [EMAIL PROTECTED]
>   WWW:
> http://www.sigaev.ru/
>
Index: src/backend/tsearch/wparser_def.c
===
RCS file: /home/postgres/devel/pgsql-cvs/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.15
diff -c -r1.15 wparser_def.c
*** src/backend/tsearch/wparser_def.c	17 Jun 2008 16:09:06 -	1.15
--- src/backend/tsearch/wparser_def.c	15 Jul 2008 04:30:34 -
***
*** 1684,1701 
  	return false;
  }
  
! Datum
! prsd_headline(PG_FUNCTION_ARGS)
  {
! 	HeadlineParsedText *prs = (HeadlineParsedText *) PG_GETARG_POINTER(0);
! 	List	   *prsoptions = (List *) PG_GETARG_POINTER(1);
! 	TSQuery		query = PG_GETARG_TSQUERY(2);
  
! 	/* from opt + start and and tag */
! 	int			min_words = 15;
! 	int			max_words = 35;
! 	int			shortword = 3;
  
  	int			p = 0,
  q = 0;
  	int			bestb = -1,
--- 1684,1944 
  	return false;
  }
  
! static void 
! mark_fragment(HeadlineParsedText *prs, int highlight, int startpos, int endpos)
  {
! 	int   i;
! 	char *coversep = "... ";
! 	int   seplen   = strlen(coversep);
  
! 	for (i = startpos; i <= endpos; i++)
! 	{
! 		if (prs->words[i].item)
! 			prs->words[i].selected = 1;
! 		if (highlight == 0)
! 		{
! 			if (HLIDIGNORE(prs->words[i].type))
! prs->words[i].replace = 1;
! 		}
! 		else
! 		{
! 			if (XMLHLIDIGNORE(prs->words[i].type))
! prs->words[i].replace = 1;
! 		}
! 
! 		prs->words[i].in = (prs->words[i].repeated) ? 0 : 1;
! 	}
! 	/* add cover separators if needed */ 
! 	if (startpos > 0)
! 	{
! 		
! 		prs->words[startpos-1].word = repalloc(prs->words[startpos-1].word, sizeof(char) * seplen);
! 		prs->words[startpos-1].in   = 1;
! 		prs->words[startpos-1].len  = seplen;
! 		memcpy(prs->words[startpos-1].word, coversep, seplen);
! 	}
! }
! 
! typedef struct 
! {
! 	int4 startpos;
! 	int4 endpos;
! 	int4 poslen;
! 	int4 curlen;
! 	int2 in;
! 	int2 excluded;
! } CoverPos;
! 
! static void 
! get_next_fragment(HeadlineParsedText *prs, int *startpos, int *endpos,
! 			int *curlen, int *poslen, int max_words)
! {
! 	int i;
! 	/* Objective: Generate a fragment of words between startpos and endpos 
! 	 * such that it has at most max_words and both ends has query words. 
! 	 * If the startpos and endpos are the endpoints of the cover and the 
! 	 * cover has fewer words than max_words, then this function should 
! 	 * just return the cover 
! 	 */
! 	/* first move startpos to an item */
! 	for(i = *startpos; i <= *endpos; i++)
! 	{
! 		*startpos = i;
! 		if (prs->words[i].item && !prs->words[i].repeated)
! 			break;
! 	}
! 	/* cut endpos to have only max_words */
! 	*curlen = 0;
! 	*poslen = 0;
!

Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline

2008-07-15 Thread Sushant Sinha
Attached are two patches for headline with fragments:

1. documentation
2. regression tests

-Sushant.

On Tue, 2008-07-15 at 13:29 +0400, Teodor Sigaev wrote:
> > Attached a new patch that:
> > 
> > 1. fixes previous bug
> > 2. better handles the case when cover size is greater than the MaxWords. 
> 
> Looks good, I'll make some tests with  real-world application.
> 
> > I have not yet added the regression tests. The regression test suite 
> > seemed to be only ensuring that the function works. How many tests 
> > should I be adding? Is there any other place that I need to add 
> > different test cases for the function?
> 
> Just add 3-5 selects to src/test/regress/sql/tsearch.sql with checking basic 
> functionality and corner cases like
>   - there are no covers in the text
>   - Cover(s) is too big
>   - and so on
> 
> Add some words to the documentation too, pls.
> 
> 
Index: doc/src/sgml/textsearch.sgml
===
RCS file: /home/postgres/devel/pgsql-cvs/pgsql/doc/src/sgml/textsearch.sgml,v
retrieving revision 1.44
diff -c -r1.44 textsearch.sgml
*** doc/src/sgml/textsearch.sgml	16 May 2008 16:31:01 -	1.44
--- doc/src/sgml/textsearch.sgml	16 Jul 2008 02:37:28 -
***
*** 1100,1105 
--- 1100,1117 
   
   

+MaxFragments: maximum number of text excerpts 
+or fragments that match the query words. It also triggers a 
+different headline generation function than the default one. This
+function finds text fragments with as many query words as possible.
+Each fragment will be of at most MaxWords, and will not have words
+of size less than or equal to ShortWord at the start or end of a 
+fragment. If not all query words are found in the document, then
+a single fragment of MinWords will be displayed.
+   
+  
+  
+   
 HighlightAll: Boolean flag;  if
 true the whole document will be highlighted.

***
*** 1109,1115 
  Any unspecified options receive these defaults:
  
  
! StartSel=<b>, StopSel=</b>, MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE
  
 
  
--- 1121,1127 
  Any unspecified options receive these defaults:
  
  
! StartSel=<b>, StopSel=</b>, MaxFragments=0, MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE
  
 
  
Index: src/test/regress/sql/tsearch.sql
===
RCS file: /home/postgres/devel/pgsql-cvs/pgsql/src/test/regress/sql/tsearch.sql,v
retrieving revision 1.9
diff -c -r1.9 tsearch.sql
*** src/test/regress/sql/tsearch.sql	16 May 2008 16:31:02 -	1.9
--- src/test/regress/sql/tsearch.sql	16 Jul 2008 03:45:24 -
***
*** 208,213 
--- 208,253 
  ',
  to_tsquery('english', 'sea&foo'), 'HighlightAll=true');
  
+ --Check if headline fragments work 
+ SELECT ts_headline('english', '
+ Day after day, day after day,
+   We stuck, nor breath nor motion,
+ As idle as a painted Ship
+   Upon a painted Ocean.
+ Water, water, every where
+   And all the boards did shrink;
+ Water, water, every where,
+   Nor any drop to drink.
+ S. T. Coleridge (1772-1834)
+ ', to_tsquery('english', 'ocean'), 'MaxFragments=1');
+ 
+ --Check if more than one fragments are displayed
+ SELECT ts_headline('english', '
+ Day after day, day after day,
+   We stuck, nor breath nor motion,
+ As idle as a painted Ship
+   Upon a painted Ocean.
+ Water, water, every where
+   And all the boards did shrink;
+ Water, water, every where,
+   Nor any drop to drink.
+ S. T. Coleridge (1772-1834)
+ ', to_tsquery('english', 'Coleridge & stuck'), 'MaxFragments=2');
+ 
+ --Fragments when not all query words are in the document
+ SELECT ts_headline('english', '
+ Day after day, day after day,
+   We stuck, nor breath nor motion,
+ As idle as a painted Ship
+   Upon a painted Ocean.
+ Water, water, every where
+   And all the boards did shrink;
+ Water, water, every where,
+   Nor any drop to drink.
+ S. T. Coleridge (1772-1834)
+ ', to_tsquery('english', 'ocean & seahorse'), 'MaxFragments=1');
+ 
+ 
  --Rewrite sub system
  
  CREATE TABLE test_tsquery (txtkeyword TEXT, txtsample TEXT);
Index: src/test/regress/expected/tsearch.out
===
RCS file: /home/postgres/devel/pgsql-cvs/pgsql/src/test/regress/expected/tsearch.out,v
retrieving revision 1.14
diff -c -r1.14 tsearch.out
*** src/test/regress/expected/tsearch.out	16 May 2008 16:31:02 -	1.14
--- src/test/regress/expected/tsearch.out	16 Jul 2008 03:47:46 -
***
*** 632,637 
--- 632,705 
   
  (1 row)
  
+ --Check if headline fragments work 
+ SELECT ts_headline('english', '
+ Day after day, day after day,
+   We stuck, nor breath nor motion,
+ As idle as a painted Ship
+   Upon a painted Ocean.
+ Water, water, every where
+   And all the boards did shrink;
+ Water, water,

Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline

2008-07-16 Thread Sushant Sinha
I will add test queries and their results for the corner cases in a
separate file. I guess the only thing I am confused about is what the
behavior of headline generation should be when query items have words of
size less than ShortWord. I guess the answer is to ignore the ShortWord
parameter, but let me know if the answer is any different.

-Sushant.
 
On Thu, 2008-07-17 at 02:53 +0400, Oleg Bartunov wrote:
> Sushant,
> 
> first, please, provide simple test queries, which demonstrate the right work
> in the corner cases. This will helps reviewers to test your patch and
> helps you to make sure your new version is ok. For example:
> 
> =# select ts_headline('1 2 3 4 5 1 2 3 1','1&3'::tsquery);
>   ts_headline
> --
>   1 2 3 4 5 1 2 3 1
> 
> This select breaks your code:
> 
> =# select ts_headline('1 2 3 4 5 1 2 3 1','1&3'::tsquery,'maxfragments=2');
>   ts_headline
> --
>   ...  2 ...
> 
> and so on 
> 
> 
> Oleg
> On Tue, 15 Jul 2008, Sushant Sinha wrote:
> 
> > Attached a new patch that:
> >
> > 1. fixes previous bug
> > 2. better handles the case when cover size is greater than the MaxWords.
> > Basically it divides a cover greater than MaxWords into fragments of
> > MaxWords, resizes each such fragment so that each end of the fragment
> > contains a query word and then evaluates best fragments based on number of
> > query words in each fragment. In case of tie it picks up the smaller
> > fragment. This allows more query words to be shown with multiple fragments
> > in case a single cover is larger than the MaxWords.
> >
> > The resizing of a  fragment such that each end is a query word provides room
> > for stretching both sides of the fragment. This (hopefully) better presents
> > the context in which query words appear in the document. If a cover is
> > smaller than MaxWords then the cover is treated as a fragment.
> >
> > Let me know if you have any more suggestions or anything is not clear.
> >
> > I have not yet added the regression tests. The regression test suite seemed
> > to be only ensuring that the function works. How many tests should I be
> > adding? Is there any other place that I need to add different test cases for
> > the function?
> >
> > -Sushant.
> >
> >
> > Nice. But it will be good to resolve following issues:
> >> 1) Patch contains mistakes, I didn't investigate or carefully read it. Get
> >> http://www.sai.msu.su/~megera/postgres/fts/apod.dump.gz and
> >>  load it in the db.
> >>
> >> Queries
> >> # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1')
> >> from apod where to_tsvector(body) @@ plainto_tsquery('black hole');
> >>
> >> and
> >>
> >> # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1')
> >> from apod;
> >>
> >> crash postgresql :(
> >>
> >> 2) pls, include in your patch documentation and regression tests.
> >>
> >>
> >>> Another change that I was thinking:
> >>>
> >>> Right now if cover size > max_words then I just cut the trailing words.
> >>> Instead I was thinking that we should split the cover into more
> >>> fragments such that each fragment contains a few query words. Then each
> >>> fragment will not contain all query words but will show more occurrences
> >>> of query words in the headline. I would  like to know what your opinion
> >>> on this is.
> >>>
> >>
> >> Agreed.
> >>
> >>
> >> --
> >> Teodor Sigaev   E-mail: [EMAIL PROTECTED]
> >>   WWW:
> >> http://www.sigaev.ru/
> >>
> >
> 
>   Regards,
>   Oleg
> _
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] small bug in hlCover

2008-07-16 Thread Sushant Sinha
I think there is a slight bug in the hlCover function in wparser_def.c.

If there is only one query item and that is the first word in the text,
then hlCover does not return any cover. This is evident in this example,
where ts_headline generates only min_words:

testdb=# select ts_headline('1 2 3 4 5 6 7 8 9 10','1'::tsquery,
'MinWords=5');
   ts_headline
--
 1 2 3 4 5
(1 row)

The problem is that *q is initialized to 0, which is a legitimate value
for a cover. I have attached a patch that fixes it; after applying
the patch, here is the result.

testdb=# select ts_headline('1 2 3 4 5 6 7 8 9 10','1'::tsquery,
'MinWords=5');
 ts_headline 
-
 1 2 3 4 5 6 7 8 9 10
(1 row)

-Sushant.
Index: src/backend/tsearch/wparser_def.c
===
RCS file: /home/postgres/devel/pgsql-cvs/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.15
diff -c -r1.15 wparser_def.c
*** src/backend/tsearch/wparser_def.c	17 Jun 2008 16:09:06 -	1.15
--- src/backend/tsearch/wparser_def.c	17 Jul 2008 02:45:34 -
***
*** 1621,1627 
  	QueryItem  *item = GETQUERY(query);
  	int			pos = *p;
  
! 	*q = 0;
  	*p = 0x7fff;
  
  	for (j = 0; j < query->size; j++)
--- 1621,1627 
  	QueryItem  *item = GETQUERY(query);
  	int			pos = *p;
  
! 	*q = -1;
  	*p = 0x7fff;
  
  	for (j = 0; j < query->size; j++)
***
*** 1643,1649 
  		item++;
  	}
  
! 	if (*q == 0)
  		return false;
  
  	item = GETQUERY(query);
--- 1643,1649 
  		item++;
  	}
  
! 	if (*q < 0)
  		return false;
  
  	item = GETQUERY(query);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline

2008-07-17 Thread Sushant Sinha
Fixed some off-by-one errors pointed out by Oleg, and errors in excluding
overlapping fragments.
 
Also adding test queries and updating regression tests.

Let me know of any other changes that are needed.

-Sushant.



On Thu, 2008-07-17 at 03:28 +0400, Oleg Bartunov wrote:
> On Wed, 16 Jul 2008, Sushant Sinha wrote:
> 
> > I will add test queries and their results for the corner cases in a
> > separate file. I guess the only thing I am confused about is what should
> > be the behavior of headline generation when Query items have words of
> > size less than ShortWord. I guess the answer is to ignore ShortWord
> > parameter but let me know if the answer is any different.
> >
> 
> ShortWord is about headline text, it doesn't affects words in query,
> so you can't discard them from query.
> 
> > -Sushant.
> >
> > On Thu, 2008-07-17 at 02:53 +0400, Oleg Bartunov wrote:
> >> Sushant,
> >>
> >> first, please, provide simple test queries, which demonstrate the right 
> >> work
> >> in the corner cases. This will helps reviewers to test your patch and
> >> helps you to make sure your new version is ok. For example:
> >>
> >> =# select ts_headline('1 2 3 4 5 1 2 3 1','1&3'::tsquery);
> >>   ts_headline
> >> --
> >>   1 2 3 4 5 1 2 3 1
> >>
> >> This select breaks your code:
> >>
> >> =# select ts_headline('1 2 3 4 5 1 2 3 1','1&3'::tsquery,'maxfragments=2');
> >>   ts_headline
> >> --
> >>   ...  2 ...
> >>
> >> and so on 
> >>
> >>
> >> Oleg
> >> On Tue, 15 Jul 2008, Sushant Sinha wrote:
> >>
> >>> Attached a new patch that:
> >>>
> >>> 1. fixes previous bug
> >>> 2. better handles the case when cover size is greater than the MaxWords.
> >>> Basically it divides a cover greater than MaxWords into fragments of
> >>> MaxWords, resizes each such fragment so that each end of the fragment
> >>> contains a query word and then evaluates best fragments based on number of
> >>> query words in each fragment. In case of tie it picks up the smaller
> >>> fragment. This allows more query words to be shown with multiple fragments
> >>> in case a single cover is larger than the MaxWords.
> >>>
> >>> The resizing of a  fragment such that each end is a query word provides 
> >>> room
> >>> for stretching both sides of the fragment. This (hopefully) better 
> >>> presents
> >>> the context in which query words appear in the document. If a cover is
> >>> smaller than MaxWords then the cover is treated as a fragment.
> >>>
> >>> Let me know if you have any more suggestions or anything is not clear.
> >>>
> >>> I have not yet added the regression tests. The regression test suite 
> >>> seemed
> >>> to be only ensuring that the function works. How many tests should I be
> >>> adding? Is there any other place that I need to add different test cases 
> >>> for
> >>> the function?
> >>>
> >>> -Sushant.
> >>>
> >>>
> >>> Nice. But it will be good to resolve following issues:
> >>>> 1) Patch contains mistakes, I didn't investigate or carefully read it. 
> >>>> Get
> >>>> http://www.sai.msu.su/~megera/postgres/fts/apod.dump.gz and
> >>>>  load it in the db.
> >>>>
> >>>> Queries
> >>>> # select ts_headline(body, plainto_tsquery('black hole'), 
> >>>> 'MaxFragments=1')
> >>>> from apod where to_tsvector(body) @@ plainto_tsquery('black hole');
> >>>>
> >>>> and
> >>>>
> >>>> # select ts_headline(body, plainto_tsquery('black hole'), 
> >>>> 'MaxFragments=1')
> >>>> from apod;
> >>>>
> >>>> crash postgresql :(
> >>>>
> >>>> 2) pls, include in your patch documentation and regression tests.
> >>>>
> >>>>
> >>>>> Another change that I was thinking:
> >>>>>
> >>>>> Right now if cover size > max_words then I just cut the trailing words.
> >>>>> Inste

Re: [HACKERS] phrase search

2008-07-18 Thread Sushant Sinha
I looked at query operators for tsquery and here are some of the new
query operators for position based queries. I am just proposing some
changes and the questions I have.

1. What is the meaning of such a query operator?

foo #5 bar -> true if the document has the word "foo" followed by "bar"
exactly 5 positions later.

foo #<5 bar -> true if the document has the word "foo" followed by "bar"
within 5 positions.

foo #>5 bar -> true if the document has the word "foo" followed by "bar"
more than 5 positions later.

then some other ways it can be used are:
!(foo #<5 bar) -> true if the document never has "foo" followed by "bar"
within 5 positions.

etc.
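
As hypothetical examples of the proposed syntax (none of these operators
exist, so these queries would not parse today):

-- 'foo' followed by 'bar' exactly 5 positions later
select to_tsvector('simple', 'foo a b c d bar') @@ 'foo #5 bar'::tsquery;
-- 'foo' followed by 'bar' within 5 positions
select to_tsvector('simple', 'foo a bar') @@ 'foo #<5 bar'::tsquery;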

2. How to implement such query operators?

Should we modify QueryItem to include additional distance information or
is there any other way to accomplish it?

Is the following list sufficient to accomplish this?
a. Modify to_tsquery
b. Modify TS_execute in tsvector_op.c to check new operator

Is there anything needed in rewrite subsystem?

3. Are these valid uses of the operators, and if yes, what would they
mean?

foo #5 (bar & cup)

If not, should the operator be applied to only two QI_VAL's?

4. If the operator only applies to two query items, can we create an
index such that (foo, bar) -> documents[min distance, max distance]?
How difficult is it to implement an index like this?


Thanks,
-Sushant.

On Thu, 2008-06-05 at 19:37 +0400, Teodor Sigaev wrote:
> > I can add index support and support for arbitrary distance between
> > lexeme. 
> > It appears to me that supporting arbitrary boolean expression will be
> > complicated. Can we pull out something from TSQuery?
> 
> I don't very like an idea to have separated interface for phrase search. Your 
> patch may be a module and used by people who really wants to have a phrase 
> search.
> 
> Introducing new operator in tsquery allows to use already existing 
> infrastructure of tsquery such as concatenations (&&, ||, !!), rewrite 
> subsystem 
> etc.  But new operation/types specially designed for phrase search makes 
> needing 
> to make that work again.
> 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline

2008-07-23 Thread Sushant Sinha
I guess it is more readable to add the cover separator at the end of a
fragment than at the front. Let me know what you think and I can update it.

I think the right place for the cover separator is in the structure
HeadlineParsedText, just like startsel and stopsel. This will enable users
to specify their own cover separators. But this will require changes to
the structure as well as to the generateHeadline function. This option
will also not play well with the default headline generation function.

The default MaxWords = 35 seems a bit high for this headline generation
function, and 20 seems more reasonable. Any thoughts?

-Sushant.

On Wed, Jul 23, 2008 at 7:44 AM, Oleg Bartunov <[EMAIL PROTECTED]> wrote:

> btw, is it intentional to have '' in headline ?
>
> =# select ts_headline('1 2 3 4 5 1 2 3 1','1&4'::tsquery,'MaxFragments=1');
>   ts_headline
> -
>  ... 4 5 1
>
>
>
> Oleg
>
> On Wed, 23 Jul 2008, Teodor Sigaev wrote:
>
>  Let me know of any other changes that are needed.
>>>
>>
>> Looks like ready to commit, but documentation is needed.
>>
>>
>>
>Regards,
>Oleg
> _
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: [EMAIL PROTECTED], 
> http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83
>


Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline

2008-08-02 Thread Sushant Sinha
Sorry for the delay. Here is the patch with the FragmentDelimiter option.
It requires an extra field in HeadlineParsedText and uses that field
during generateHeadline.

Implementing a notion of fragments in HeadlineParsedText, with a separate
function to join them, seems more complicated. So for the time being I
just emit a FragmentDelimiter whenever a new fragment (other than the
first one) starts.
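
Usage would look like this (a sketch with the patch applied; the
delimiter value is arbitrary):

select ts_headline('english',
       'Water, water, every where, Nor any drop to drink.',
       to_tsquery('english', 'water & drink'),
       'MaxFragments=2, FragmentDelimiter=" +++ "');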

The patch also contains the updated regression tests/results and also a
new test for FragmentDelimiter option. It also contains the
documentation for the new options.

I have also attached a separate file that tests different aspects of the
new headline generation function.

Let me know if anything else is needed.

-Sushant.

On Thu, 2008-07-24 at 00:28 +0400, Oleg Bartunov wrote:
> On Wed, 23 Jul 2008, Sushant Sinha wrote:
> 
> > I guess it is more readable to add cover separator at the end of a fragment
> > than in the front. Let me know what you think and I can update it.
> 
> FragmentsDelimiter should *separate* fragments and that says all. 
> Not very difficult algorithmic problem, it's like  perl's
> join(FragmentsDelimiter, @array)
> 
> >
> > I think the right place for cover separator is in the structure
> > HeadlineParsedText just like startsel and stopsel. This will enable users to
> > specify their own cover separators. But this will require changes to the
> > structure as well as to the generateHeadline function. This option will not
> > also play well with the default headline generation function.
> 
> As soon as we introduce FragmentsDelimiter we should make it
> configurable.
> 
> >
> > The default MaxWords = 35 seems a bit high for this headline generation
> > function and 20 seems to be more reasonable. Any thoughts?
> 
> I think we should not change default value because it could change
> behaviour of existing applications. I'm not sure if it'd be useful and
> possible to define default values in CREATE TEXT SEARCH PARSER
> 
> >
> > -Sushant.
> >
> > On Wed, Jul 23, 2008 at 7:44 AM, Oleg Bartunov <[EMAIL PROTECTED]> wrote:
> >
> >> btw, is it intentional to have '' in headline ?
> >>
> >> =# select ts_headline('1 2 3 4 5 1 2 3 1','1&4'::tsquery,'MaxFragments=1');
> >>   ts_headline
> >> -
> >>  ... 4 5 1
> >>
> >>
> >>
> >> Oleg
> >>
> >> On Wed, 23 Jul 2008, Teodor Sigaev wrote:
> >>
> >>  Let me know of any other changes that are needed.
> >>>>
> >>>
> >>> Looks like ready to commit, but documentation is needed.
> >>>
> >>>
> >>>
> >>Regards,
> >>Oleg
> >> _
> >> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> >> Sternberg Astronomical Institute, Moscow University, Russia
> >> Internet: [EMAIL PROTECTED], 
> >> http://www.sai.msu.su/~megera/<http://www.sai.msu.su/%7Emegera/>
> >> phone: +007(495)939-16-83, +007(495)939-23-83
> >>
> >
> 
>   Regards,
>   Oleg
> _
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83
Index: src/include/tsearch/ts_public.h
===
RCS file: /home/postgres/devel/pgsql-cvs/pgsql/src/include/tsearch/ts_public.h,v
retrieving revision 1.10
diff -c -r1.10 ts_public.h
*** src/include/tsearch/ts_public.h	18 Jun 2008 18:42:54 -	1.10
--- src/include/tsearch/ts_public.h	2 Aug 2008 02:40:27 -
***
*** 52,59 
--- 52,61 
  	int4		curwords;
  	char	   *startsel;
  	char	   *stopsel;
+ 	char	   *fragdelim;
  	int2		startsellen;
  	int2		stopsellen;
+ 	int2		fragdelimlen; 
  } HeadlineParsedText;
  
  /*
Index: src/backend/tsearch/wparser_def.c
===
RCS file: /home/postgres/devel/pgsql-cvs/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.15
diff -c -r1.15 wparser_def.c
*** src/backend/tsearch/wparser_def.c	17 Jun 2008 16:09:06 -	1.15
--- src/backend/tsearch/wparser_def.c	2 Aug 2008 15:25:46 -
***
*** 1684,1701 
  	return false;
  }
  
! Datum
! prsd_headline(PG_FUNCTION_ARGS)
  {
! 	HeadlineParsedText *prs = (HeadlineParsedText *) PG_GETARG_POINTER(0);
! 	Li

Re: [HACKERS] small bug in hlCover

2008-08-03 Thread Sushant Sinha
Has anyone noticed this?

-Sushant.

On Wed, 2008-07-16 at 23:01 -0400, Sushant Sinha wrote:
> I think there is a slight bug in hlCover function in wparser_def.c
> 
> If there is only one query item and that is the first word in the text,
> then hlCover does not returns any cover. This is evident in this example
> when ts_headline only generates the min_words:
> 
> testdb=# select ts_headline('1 2 3 4 5 6 7 8 9 10','1'::tsquery,
> 'MinWords=5');
>ts_headline
> --
>  1 2 3 4 5
> (1 row)
> 
> The problem is that *q is initialized to 0 which is a legitimate value
> for a cover. So I have attached a patch that fixes it and after applying
> the patch here is the result.
> 
> testdb=# select ts_headline('1 2 3 4 5 6 7 8 9 10','1'::tsquery,
> 'MinWords=5');
>  ts_headline 
> -
>  1 2 3 4 5 6 7 8 9 10
> (1 row)
> 
> -Sushant.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] small bug in hlCover

2008-08-03 Thread Sushant Sinha
On Mon, 2008-08-04 at 00:36 -0300, Euler Taveira de Oliveira wrote:
> Sushant Sinha escreveu:
> > I think there is a slight bug in hlCover function in wparser_def.c
> > 
> The bug is not in the hlCover. In prsd_headline, if we didn't find a 
> suitable bestlen (i.e. >= 0), than it includes up to document length or 
> *maxWords* (here is the bug). I'm attaching a small patch that fixes it 
> and some comment typos. Please apply it to 8_3_STABLE too.

Well, hlCover's purpose is to find a cover, and for the document '1 2 3 4 5
6 7 8 9 10' and the query '1'::tsquery, a cover exists. So it should
report it.

In my source I see that prsd_headline marks only min_words, which seems
like the right thing to do.

-Sushant.

> 
> euler=# select ts_headline('1 2 3 4 5 6 7 8 9 10','1'::tsquery, 
> 'MinWords=5');
>   ts_headline
> -
>   1 2 3 4 5 6 7 8 9 10
> (1 registro)
> 
> euler=# select ts_headline('1 2 3 4 5 6 7 8 9 10','1'::tsquery);
>   ts_headline
> -
>   1 2 3 4 5 6 7 8 9 10
> (1 registro)
> 
> 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] text search patch status update?

2008-09-15 Thread Sushant Sinha
Any status updates on the following patches?

1. Fragments in tsearch2 headlines:
http://archives.postgresql.org/pgsql-hackers/2008-08/msg00043.php

2. Bug in hlCover:
http://archives.postgresql.org/pgsql-hackers/2008-08/msg00089.php

-Sushant.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] text search patch status update?

2008-09-16 Thread Sushant Sinha
Patch #1. Teodor was fine with the previous version of the patch. After that
I modified it slightly to allow a FragmentDelimiter option and Teodor may
have to look at that.

Patch #2. I think this is a straightforward bug fix.

-Sushant.

On Tue, Sep 16, 2008 at 11:27 AM, Alvaro Herrera <[EMAIL PROTECTED]
> wrote:

> Sushant Sinha escribió:
> > Any status updates on the following patches?
> >
> > 1. Fragments in tsearch2 headlines:
> > http://archives.postgresql.org/pgsql-hackers/2008-08/msg00043.php
> >
> > 2. Bug in hlCover:
> > http://archives.postgresql.org/pgsql-hackers/2008-08/msg00089.php
>
> Are these ready for review?  If so, please add them to this commitfest,
> http://wiki.postgresql.org/wiki/CommitFest:2008-09
>
> --
> Alvaro Herrera
> http://www.CommandPrompt.com/
> PostgreSQL Replication, Consulting, Custom Development, 24x7 support
>


Re: [HACKERS] text search patch status update?

2009-01-07 Thread Sushant Sinha
The default headline generation function is complicated. It checks a lot
of cases to determine the best headline to be displayed. So Heikki's
examples just show that the headline generation function may not be very
intuitive. However, his examples were not affected by the bug.

Because of the bug, hlCover was not returning a cover when the query
item was the first lexeme in the text. And so the headline generation
function would return just MinWords rather than the actual headline
called for by the logic.

After the patch you will see the difference in the example:

http://archives.postgresql.org/pgsql-hackers/2008-07/msg00785.php

-Sushant.

On Wed, 2009-01-07 at 20:50 -0500, Bruce Momjian wrote:
> Uh, where are we on this?  I see the same output in CVS HEAD as Heikki,
> and I assume he thought at least one of them was wrong.  ;-)
> 
> ---
> 
> Heikki Linnakangas wrote:
> > Sushant Sinha wrote:
> > > Patch #2. I think this is a straigt forward bug fix.
> > 
> > Yes, I think you're right. In hlCover(), *q is 0 when the only match is 
> > the first item in the text, and we shouldn't bail out with "return 
> > false" in that case.
> > 
> > But there seems to be something else going on here as well:
> > 
> > postgres=# select ts_headline('1 2 3 4 5', '2'::tsquery, 'MinWords=2, 
> > MaxWords=3');
> >   ts_headline
> > --
> >   2 3 4
> > (1 row)
> > 
> > postgres=# select ts_headline('aaa1 aaa2 aaa3 aaa4 
> > aaa5','aaa2'::tsquery, 'MinWords=2, MaxWords=3');
> > ts_headline
> > --
> >   aaa2 aaa3
> > (1 row)
> > 
> > In the first example, you get three words, and in the 2nd, just two. It 
> > must be because of the default ShortWord setting of 3. Also, if only the 
> > last word matches, and it's a "short word", you get the whole text:
> > 
> > postgres=# select ts_headline('1 2 3 4 5','5'::tsquery, 'MinWords=2, 
> > MaxWords=3');
> > ts_headline
> > --
> >   1 2 3 4 5
> > (1 row)
> > 
> > -- 
> >Heikki Linnakangas
> >EnterpriseDB   http://www.enterprisedb.com
> > 
> > -- 
> > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> > To make changes to your subscription:
> > http://www.postgresql.org/mailpref/pgsql-hackers
> 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Very bad FTS performance with the Polish config

2009-11-18 Thread Sushant Sinha
ts_headline calls the ts_lexize equivalent to break up the text. Of course
there is an algorithm to process the tokens and generate the headline. I
would be really surprised if the algorithm that generates the headline were
somehow dependent on the language (as it only processes the tokens). So
Oleg is right when he says ts_lexize is the thing to check.

I will try to replicate what you are trying to do, but in the meantime can
you run the same ts_headline under psql multiple times and paste the results?
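
A simple way to collect numbers under psql (a sketch; the 'polish'
configuration name and the table are assumptions based on the report):

\timing
select ts_headline('polish', body, plainto_tsquery('polish', 'lorem ipsum'))
from docs limit 1;

If only the first run in a session is slow, per-backend dictionary
loading rather than the headline algorithm is the likely cost.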

-Sushant.

2009/11/19 Wojciech Knapik 

>
> Oleg Bartunov wrote:
>
>  Yes, for 4-word texts the results are similar.
>>> Try that with a longer text and the difference becomes more and more
>>> significant. For the lorem ipsum text, 'polish' is about 4 times slower,
>>> than 'english'. For 5 repetitions of the text, it's 6 times, for 10
>>> repetitions - 7.5 times...
>>>
>>
>> Again, I see nothing unclear here, since dictionaries (as specified
>> in configuration) apply to ALL words in document. The more words in
>> document, the more overhead.
>>
>
> You're missing the point. I'm not surprised that the function takes more
> time for larger input texts - that's obvious. The thing is, the computation
> times rise more steeply when the Polish config is used. Steeply enough, that
> the difference between the Polish and English configs becomes enormous in
> practical cases.
>
> Now this may be expected behaviour, but since I don't know if it is, I
> posted to the mailing lists to find out. If you're saying this is ok and
> there's nothing to fix here, then there's nothing more to discuss and we may
> consider the thread closed.
> If not, ts_headline deserves a closer look.
>
> cheers,
> Wojciech Knapik
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline

2008-05-24 Thread Sushant Sinha
Now I understand the code much better. A few more questions on headline
generation that I was not able to answer from the code:

1. Why is hlparsetext used to parse the document rather than the
parsetext function? Since words to be included in the headline will be
marked afterwards, it seems more reasonable to just use the parsetext
function.

The main difference I see is the use of hlfinditem and marking whether
some word is repeated.

The reason this is important is that hlparsetext does not seem to be
storing word positions which parsetext does. The word positions are
important for generating headline with fragments.

2.
> I would prefer the signature ts_headline( [regconfig,] text, tsquery
>[,text] )and function should accept 'NumFragments=>N' for default
>parser. Another parsers may use another options.

Does this mean we want a unified function ts_headline that triggers the
fragments if NumFragments is specified? It seems that introducing a new
function that can take a configuration OID or name is complex, as there
are so many functions handling these issues in wparser.c.

If this is true then we need to just  add marking of headline words in
prsd_headline. Otherwise we will need another prsd_headline_with_covers
function.

3. In many cases people may already have a TSVector for a given document
(for the search operation). Would it be faster to pass the TSVector to
the headline function compared to computing the TSVector each time? If
so, should we have an option to pass a TSVector to the headline
function?

-Sushant.

On Sat, 2008-05-24 at 07:57 +0400, Teodor Sigaev wrote:
> [moved to -hackers, because talk is about implementation details]
> 
> > I've ported the patch of Sushant Sinha for fragmented headlines to pg8.3.1
> > (http://archives.postgresql.org/pgsql-general/2007-11/msg00508.php)
> Thank you.
> 
> 1 > diff -Nrub postgresql-8.3.1-orig/contrib/tsearch2/tsearch2.c
> now contrib/tsearch2 is compatibility layer for old applications - they don't
> know about new features. So, this part isn't needed.
> 
> 2 solution to compile function (ts_headline_with_fragments)  into core, but
> using it only from contrib module looks very odd. So, new feature can be used
> only with compatibility layer for old release :)
> 
> 3 headline_with_fragments() is hardcoded to use default parser, but what will 
> be
> in case when configuration uses another parser? For example, for japanese 
> language.
> 
> 4 I would prefer the signature ts_headline( [regconfig,] text, tsquery 
> [,text] )
> and function should accept 'NumFragments=>N' for default parser. Another 
> parsers
> may use another options.
> 
> 5 it just doesn't work correctly, because new code doesn't care of parser
> specific type of lexemes.
> contrib_regression=# select headline_with_fragments('english', 'wow asd-wow
> wow', 'asd', '');
>   headline_with_fragments
> --
>   ...wow asd-wowasd-wow wow
> (1 row)
> 
> 
> So, I incline to use existing framework/infrastructure although it may be a
> subject to change.
> 
> Some description:
> 1 ts_headline defines a correct parser to use
> 2 it calls hlparsetext to split text into structure suitable for both goals:
> find the best fragment(s) and concatenate that fragment(s) back to the text
> representation
> 3 it calls parser specific method   prsheadline which works with preparsed 
> text
> (parse was done in hlparsetext). Method should mark a needed
> words/parts/lexemes etc.
> 4 ts_headline glues fragments into text and returns that.
> 
> We need a parser's headline method because only parser knows all about its 
> lexemes.
> 
> 
> -- 
> Teodor Sigaev   E-mail: [EMAIL PROTECTED]
> WWW: http://www.sigaev.ru/
> 
> 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline

2008-05-31 Thread Sushant Sinha
I have attached a new patch with respect to the current cvs head. This
produces a headline in a document for a given query. Basically it
identifies fragments of text that contain the query and displays them.

DESCRIPTION

 HeadlineParsedText contains an array of the actual words but no
information about the norms. We need an indexed position vector for each
norm so that we can quickly evaluate a number of possible fragments.
That is something tsvector provides.

So this patch changes HeadlineParsedText to contain the norms
(ParsedText). This field is updated while parsing in hlparsetext. The
position information of the norms corresponds to the position of words
in HeadlineParsedText (not to the norms positions as is the case in
tsvector). This works correctly with the current parser. If you think
there may be issues with other parsers please let me know.

This approach does not change any other interface and fits nicely with
the overall framework.

The norms are converted into tsvector and a number of covers are
generated. The best covers are then chosen to be in the headline. The
covers are separated using a hardcoded coversep. Let me know if you want
to expose this as an option.

Covers that overlap with already chosen covers are excluded.

Some options like ShortWord and MinWords are not taken care of right
now. MaxWords is used as the maxcoversize. Let me know if you would like
to see other options for fragment generation as well.

Let me know any more changes you would like to see.

-Sushant.

On Tue, 2008-05-27 at 13:30 +0400, Teodor Sigaev wrote:
> Hi!
> 
> > 1. Why is hlparsetext used to parse the document rather than the
> > parsetext function? Since  words to be included in the headline will be
> > marked afterwords, it seems more reasonable to just use the parsetext
> > function.
> > The main difference I see is the use of hlfinditem and marking whether
> > some word is repeated.
> hlparsetext preserves any kind of lexeme - not indexed, spaces etc. parsetext 
> doesn't.
> hlparsetext preserves original form of lexemes. parsetext doesn't.
> 
> > 
> > The reason this is important is that hlparsetext does not seem to be
> > storing word positions which parsetext does. The word positions are
> > important for generating headline with fragments.
> Doesn't needed - hlparsetext preserves the whole text, so, position is a 
> number 
> of array.
> 
> > 
> > 2.
> >> I would prefer the signature ts_headline( [regconfig,] text, tsquery
> >> [,text] )and function should accept 'NumFragments=>N' for default
> >> parser. Another parsers may use another options.
> > 
> > Does this mean we want a unified function ts_headline and we trigger the
> > fragments if NumFragments is specified? 
> 
> Trigger should be inside parser-specific function (pg_ts_parser.prsheadline). 
> Another parsers might not recognize that option.
> 
> > It seems that introducing a new
> > function which can take configuration OID, or name is complex as there
> > are so many functions handling these issues in wparser.c.
> No, of course - ts_headline takes care about finding configuration and 
> calling 
> correct parser.
> 
> > 
> > If this is true then we need to just  add marking of headline words in
> > prsd_headline. Otherwise we will need another prsd_headline_with_covers
> > function.
> Yeah, pg_ts_parser.prsheadline should mark the lexemes too. It can even
> change the array of HeadlineParsedText.
> 
> > 
> > 3. In many cases people may already have TSVector for a given document
> > (for search operation). Would it be faster to pass TSVector to headline
> > function when compared to computing TSVector each time? If that is the
> > case then should we have an option to pass TSVector to headline
> > function?
> As I mentioned above, tsvector doesn't contain the whole information about the text.
> 
Index: src/backend/tsearch/dict.c
===
RCS file: /home/sushant/devel/pgsql-cvs/pgsql/src/backend/tsearch/dict.c,v
retrieving revision 1.5
diff -u -r1.5 dict.c
--- src/backend/tsearch/dict.c	25 Mar 2008 22:42:43 -	1.5
+++ src/backend/tsearch/dict.c	30 May 2008 23:20:57 -
@@ -16,6 +16,7 @@
 #include "catalog/pg_type.h"
 #include "tsearch/ts_cache.h"
 #include "tsearch/ts_utils.h"
+#include "tsearch/ts_public.h"
 #include "utils/builtins.h"
 
 
Index: src/backend/tsearch/to_tsany.c
===
RCS file: /home/sushant/devel/pgsql-cvs/pgsql/src/backend/tsearch/to_tsany.c,v
retrieving revision 1.12
diff -u -r1.12 to_tsany.c
--- src/backend/tsearch/to_tsany.c	16 May 2008 16:31:01 -	1.12
+++ src/backend/tsearch/to_tsany.c	31 May 2008 08:43:27 -
@@ -15,6 +15,7 @@
 
 #include "catalog/namespace.h"
 #include "tsearch/ts_cache.h"
+#include "tsearch/ts_public.h"
 #include "tsearch/ts_utils.h"
 #include "utils/builtins.h"
 #include "utils/syscache.h"
Index: src/backend/tsearch/ts_parse.c
==

[HACKERS] phrase search

2008-05-31 Thread Sushant Sinha
I have attached a patch for phrase search with respect to the cvs head.
Basically it takes a phrase (text) and a TSVector. It checks whether the
relative positions of the lexemes in the phrase are the same as their
positions in the TSVector.

If the configuration for text search is "simple", then this will produce
an exact phrase search. Otherwise the stopwords in a phrase will be ignored
and the words in the phrase will only be matched against the stemmed lexemes.
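
As a usage sketch (assuming the patch's is_phrase_present(text, tsvector)
is exposed at the SQL level and returns a boolean; the sample sentence is
made up):

select is_phrase_present('fox jumps',
    to_tsvector('english', 'the quick brown fox jumps over the lazy dog'));
-- expected to be true: 'fox' and 'jump' keep the same relative positions
-- in the tsvector as in the phrase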

For my application I am using this as a separate shared object. I do not
know how to expose this function from the core. Can someone explain how
to do this?

I saw this discussion on phrase search and I am not sure what other
functionality is required.

http://archives.postgresql.org/pgsql-general/2008-02/msg01170.php

-Sushant.
Index: src/backend/utils/adt/Makefile
===
RCS file: /home/sushant/devel/pgsql-cvs/pgsql/src/backend/utils/adt/Makefile,v
retrieving revision 1.69
diff -u -r1.69 Makefile
--- src/backend/utils/adt/Makefile	19 Feb 2008 10:30:08 -	1.69
+++ src/backend/utils/adt/Makefile	31 May 2008 19:57:34 -
@@ -29,7 +29,7 @@
 	tsginidx.o tsgistidx.o tsquery.o tsquery_cleanup.o tsquery_gist.o \
 	tsquery_op.o tsquery_rewrite.o tsquery_util.o tsrank.o \
 	tsvector.o tsvector_op.o tsvector_parser.o \
-	txid.o uuid.o xml.o
+	txid.o uuid.o xml.o phrase_search.o
 
 like.o: like.c like_match.c
 
Index: src/backend/utils/adt/phrase_search.c
===
RCS file: src/backend/utils/adt/phrase_search.c
diff -N src/backend/utils/adt/phrase_search.c
--- /dev/null	1 Jan 1970 00:00:00 -
+++ src/backend/utils/adt/phrase_search.c	31 May 2008 19:56:59 -
@@ -0,0 +1,167 @@
+#include "postgres.h"
+
+#include "tsearch/ts_type.h"
+#include "tsearch/ts_utils.h"
+
+#include "fmgr.h"
+
+#ifdef PG_MODULE_MAGIC
+PG_MODULE_MAGIC;
+#endif
+
+PG_FUNCTION_INFO_V1(is_phrase_present);
+Datum is_phrase_present(PG_FUNCTION_ARGS);
+
+typedef struct {
+	WordEntryPosVector 	*posVector;
+	int4	posInPhrase;
+	int4 			curpos;	
+} PhraseInfo;
+
+static int
+WordCompareVectorEntry(char *eval, WordEntry *ptr, ParsedWord *prsdword)
+{
+	if (ptr->len == prsdword->len)
+		return strncmp(
+	   eval + ptr->pos,
+	   prsdword->word,
+	   prsdword->len);
+
+	return (ptr->len > prsdword->len) ? 1 : -1;
+}
+
+/*
+ * Returns a pointer to a WordEntry from tsvector t corresponding to prsdword. 
+ * Returns NULL if not found.
+ */
+static WordEntry *
+find_wordentry_prsdword(TSVector t, ParsedWord *prsdword)
+{
+	WordEntry  *StopLow = ARRPTR(t);
+	WordEntry  *StopHigh = (WordEntry *) STRPTR(t);
+	WordEntry  *StopMiddle;
+	int			difference;
+
+	/* Loop invariant: StopLow <= item < StopHigh */
+
+	while (StopLow < StopHigh)
+	{
+		StopMiddle = StopLow + (StopHigh - StopLow) / 2;
+		difference = WordCompareVectorEntry(STRPTR(t), StopMiddle, prsdword);
+		if (difference == 0)
+			return StopMiddle;
+		else if (difference < 0)
+			StopLow = StopMiddle + 1;
+		else
+			StopHigh = StopMiddle;
+	}
+
+	return NULL;
+}
+
+
+static int4 
+check_and_advance(int4 i, PhraseInfo *phraseInfo)
+{
+ 	WordEntryPosVector *posvector1, *posvector2;
+	int4 diff;
+
+	posvector1 = phraseInfo[i].posVector;
+	posvector2 = phraseInfo[i+1].posVector;
+	
+	diff = phraseInfo[i+1].posInPhrase - phraseInfo[i].posInPhrase;
+	while (posvector2->pos[phraseInfo[i+1].curpos] - posvector1->pos[phraseInfo[i].curpos] < diff)
+		if (phraseInfo[i+1].curpos >= posvector2->npos - 1)
+			return 2;
+		else
+			phraseInfo[i+1].curpos += 1;
+
+	if (posvector2->pos[phraseInfo[i+1].curpos] - posvector1->pos[phraseInfo[i].curpos] == diff)
+		return 1;
+	else
+		return 0;
+}
+
+int4
+initialize_phraseinfo(ParsedText *prs, TSVector t, PhraseInfo *phraseInfo)
+{
+	WordEntry *entry;
+	int4 i;
+
+	for (i = 0; i < prs->curwords; i++)
+	{
+		phraseInfo[i].posInPhrase = prs->words[i].pos.pos;
+		entry = find_wordentry_prsdword(t, &(prs->words[i]));
+		if (entry == NULL)
+			return 0;
+		else
+			phraseInfo[i].posVector = _POSVECPTR(t, entry);
+	}			
+	return 1;
+}
+Datum
+is_phrase_present(PG_FUNCTION_ARGS)
+{
+	ParsedText	prs;
+	int4		numwords, i, retval, found = 0;
+	PhraseInfo  *phraseInfo;
+	text	*phrase	= PG_GETARG_TEXT_P(0);
+	TSVector 	t	= PG_GETARG_TSVECTOR(1);
+	Oid		cfgId	= getTSCurrentConfig(true);
+
+	prs.lenwords = (VARSIZE(phrase) - VARHDRSZ) / 6;	/* just an estimate of the number of words */
+	if (prs.lenwords == 0)
+		prs.lenwords = 2;
+	prs.curwords = 0;
+	prs.pos = 0;
+	prs.words = (ParsedWord *) palloc0(sizeof(ParsedWord) * prs.lenwords);
+
+	parsetext(cfgId, &prs, VARDATA(phrase), VARSIZE(phrase) - VARHDRSZ);
+
+	// allocate & initialize 
+	numwords 	= prs.curwords;
+	phraseInfo	= palloc0(numwords * sizeof(PhraseInfo));
+
+	
+	if (numwords > 0 && in

Re: [HACKERS] phrase search

2008-06-02 Thread Sushant Sinha
On Mon, 2008-06-02 at 19:39 +0400, Teodor Sigaev wrote:
> 
> > I have attached a patch for phrase search with respect to the cvs head.
> > Basically it takes a phrase (text) and a TSVector. It checks if the
> > relative positions of lexeme in the phrase are same as in their
> > positions in TSVector.
> 
> Ideally, phrase search should be implemented as a new operator in tsquery,
> say #, with an optional distance. So, tsquery 'foo #2 bar' means: find all
> texts where 'bar' is placed no farther than two words from 'foo'. The
> complexity is about complex boolean expressions ( 'foo #1 ( bar1 & bar2 )' )
> and about several languages such as Norwegian or German. German has
> compound words, like footboolbar - and they have several variants of
> splitting, so the result of to_tsquery('foo # footboolbar') would be
> 'foo # ( ( football & bar ) | ( foot & ball & bar ) )', where the variants
> are connected with the OR operation.

This is far more complicated than I thought.

> Of course, phrase search should be able to use indexes.

I can probably look into how to use index. Any pointers on this?

> > 
> > If the configuration for text search is "simple", then this will produce
> > exact phrase search. Otherwise the stopwords in a phrase will be ignored
> > and the words in a phrase will only be matched with the stemmed lexeme.
> 
> Your solution can't be used as is, because the user has to use a tsquery as
> well in order to use an index:
> 
> column @@ to_tsquery('phrase search') AND is_phrase_present('phrase search',
> column)
> 
> The first clause will be used for the index scan and it will quickly find
> candidates.

Yes, this is exactly how I am using it in my application (the combined
query is sketched below). Do you think this will solve a lot of the common
cases, or should we try to get phrase search to:

1. Use an index
2. Support arbitrary distance between lexemes
3. Support complex boolean queries
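
For concreteness, the index-assisted form sketched above would look roughly
like this (a hedged sketch: the table and column names are made up, and
is_phrase_present is the function from my patch):

CREATE INDEX docs_tsv_idx ON docs USING gin (tsv);

SELECT id
FROM docs
WHERE tsv @@ plainto_tsquery('english', 'phrase search')  -- uses the index
  AND is_phrase_present('phrase search', tsv);            -- exact recheck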

-Sushant. 

> 
> > For my application I am using this as a separate shared object. I do not
> > know how to expose this function from the core. Can someone explain how
> > to do this?
> 
> Look at the contrib/ directory in pgsql's source code - make a contrib
> module from your patch. As an example, look at the adminpack module - it's
> rather simple.
> 
> Comments of your code:
> 1)
> +#ifdef PG_MODULE_MAGIC
> +PG_MODULE_MAGIC;
> +#endif
> 
> That isn't needed for compiled-in in core files, it's only needed for modules.
> 
> 2)
>   use only /* */ comments, do not use // (C++ style) comments


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline

2008-06-02 Thread Sushant Sinha
Efficiency: I realized that we do not need to store all norms. We need
to only store norms that are in the query. So I moved the addition
of norms from addHLParsedLex to hlfinditem. This should add very little
memory overhead to existing headline generation.

If this is still not acceptable for default headline generation, then I
can push it into mark_hl_fragments. But I think any headline marking
function will benefit by having the norms corresponding to the query.

Why do we need norms?

hlCover does exactly what Cover in tsrank does, which is to find the
cover that contains the query. However, hlCover has to go through
words that do not match the query. Cover, on the other hand, operates on
position indexes for just the query words, and so it should be faster.

The main reason I would like it to be fast is that I want to
generate all covers for a given query, then choose the covers with the
smallest length, as they best explain the relation of a query to a
document, and finally stretch those covers to the specified size.

In my understanding, the current headline generation tries to find the
biggest cover for display in the headline. I personally think that such
a cover does not explain the context of a query in a document. We may
differ on this and that's why we may need both options.

Let me know what you think of this patch, and I will update it to
respect other options like MinWords and ShortWord.

NumFragments < 2:
I wanted people to use the new headline marker if they specify
NumFragments >= 1. If they do not specify NumFragments or set it to
0, then the default marker will be used. This turns out to be a tricky
parameter, so please send any ideas on how to trigger the new marker.

On another note, I found that make_tsvector crashes if it receives a
ParsedText with curwords = 0. Specifically, uniqueWORD returns curwords
as 1 even when it gets 0 words. I am not sure if this is the desired
behavior.

-Sushant.


On Mon, 2008-06-02 at 18:10 +0400, Teodor Sigaev wrote:
> > I have attached a new patch with respect to the current cvs head. This
> > produces headline in a document for a given query. Basically it
> > identifies fragments of text that contain the query and displays them.
> New variant is much better, but...
> 
> >  HeadlineParsedText contains an array of  actual words but not
> > information about the norms. We need an indexed position vector for each
> > norm so that we can quickly evaluate a number of possible fragments.
> > Something that tsvector provides.
> 
> Why do you need to store norms? The single purpose of norms is identifying
> words from the query - but that's already done by hlfinditem. It sets
> HeadlineWordEntry->item to the corresponding QueryOperand in the tsquery.
> Look, the headline function is rather expensive and your patch adds a lot
> of extra work - at least in memory usage. And if the user calls it with
> NumFragments=0 then that work is unneeded.
> 
> > This approach does not change any other interface and fits nicely with
> > the overall framework.
> Yeah, it's a really big step forward. Thank you. You are very close to
> committing, except: did you look at the hlCover() function, which produces
> a cover from the original HeadlineParsedText representation? Is there any
> reason not to use it?
> 
> > 
> > The norms are converted into tsvector and a number of covers are
> > generated. The best covers are then chosen to be in the headline. The
> > covers are separated using a hardcoded coversep. Let me know if you want
> > to expose this as an option.
> 
> 
> > 
> > Covers that overlap with already chosen covers are excluded.
> > 
> > Some options like ShortWord and MinWords are not taken care of right
> > now. MaxWords are used as maxcoversize. Let me know if you would like to
> > see other options for fragment generation as well.
> ShortWord, MinWords and MaxWords should keep their meaning, but per
> fragment, not for the whole headline.
> 
> 
> > 
> > Let me know any more changes you would like to see.
> 
>  if (num_fragments == 0)
>  /* call the default headline generator */
>  mark_hl_words(prs, query, highlight, shortword, min_words, 
> max_words);
>  else
>  mark_hl_fragments(prs, query, highlight, num_fragments, 
> max_words);
> 
> 
> Suppose, num_fragments < 2?
> 
Index: src/backend/tsearch/dict.c
===
RCS file: /home/sushant/devel/pgsql-cvs/pgsql/src/backend/tsearch/dict.c,v
retrieving revision 1.5
diff -u -r1.5 dict.c
--- src/backend/tsearch/dict.c	25 Mar 2008 22:42:43 -	1.5
+++ src/backend/tsearch/dict.c	30 May 2008 23:20:57 -
@@ -16,6 +16,7 @@
 #include "catalog/pg_type.h"
 #include "tsearch/ts_cache.h"
 #include "tsearch/ts_utils.h"
+#include "tsearch/ts_public.h"
 #include "utils/builtins.h"
 
 
Index: src/backend/tsearch/to_tsany.c
===
RCS file: /h

Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline

2008-06-03 Thread Sushant Sinha
My main argument for using Cover instead of hlCover was that Cover would
be faster. I tested the default headline generation that uses hlCover
with the current patch that uses Cover. There was not much difference.
So I think you are right in that we do not need norms and we can just
use hlCover.

I also compared the performance of ts_headline with my first patch for
headline generation (the one that was a separate function and took a
tsvector as input). The performance was dramatically different. For one query
ts_headline took roughly 200 ms while headline_with_fragments took just
70 ms. On another query ts_headline took 76 ms while
headline_with_fragments took 24 ms. You can find 'explain analyze' for
the first query at the bottom of the page. 

These queries were run multiple times to ensure that I never hit the
disk. This is a machine with a 2.0 GHz Pentium 4 CPU and 512 MB RAM running
Linux 2.6.22-gentoo-r8.

A couple of caveats: 

1. ts_headline testing was done with the current cvs head, whereas
headline_with_fragments was done with postgres 8.3.1.

2. For headline_with_fragments, the TSVector for the document was obtained
by joining with another table.

Are these differences understandable?

If you think these caveats are the reason, or that there is something I am
missing, then I can repeat the entire experiment under exactly the same
conditions.

-Sushant.


Here is 'explain analyze' for both the functions:


ts_headline
-----------

lawdb=# explain analyze SELECT ts_headline('english', doc, q, '')
FROMdocraw, plainto_tsquery('english', 'freedom of
speech') as q
WHERE   docraw.tid = 125596;
                                QUERY PLAN
--------------------------------------------------------------------------
 Nested Loop  (cost=0.00..8.31 rows=1 width=497) (actual
time=199.692..200.207 rows=1 loops=1)
   ->  Index Scan using docraw_pkey on docraw  (cost=0.00..8.29 rows=1
width=465) (actual time=0.041..0.065 rows=1 loops=1)
 Index Cond: (tid = 125596)
   ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32) (actual
time=0.010..0.014 rows=1 loops=1)
 Total runtime: 200.311 ms


headline_with_fragments
-----------------------

lawdb=# explain analyze SELECT headline_with_fragments('english',
docvector, doc, q, 'MaxWords=40')
FROMdocraw, docmeta, plainto_tsquery('english', 'freedom
of speech') as q
WHERE   docraw.tid = 125596 and docmeta.tid=125596;
                                QUERY PLAN
--------------------------------------------------------------------------
 Nested Loop  (cost=0.00..16.61 rows=1 width=883) (actual
time=70.564..70.949 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..16.59 rows=1 width=851) (actual
time=0.064..0.094 rows=1 loops=1)
 ->  Index Scan using docraw_pkey on docraw  (cost=0.00..8.29
rows=1 width=454) (actual time=0.040..0.044 rows=1 loops=1)
   Index Cond: (tid = 125596)
 ->  Index Scan using docmeta_pkey on docmeta  (cost=0.00..8.29
rows=1 width=397) (actual time=0.017..0.040 rows=1 loops=1)
   Index Cond: (docmeta.tid = 125596)
   ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32) (actual
time=0.012..0.016 rows=1 loops=1)
 Total runtime: 71.076 ms
(8 rows)


On Tue, 2008-06-03 at 22:53 +0400, Teodor Sigaev wrote:
> > Why do we need norms?
> 
> We don't need norms at all - every matched HeadlineWordEntry is already
> marked by HeadlineWordEntry->item! If it equals NULL then the word isn't
> contained in the tsquery.
> 
> > hlCover does exactly what Cover in tsrank does, which is to find
> > the cover that contains the query. However, hlCover has to go through
> > words that do not match the query. Cover, on the other hand, operates on
> > position indexes for just the query words, and so it should be faster.
> Cover, by definition, is a minimal continuous piece of text matched by the
> query. There may be several covers in a text, and hlCover will find all of
> them. Next, prsd_headline() (for now) tries to determine the best one.
> "Best" means: the cover contains many words from the query, not fewer than
> MinWords, not more than MaxWords, has no words shorter than ShortWord at
> the beginning and end of the cover, etc.
> > 
> > The main reason I would like it to be fast is that I want to
> > generate all covers for a given query, then choose the covers with the smallest
> hlCover generates all covers.
> 
> > Let me know what you think of this patch, and I will update it to
> > respect other options like MinWords and ShortWord. 
> 
> As I understand it, you really wish to call the Cover() function instead of
> hlCover() - by design, they should be identical, but they accept different
> document representations. So the best way is to generalize them: develop a
> new one which can be called with some kind of callback and/or opaque
> structure, to use it in both rank and headline.
> 
> > 
> > NumFragments < 2:
> > I wanted people to use the new headlin

Re: [HACKERS] phrase search

2008-06-03 Thread Sushant Sinha
On Tue, 2008-06-03 at 22:16 +0400, Teodor Sigaev wrote:
> > This is far more complicated than I thought.
> >> Of course, phrase search should be able to use indexes.
> > I can probably look into how to use index. Any pointers on this?
> 
> src/backend/utils/adt/tsginidx.c; if you invent an operation # in tsquery
> then you will have index support with minimal effort.
> > 
> > Yes this is exactly how I am using in my application. Do you think this
> > will solve a lot of common case or we should try to get phrase search
> 
> Yeah, it solves a lot of useful cases. For simple use it is necessary to
> invent a function similar to the existing plainto_tsquery, say
> phraseto_tsquery. It should produce a correct tsquery with the operations
> described above.
> 

I can add index support and support for arbitrary distance between
lexemes.

It appears to me that supporting arbitrary boolean expressions will be
complicated. Can we pull out something from TSQuery?

-Sushant.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline

2008-06-21 Thread Sushant Sinha
I have attached an updated patch with the following changes:

1. Respects ShortWord and MinWords
2. Uses hlCover instead of Cover
3. Does not store norm (or lexeme) for headline marking
4. Removes ts_rank.h
5. Earlier it was counting even NONWORDTOKEN in the headline. Now it
only counts the actual words and excludes spaces etc.

I have also changed the NumFragments option to MaxFragments, as there may
not be enough covers to display the requested number of fragments.

Another change that I was thinking:

Right now, if the cover size > max_words, I just cut the trailing words.
Instead I was thinking that we should split the cover into more
fragments such that each fragment contains a few query words. Then each
fragment will not contain all the query words but will show more occurrences
of query words in the headline. I would like to know what your opinion
on this is.

-Sushant.

On Thu, 2008-06-05 at 20:21 +0400, Teodor Sigaev wrote:
> > A couple of caveats: 
> > 
> > 1. ts_headline testing was done with the current cvs head, whereas
> > headline_with_fragments was done with postgres 8.3.1.
> > 2. For headline_with_fragments, TSVector for the document was obtained
> > by joining with another table.
> > Are these differences understandable?
> 
> That is a possible situation because ts_headline has several criteria for
> the 'best' cover - length, number of words from the query, good words at
> the beginning and at the end of the headline - while your fragment
> algorithm only takes care of the total number of words in all covers. It's
> not very good, but it's acceptable, I think. Headline generation (and
> ranking too) has no formal rules to define whether it is good or bad; just
> people's opinions.
> 
> Next possible reason: the original algorithm looked at all covers, trying
> to find the best one, while your algorithm tries to find just the shortest
> covers to fill a headline.
> 
> But it's very desirable to respect ShortWord - it's not very comfortable
> for the user if one option produces an unobvious side effect with another
> one.
> 
> > If you think these caveats are the reasons or there is something I am
> > missing, then I can repeat the entire experiments with exactly the same
> > conditions. 
> 
> The interesting test for me is comparing hlCover with Cover in your patch,
> i.e. develop a patch which uses hlCover instead of Cover and compare the
> old patch with the new one.
Index: src/backend/tsearch/wparser_def.c
===
RCS file: /home/sushant/devel/pgsql-cvs/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.14
diff -c -r1.14 wparser_def.c
*** src/backend/tsearch/wparser_def.c	1 Jan 2008 19:45:52 -	1.14
--- src/backend/tsearch/wparser_def.c	21 Jun 2008 07:59:02 -
***
*** 1684,1701 
  	return false;
  }
  
! Datum
! prsd_headline(PG_FUNCTION_ARGS)
  {
! 	HeadlineParsedText *prs = (HeadlineParsedText *) PG_GETARG_POINTER(0);
! 	List	   *prsoptions = (List *) PG_GETARG_POINTER(1);
! 	TSQuery		query = PG_GETARG_TSQUERY(2);
  
! 	/* from opt + start and and tag */
! 	int			min_words = 15;
! 	int			max_words = 35;
! 	int			shortword = 3;
  
  	int			p = 0,
  q = 0;
  	int			bestb = -1,
--- 1684,1891 
  	return false;
  }
  
! static void 
! mark_fragment(HeadlineParsedText *prs, int highlight, int startpos, int endpos)
  {
! 	int   i;
! 	char *coversep = "...";
!	int   coverlen = strlen(coversep);
  
! 	for (i = startpos; i <= endpos; i++)
! 	{
! 		if (prs->words[i].item)
! 			prs->words[i].selected = 1;
! 		if (highlight == 0)
! 		{
! 			if (HLIDIGNORE(prs->words[i].type))
! prs->words[i].replace = 1;
! 		}
! 		else
! 		{
! 			if (XMLHLIDIGNORE(prs->words[i].type))
! prs->words[i].replace = 1;
! 		}
! 
! 		prs->words[i].in = (prs->words[i].repeated) ? 0 : 1;
! 	}
! 	/* add cover separators if needed */ 
! 	if (startpos > 0 && strncmp(prs->words[startpos-1].word, coversep, 
! 		prs->words[startpos-1].len) != 0)
! 	{
! 		
! 		prs->words[startpos-1].word = repalloc(prs->words[startpos-1].word, sizeof(char) * coverlen);
! 		prs->words[startpos-1].in   = 1;
! 		prs->words[startpos-1].len  = coverlen;
! 		memcpy(prs->words[startpos-1].word, coversep, coverlen);
! 	}
! 	if (endpos+1 < prs->curwords && strncmp(prs->words[endpos+1].word, coversep,
! 		prs->words[endpos+1].len) != 0)
! 	{
! 		prs->words[endpos+1].word = repalloc(prs->words[endpos+1].word, sizeof(char) * coverlen);
! 		prs->words[endpos+1].in   = 1;
! 		prs->words[endpos+1].len  = coverlen;
! 		memcpy(prs->words[endpos+1].word, coversep, coverlen);
! 	}
! 	}
! }
! 
! typedef struct 
! {
! 	int4 startpos;
! 	int4 endpos;
! 	int2 in;
! 	int2 excluded;
! } CoverPos;
! 
! 
! static void
! mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, int highlight,
! int shortword, int min_words, 
! 			int max_words, int max_fragments)
! {
! 	int4   	curlen, coverlen, i, f, num_f;
! 	int4		stretch, maxstretch;
! 
! 	int4   	startpos = 0, 
!  			endpos   = 0,
! 			p= 0,
! 

[HACKERS] initdb in current cvs head broken?

2008-07-10 Thread Sushant Sinha
I am trying to generate a patch with respect to the current CVS head. So
I rsynced the tree, then did cvs up and installed the db. However, when
I ran initdb on a data directory it got stuck:

It is stuck after printing:
creating template1 database in /home/postgres/data/base/1 ...

I did strace  

$ strace -p 9852
Process 9852 attached - interrupt to quit
waitpid(9864,

then I  straced 9864

$ strace -p 9864
Process 9864 attached - interrupt to quit
semop(8060958, 0xbff36fee,

 $ ps aux|grep 9864   
postgres  9864  1.5  1.3  37296  6816 pts/1S+   07:51
0:02 /usr/local/pgsql/bin/postgres --boot -x1 -F


Seems like a bug to me. Is the tree stable only after commit fests, meaning
I should not use the unstable tree for generating patches?

Thanks,
-Sushant.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb in current cvs head broken?

2008-07-10 Thread Sushant Sinha
You are right. I did not do make clean last time. After make clean, make
all, and make install it works fine. 

-Sushant.

On Thu, 2008-07-10 at 17:55 +0530, Pavan Deolasee wrote:
> On Thu, Jul 10, 2008 at 5:36 PM, Sushant Sinha <[EMAIL PROTECTED]> wrote:
> >
> >
> >
> > Seems like a bug to me. Is the tree stable only after commit fests and I
> > should not use the unstable tree for generating patches?
> >
> 
> I quickly tried on my repo and it's working fine. (Well, it could be a
> bit out of sync with the head.)
> 
> Usually, the tree may get a bit inconsistent during the active period,
> but it's not very common. I've seen committers doing a good job before
> checking in any code and making sure it works fine (at least initdb and
> regression tests).
> 
> I would suggest doing a clean build at your end once again.
> 
> Thanks,
> Pavan
> 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] possible bug in cover density ranking?

2009-01-28 Thread Sushant Sinha
I am running postgres 8.3.1. In tsrank.c I am looking at the cover
density function used for ranking while doing text search:
float4
calc_rank_cd(float4 *arrdata, TSVector txt, TSQuery query, int method)


Here is the excerpt of code that I think may have a bug when the
document is big enough to exceed the 16383 position limit.

CODE
===
Cpos = ((double) (ext.end - ext.begin + 1)) / InvSum;

/*
 * if docs are big enough then ext.q may be equal to ext.p due to the limit
 * of positional information. In this case we approximate the number of
 * noise words as half the cover's length
 */
nNoise = (ext.q - ext.p) - (ext.end - ext.begin);
if (nNoise < 0)
    nNoise = (ext.end - ext.begin) / 2;
Wdoc += Cpos / ((double) (1 + nNoise));
===

As per my understanding, ext.end - ext.begin + 1 is the number of query
items in the cover and ext.q - ext.p gives the length of the cover.

So consider a query with two query items. When we run out of position
information, Cover returns ext.q = 16383 and ext.p = 16383, and the
number of query items = ext.end - ext.begin + 1 = 2.

nNoise becomes -1 and then nNoise is reset to (ext.end - ext.begin)/2 = 0.
Wdoc becomes Cpos = 2/InvSum = 2/(1/0.1 + 1/0.1) = 0.1.

Is this what is desired? It seems to me that Wdoc is getting a high
ranking even when we are not sure of the position information. 

The comment above says that "In this case we approximate the number of
noise words as half the cover's length". But we do not know the cover's
length in this case, as ext.p and ext.q are both unreliable. And ext.end
- ext.begin is not the cover's length; it is one less than the number of
query items found in the cover.

Any clarification would be useful. 

Thanks,
-Sushant.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] possible bug in cover density ranking?

2009-01-29 Thread Sushant Sinha
On Thu, Jan 29, 2009 at 12:38 PM, Teodor Sigaev  wrote:

> Is this what is desired? It seems to me that Wdoc is getting a high
>> ranking even when we are not sure of the position information.
>>
> 0.1 is not a very high rank, and we could not suggest any reasonable rank
> in this case. This document may be good, may be bad. rank_cd is not
> bounded by 1.



For a cover of 2 query items, 0.1 is actually the maximum rank. This is only
possible when both query items are adjacent to each other.

0.1 may not seem too high when we look at its absolute value. But the problem
is that we are ranking a document for which we have no positional information
available higher than a document for which we may have positional
information available with, let us suppose, a cover length of 3. I think we
should rank the document with cover length 3 higher than the document for
which we have no positional information (and assume a cover length of 2, as
we are doing now).

I feel that if ext.p = ext.q for query items > 1, then we should not count
that cover for ranking at all. Or, another option would be to significantly
inflate nNoise in this scenario to, say, 100. Putting
nNoise = (ext.end - ext.begin)/2 is way too low for covers that we have no
idea about (it is 0 for query items = 2).

I am not assuming or suggesting that rank_cd is bounded by one. Of course
its rank increases as more and more covers are added.

Thanks,
Sushant.

>
>
>
>> The comment above says that "In this case we approximate number of
>> noise word as half cover's length". But we do not know the cover's
>> length in this case as ext.p and ext.q are both unreliable. And ext.end
>> -ext.begin is not the "cover's length". It is the number of query items
>> found in the cover.
>>
>
> Yeah, but if there is no information then information is absent :), but I
> agree with you to change the comment
> --
> Teodor Sigaev   E-mail: teo...@sigaev.ru
>   WWW:
> http://www.sigaev.ru/
>


Re: [HACKERS] Ellipses around result fragment of ts_headline

2009-02-14 Thread Sushant Sinha
I think we currently do that. We add ellipses only when we encounter a
new fragment. So there should not be ellipses if we are at the end of
the document or if it is the first fragment (which includes the beginning
of the document). Here is the code in generateHeadline, ts_parse.c that
adds the ellipses:

if (!infrag)
{
    /* start of a new fragment */
    infrag = 1;
    numfragments++;

    /* add a fragment delimiter if this is after the first one */
    if (numfragments > 1)
    {
        memcpy(ptr, prs->fragdelim, prs->fragdelimlen);
        ptr += prs->fragdelimlen;
    }
}

It is possible that there is a bug that needs to be fixed. Can you show
me an example where you found that?

-Sushant.




On Sat, 2009-02-14 at 15:13 -0500, Asher Snyder wrote:
> It would be very useful if there were an option to have ts_headline append
> ellipses before or after a result fragment based on the position of the
> fragment in the source document. For instance, when running ts_headline(doc,
> query) it will correctly return a fragment with words highlighted; however,
> there's no easy way to determine whether this returned fragment is at the
> beginning or end of the original doc and add the necessary ellipses.
> 
> Searches such as postgresql.org ALWAYS add ellipses before or after the
> fragment regardless of whether or not ellipses are warranted. In my opinion,
> always adding ellipses to the fragment is deceptive to the user; in many of
> my search result cases, the fragment is at the beginning of the doc, and it
> would confuse the user to always see ellipses. So you can see how useful the
> feature described above would be to the accuracy of the search result
> fragment.
> 
> 
> 
> 
> 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Ellipses around result fragment of ts_headline

2009-02-14 Thread Sushant Sinha
The documentation in 8.4dev has information on FragmentDelimiter
http://developer.postgresql.org/pgdocs/postgres/textsearch-controls.html

If you do not specify MaxFragments > 0, then the default headline
generator kicks in. The default headline generator does not have any
fragment delimiter. So it is correct that you will not see any
delimiter.
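
For example (a hedged sketch, where doc and query stand for a document
column and a tsquery):

SELECT ts_headline('english', doc, query, 'MaxFragments=2');  -- fragment generator
SELECT ts_headline('english', doc, query, 'MinWords=17');     -- default generator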

I think you are looking for the default headline generator to add
ellipses as well, depending on where the fragment is. I do not know what
other people's opinion on this is.

-Sushant.

On Sat, 2009-02-14 at 16:21 -0500, Asher Snyder wrote:
> Interesting, it could be that you already do it, but the documentation makes
> no reference to a fragment delimiter, so there's no way that I can see to
> add one. The documentation for ts_headline only lists StartSel, StopSel,
> MaxWords, MinWords, ShortWord, and HighlightAll; there appears to be no
> option for a fragment delimiter.
> 
> In my case I do:
> 
> SELECT v1.id, v1.type_id, v1.title, ts_headline(v1.copy, query, 'MinWords =
> 17') as copy, ts_rank(v1.text_search, query) AS rank FROM 
>   (SELECT b1.*, (setweight(to_tsvector(coalesce(b1.title,'')), 'A')
> ||
>  setweight(to_tsvector(coalesce(b1.copy,'')), 'B')) as text_search
>FROM search.v_searchable_content b1) v1,  
>   plainto_tsquery($1) query
> WHERE ($2 IS NULL OR (type_id = ANY($2))) AND query @@ v1.text_search ORDER
> BY rank DESC, title
> 
> Now, this use of ts_headline correctly returns me highlighted, fragmented
> search results, but there will be no fragment delimiter in the headline.
> Some suggestions were to change ts_headline(v1.copy, query, 'MinWords = 17')
> to '...' || ts_headline(v1.copy, query, 'MinWords = 17') || '...', but as you
> can clearly see this would always add the ellipses and not be intelligent
> regarding the fragments. I hope that you're correct and that it is
> implemented but not documented.
> 
> >-Original Message-
> >From: Sushant Sinha [mailto:sushant...@gmail.com]
> >Sent: Saturday, February 14, 2009 4:07 PM
> >To: Asher Snyder
> >Cc: pgsql-hackers@postgresql.org
> >Subject: Re: [HACKERS] Ellipses around result fragment of ts_headline
> >
> >I think we currently do that. We add ellipses only when we encounter a
> >new fragment. So there should not be ellipses if we are at the end of
> >the document or if that is the first fragment (includes the beginning of
> >the document). Here is the code in generateHeadline, ts_parse.c that
> >adds the ellipses:
> >
> >if (!infrag)
> >{
> >
> >/* start of a new fragment */
> >infrag = 1;
> >numfragments ++;
> >/* add a fragment delimitor if this is after the first
> >one */
> >if (numfragments > 1)
> >{
> >memcpy(ptr, prs->fragdelim, prs->fragdelimlen);
> >ptr += prs->fragdelimlen;
> >}
> >
> >}
> >
> >It is possible that there is a bug that needs to be fixed. Can you show
> >me an example where you found that?
> >
> >-Sushant.
> >
> >
> >
> >
> >On Sat, 2009-02-14 at 15:13 -0500, Asher Snyder wrote:
> >> It would be very useful if there were an option to have ts_headline
> >append
> >> ellipses before or after a result fragement based on the position of
> >the
> >> fragment in the source document. For instance, when running
> >ts_headline(doc,
> >> query) it will correctly return a fragment with words highlighted,
> >however,
> >> there's no easy way to determine whether this returned fragment is at
> >the
> >> beginning or end of the original doc, and add the necessary ellipses.
> >>
> >> Searches such as postgresql.org ALWAYS add ellipses before or after
> >the
> >> fragment regardless of whether or not ellipses are warranted. In my
> >opinion
> >> always adding ellipses to the fragment is deceptive to the user, in
> >many of
> >> my search result cases, the fragment is at the beginning of the doc,
> >and
> >> would confuse the user to always see ellipses. So you can see how
> >useful the
> >> feature described above would be beneficial to the accuracy of the
> >search
> >> result fragment.
> >>
> >>
> >>
> >>
> >>
> 
> 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Ellipses around result fragment of ts_headline

2009-02-14 Thread Sushant Sinha
Sorry ... I thought you were running the development branch.

-Sushant.

On Sat, 2009-02-14 at 16:34 -0500, Tom Lane wrote:
> Sushant Sinha  writes:
> > I think we currently do that.
> 
> ... since about four months ago.
> 
> 2008-10-17 14:05  teodor
> 
>   * doc/src/sgml/textsearch.sgml, src/backend/tsearch/ts_parse.c,
>   src/backend/tsearch/wparser_def.c, src/include/tsearch/ts_public.h,
>   src/test/regress/expected/tsearch.out,
>   src/test/regress/sql/tsearch.sql: Improve headeline generation. Now
>   headline can contain several fragments a-la Google.
>   
>   Sushant Sinha 
> 
>   regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] patch for space around the FragmentDelimiter

2009-03-01 Thread Sushant Sinha
FragmentDelimiter is an argument to the ts_headline function that separates
different headline fragments. The default delimiter is " ... ".
Currently, if someone specifies the delimiter as an option to the
function, no extra space is added around the delimiter. However, it does
not look good without space around the delimiter.

Since the option parsing function removes any space around the given
value, it is not possible to add any desired space. The attached patch
adds space when a FragmentDelimiter is specified.

QUERY:

SELECT ts_headline('english', '
Day after day, day after day,
  We stuck, nor breath nor motion,
As idle as a painted Ship
  Upon a painted Ocean.
Water, water, every where
  And all the boards did shrink;
Water, water, every where,
  Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', 'Coleridge & stuck'),
'MaxFragments=2,FragmentDelimiter=***');

OLD RESULT
 ts_headline
------------------------------------------------
 after day, day after day,
   We stuck, nor breath nor motion,
 As idle as a painted Ship
   Upon a painted Ocean.
 Water, water, every where
   And all the boards did shrink;
 Water, water, every where***drop to drink.
 S. T. Coleridge
(1 row)




NEW RESULT after the patch

 ts_headline  
--
 after day, day after day,
   We stuck, nor breath nor motion,
 As idle as a painted Ship
   Upon a painted Ocean.
 Water, water, every where
   And all the boards did shrink;
 Water, water, every where *** drop to drink.
 S. T. Coleridge



Index: src/backend/tsearch/wparser_def.c
===
RCS file: /home/sushant/devel/pgrep/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.20
diff -c -r1.20 wparser_def.c
*** src/backend/tsearch/wparser_def.c	15 Jan 2009 16:33:59 -	1.20
--- src/backend/tsearch/wparser_def.c	2 Mar 2009 06:00:02 -
***
*** 2082,2087 
--- 2082,2088 
  	int			shortword = 3;
  	int			max_fragments = 0;
  	int			highlight = 0;
+ 	int			len;
  	ListCell   *l;
  
  	/* config */
***
*** 2105,2111 
  		else if (pg_strcasecmp(defel->defname, "StopSel") == 0)
  			prs->stopsel = pstrdup(val);
  		else if (pg_strcasecmp(defel->defname, "FragmentDelimiter") == 0)
! 			prs->fragdelim = pstrdup(val);
  		else if (pg_strcasecmp(defel->defname, "HighlightAll") == 0)
  			highlight = (pg_strcasecmp(val, "1") == 0 ||
  		 pg_strcasecmp(val, "on") == 0 ||
--- 2106,2116 
  		else if (pg_strcasecmp(defel->defname, "StopSel") == 0)
  			prs->stopsel = pstrdup(val);
  		else if (pg_strcasecmp(defel->defname, "FragmentDelimiter") == 0)
! 		{
! 			len = strlen(val) + 2 + 1;/* 2 for spaces and 1 for end of string */
! 			prs->fragdelim = palloc(len * sizeof(char));
! 			snprintf(prs->fragdelim, len, " %s ", val);
! 		}
  		else if (pg_strcasecmp(defel->defname, "HighlightAll") == 0)
  			highlight = (pg_strcasecmp(val, "1") == 0 ||
  		 pg_strcasecmp(val, "on") == 0 ||
Index: src/test/regress/expected/tsearch.out
===
RCS file: /home/sushant/devel/pgrep/pgsql/src/test/regress/expected/tsearch.out,v
retrieving revision 1.15
diff -c -r1.15 tsearch.out
*** src/test/regress/expected/tsearch.out	17 Oct 2008 18:05:19 -	1.15
--- src/test/regress/expected/tsearch.out	2 Mar 2009 02:02:38 -
***
*** 624,630 
   
   Sea view wow foo bar qq
   http://www.google.com/foo.bar.html"; target="_blank">YES  
!   ff-bg
   
  document.write(15);
   
--- 624,630 
   
   Sea view wow foo bar qq
   http://www.google.com/foo.bar.html"; target="_blank">YES  
!  ff-bg
   
  document.write(15);
   
***
*** 712,726 
Nor any drop to drink.
  S. T. Coleridge (1772-1834)
  ', to_tsquery('english', 'Coleridge & stuck'), 'MaxFragments=2,FragmentDelimiter=***');
! ts_headline 
! 
   after day, day after day,
 We stuck, nor breath nor motion,
   As idle as a painted Ship
 Upon a painted Ocean.
   Water, water, every where
 And all the boards did shrink;
!  Water, water, every where***drop to drink.
   S. T. Coleridge
  (1 row)
  
--- 712,726 
Nor any drop to drink.
  S. T. Coleridge (1772-1834)
  ', to_tsquery('english', 'Coleridge & stuck'), 'MaxFragments=2,FragmentDelimiter=***');
!  ts_headline  
! --
   after day, day after day,
 We stuck, nor breath nor motion,
   As idle as a painted Ship
 Upon a painted Ocean.
   Water, water, every where
 And all the boards did shrink;
!  Water, water, every where *** drop to drink.
   S. T. Coleridge
  (1 row)
  

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http:

Re: [HACKERS] patch for space around the FragmentDelimiter

2009-03-01 Thread Sushant Sinha
Yeah, you are right. I did not know that you can pass spaces using double
quotes.

-Sushant.

On Sun, 2009-03-01 at 20:49 -0500, Tom Lane wrote:
> Sushant Sinha  writes:
> > FragmentDelimiter is an argument for ts_headline function to separates
> > different headline fragments. The default delimiter is " ... ".
> > Currently if someone specifies the delimiter as an option to the
> > function, no extra space is added around the delimiter. However, it does
> > not look good without space around the delimiter.
> 
> Maybe not to you, for the particular delimiter you happen to be working
> with, but it doesn't follow that spaces are always appropriate.
> 
> > Since the option parsing function removes any space around the  given
> > value, it is not possible to add any desired space. The attached patch
> > adds space when a FragmentDelimiter is specified.
> 
> I think this is a pretty bad idea.  Better would be to document how to
> get spaces into the delimiter, ie, use double quotes:
> 
>   ... FragmentDelimiter = " ... " ...
> 
> Hmm, actually, it looks to me that the documentation already shows this,
> in the example of the default values.
> 
>   regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline

2009-04-13 Thread Sushant Sinha
Headline generation uses hlCover to get fragments of text containing *all*
query items. If there is no such fragment, it does not return anything.

What you are asking for will require either returning *maximally* matching
covers or handling this as a separate case.
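
For reference, the fallback that the documentation (quoted below) describes
can be seen with a partially matching query. A hedged sketch:

SELECT ts_headline('english',
                   'The quick brown fox jumped over the lazy dog',
                   to_tsquery('english', 'fox & elephant'),
                   'MaxFragments=2, MinWords=5');
-- 'elephant' never appears, so there is no cover and the generator falls
-- back to the first MinWords of the document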

-Sushant.


On Mon, 2009-04-13 at 20:57 -0400, Tom Lane wrote:
> Sushant Sinha  writes:
> > Sorry for the delay. Here is the patch with FragmentDelimiter option. 
> > It requires an extra option in HeadlineParsedText and uses that option
> > during generateHeadline.
> 
> I did some editing of the documentation for this patch and noticed that
> the explanation of the fragment-based headline method says
> 
>If not all query words are found in the
>document, then a single fragment of the first MinWords
>in the document will be displayed.
> 
> (That's what it says now, that is, based on my editing and testing of
> the original.)  This seems like a pretty dumb fallback approach ---
> if you have only a partial match, the headline generation suddenly
> becomes about as stupid as it could possibly be.  I could understand
> doing the above if the text actually contains *none* of the query
> words, but surely if it contains some of them we should still select
> fragments centered on those words.
> 
>   regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] possible bug in cover density ranking?

2009-05-01 Thread Sushant Sinha
I see this in the open items here:

http://wiki.postgresql.org/wiki/PostgreSQL_8.4_Open_Items

Any interest in fixing this?

-Sushant.

On Thu, 2009-01-29 at 13:54 -0500, Sushant Sinha wrote:
> 
> 
> On Thu, Jan 29, 2009 at 12:38 PM, Teodor Sigaev 
> wrote:
> Is this what is desired? It seems to me that Wdoc is getting a high
> ranking even when we are not sure of the position information.
> 
> 0.1 is not a very high rank, and we could not suggest any reasonable
> rank in this case. This document may be good, may be bad. rank_cd is
> not bounded by 1.
> 
>  
> For a cover of 2 query items, 0.1 is actually the maximum rank. This
> is only possible when both query items are adjacent to each other.
> 
> 0.1 may not seem too high when we look at its absolute value. But the
> problem is we are ranking a document for which we have no positional
> information available higher than a document for which we may have
> positional information available with let suppose the cover length of
> 3. I think we should rank the document with cover length 3 higher than
> the document for which we have no positional information (and assume
> cover length of 2 as we are doing now).
> 
> I feel that if ext.p=ext.q for query items > 1, then we should not
> count that cover for ranking at all. Or, another option will be to
> significantly inflate nNoise in this scenario to, say, 100. Putting
> nNoise=(ext.end-ext.begin)/2 is way too low for covers that we have no
> idea on (it is 0 for query items = 2).
> 
> I am not assuming or suggesting that rank_cd is bounded by one. Of
> course its rank increases as more and more covers are added.
> 
> Thanks,
> Sushant.
> 
> 
> 
> The comment above says that "In this case we
> approximate number of
> noise word as half cover's length". But we do not know
> the cover's
> length in this case as ext.p and ext.q are both
> unreliable. And ext.end
> -ext.begin is not the "cover's length". It is the
> number of query items
> found in the cover.
> 
> 
> Yeah, but if there is no information then information is
> absent :), but I agree with you to change the comment
> -- 
> Teodor Sigaev   E-mail:
> teo...@sigaev.ru
>   WWW:
> http://www.sigaev.ru/
> 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] dot to be considered as a word delimiter?

2009-05-29 Thread Sushant Sinha
Currently it seems that a dot is not considered a word delimiter
by the English parser.

lawdb=# select to_tsvector('english', 'Mr.J.Sai Deepak');
       to_tsvector       
-------------------------
 'deepak':2 'mr.j.sai':1
(1 row)

So the word obtained is "mr.j.sai" rather than the three words "mr", "j",
and "sai".

It does it correctly if there is a space in between, as a space is
definitely a word delimiter.

lawdb=# select to_tsvector('english', 'Mr. J. Sai Deepak');
           to_tsvector           
---------------------------------
 'j':2 'mr':1 'sai':3 'deepak':4
(1 row)


I think that a dot should be considered as a word delimiter because
when a dot is not followed by a space, most of the time it is a typing
error. Besides, there are not many valid English words that have a dot
in the middle.

-Sushant.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dot to be considered as a word delimiter?

2009-06-02 Thread Sushant Sinha
Fair enough. I agree that there is a valid need for returning such tokens as
a host. But I think there is definitely a need to break it down into
individual words. This will help in cases where a document is missing a space
between words.


So what we can do is: return the entire compound word as Host and also break
it down into individual words. I can put up a patch for this if you guys
agree.

Returning multiple tokens for the same word is a feature of the text search
parser as explained in the documentation here:
http://www.postgresql.org/docs/8.3/static/textsearch-parsers.html

Thanks,
Sushant.

On Tue, Jun 2, 2009 at 8:47 AM, Kenneth Marshall  wrote:

> On Mon, Jun 01, 2009 at 08:22:23PM -0500, Kevin Grittner wrote:
> > Sushant Sinha  wrote:
> >
> > > I think that a dot should be considered as a word delimiter because
> > > when a dot is not followed by a space, most of the time it is a typing
> > > error. Besides, there are not many valid English words that have a
> > > dot in the middle.
> >
> > It's not treating it as an English word, but as a host name.
> >
> > select ts_debug('english', 'Mr.J.Sai Deepak');
> >  ts_debug
> > ------------------------------------------------------
> >  (host,Host,Mr.J.Sai,{simple},simple,{mr.j.sai})
> >  (blank,"Space symbols"," ",{},,)
> >  (asciiword,"Word, all ASCII",Deepak,{english_stem},english_stem,{deepak})
> > (3 rows)
> >
> > You could run it through a dictionary which would deal with host
> > tokens differently.  Just be aware of what you'll be doing to
> > www.google.com if you run into it.
> >
> > I hope this helps.
> >
> > -Kevin
> >
>
> In our uses for full text indexing, it is much more important to
> be able to find host names and URLs than to find mistyped names.
> My two cents.
>
> Cheers,
> Ken
>


Re: [HACKERS] It's June 1; do you know where your release is?

2009-06-02 Thread Sushant Sinha
On Tue, 2009-06-02 at 17:26 -0700, Josh Berkus wrote:
> 
>  * possible bug in cover density ranking?
> 
> -- From Teodor's response, this is maybe a doc patch and not a code 
> patch.  Teodor?  Oleg?


I personally think that this is a bug, because we are assigning a very
high rank when we are not sure about the positional information. This is
not a show stopper, though.

-Sushant.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers