Fair enough. I agree that there is a valid need to return such tokens as
a host. But I think there is also a need to break them down into
individual words. This will help in cases where a document is missing a
space between words.


So what we can do is: return the entire compound word as Host and also break
it down into individual words. I can put up a patch for this if you guys
agree.
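
To make the idea concrete, here is a rough sketch in Python of the
behaviour I have in mind. This is not the parser code. The names
(tokenize, HOST, WORD) and the regular expression used to spot a
host-like token are only assumptions for illustration; the real text
search parser has its own rules for recognising hosts.

```python
import re

# Hypothetical token type names, loosely mirroring the ts_debug output.
HOST = "host"
WORD = "asciiword"

# Assumption: a "host-like" token is two or more runs of letters, digits
# or hyphens joined by dots. The real parser's host rules differ.
_HOST_RE = re.compile(r"^[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+$")


def tokenize(text):
    """Yield (type, token) pairs for whitespace-separated chunks.

    A host-like chunk is emitted once as a whole host token and then
    again as its dot-separated component words.
    """
    for chunk in text.split():
        if _HOST_RE.match(chunk):
            yield (HOST, chunk)
            for part in chunk.split("."):
                yield (WORD, part)
        else:
            yield (WORD, chunk)


if __name__ == "__main__":
    for token_type, token in tokenize("Mr.J.Sai Deepak"):
        print(token_type, token)
```

With this, a search for www.google.com still matches the host token,
while a search for Sai also finds the document.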

Returning multiple tokens for the same word is a feature of the text search
parser as explained in the documentation here:
http://www.postgresql.org/docs/8.3/static/textsearch-parsers.html

Thanks,
Sushant.

On Tue, Jun 2, 2009 at 8:47 AM, Kenneth Marshall <k...@rice.edu> wrote:

> On Mon, Jun 01, 2009 at 08:22:23PM -0500, Kevin Grittner wrote:
> > Sushant Sinha <sushant...@gmail.com> wrote:
> >
> > > I think that a dot should be considered a word delimiter because
> > > when a dot is not followed by a space, most of the time it is a
> > > typing error. Besides, there are not many valid English words that
> > > have a dot in between.
> >
> > It's not treating it as an English word, but as a host name.
> >
> > select ts_debug('english', 'Mr.J.Sai Deepak');
> >                                  ts_debug
> > ---------------------------------------------------------------------------
> >  (host,Host,Mr.J.Sai,{simple},simple,{mr.j.sai})
> >  (blank,"Space symbols"," ",{},,)
> >  (asciiword,"Word, all ASCII",Deepak,{english_stem},english_stem,{deepak})
> > (3 rows)
> >
> > You could run it through a dictionary which would deal with host
> > tokens differently.  Just be aware of what you'll be doing to
> > www.google.com if you run into it.
> >
> > I hope this helps.
> >
> > -Kevin
> >
>
> In our uses of full text indexing, it is much more important to
> be able to find host names and URLs than to find mistyped names.
> My two cents.
>
> Cheers,
> Ken
>
