Fair enough. I agree that there is a valid need to return such tokens as a host. But I think there is also a definite need to break them down into individual words. This will help in cases where a document is missing a space between words.
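To make the idea concrete, here is a toy sketch in Python (not the parser's actual C code) of keeping the compound token as a host and additionally emitting its dot-separated parts. The function name `split_host_token` and the `word` label are made up for illustration; only the `host` label comes from the ts_debug output quoted below. It assumes the input has already been classified as a host token.

```python
def split_host_token(token):
    """Return (type, text) pairs: the whole token as a host, then its parts.

    Hypothetical helper for illustration only; this is not PostgreSQL's API.
    """
    # Always keep the compound token so host name searches still match.
    tokens = [("host", token)]
    # Assumption: the individual words are the non-empty pieces between dots.
    parts = [part for part in token.split(".") if part]
    if len(parts) > 1:
        tokens.extend(("word", part) for part in parts)
    return tokens


for token_type, text in split_host_token("Mr.J.Sai"):
    print(token_type, text)
```

With this, 'Mr.J.Sai' would still be findable as a host, and 'Sai' would be findable on its own. The same applies to www.google.com, which would keep its host token and gain 'www', 'google' and 'com'.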
So what we can do is return the entire compound word as a Host token and also break it down into individual words. I can put up a patch for this if you guys agree. Returning multiple tokens for the same word is a feature of the text search parser, as explained in the documentation here:

http://www.postgresql.org/docs/8.3/static/textsearch-parsers.html

Thanks,
Sushant.

On Tue, Jun 2, 2009 at 8:47 AM, Kenneth Marshall <k...@rice.edu> wrote:

> On Mon, Jun 01, 2009 at 08:22:23PM -0500, Kevin Grittner wrote:
> > Sushant Sinha <sushant...@gmail.com> wrote:
> >
> > > I think that dot should be considered as a word delimiter because
> > > when dot is not followed by a space, most of the time it is an error
> > > in typing. Besides, there are not many valid English words that have
> > > a dot in between.
> >
> > It's not treating it as an English word, but as a host name.
> >
> > select ts_debug('english', 'Mr.J.Sai Deepak');
> >                                  ts_debug
> > ---------------------------------------------------------------------------
> >  (host,Host,Mr.J.Sai,{simple},simple,{mr.j.sai})
> >  (blank,"Space symbols"," ",{},,)
> >  (asciiword,"Word, all ASCII",Deepak,{english_stem},english_stem,{deepak})
> > (3 rows)
> >
> > You could run it through a dictionary which would deal with host
> > tokens differently.  Just be aware of what you'll be doing to
> > www.google.com if you run into it.
> >
> > I hope this helps.
> >
> > -Kevin
> >
>
> In our uses for full text indexing, it is much more important to
> be able to find host names and URLs than to find mistyped names.
> My two cents.
>
> Cheers,
> Ken
>