On Tue, Aug 14, 2012 at 06:22:15AM -0400, Kathy Lussier wrote:
> Hi all,
>
> We've had difficulty finding records in our catalog due to the
> automatic stemming that occurs when records are indexed in
> Evergreen. As an example, a title on one of our summer readings
> lists was "The Assist" by Neil Swidey. However, when users were
> searching for "the assist" as a title search with the phrase
> enclosed in quotations, they still had to page through several pages
> of results before finding the title they needed. Many of the records
> that ranked higher contained words like "assistance", "assistive",
> "assisted", etc. because they were automatically stemmed at
> indexing, and the stemmed version of the word (assist) was what was
> stored in the index vector column. We've had many other examples
> where this stemming has made it difficult to conduct searches.

This particular example is quite a concern! I haven't noticed anything
similar yet, since we moved to Evergreen 2.3ish last week, and nobody
has brought a similar problem to my attention, but it might just be
early days for us.

> In digging through IRC logs and other list messages regarding
> stemming, people have mentioned that this stemming can be turned off
> so that the full words are indexed rather than the stemmed versions
> of a word. Can anybody tell me how this is done? I understand that
> the records would need to be reingested, but is there a flag that
> needs to be disabled to turn off the stemming or does it require
> something else?

The simplest way to do this in a new Evergreen instance is to change
the configuration of the text search dictionary in
Open-ILS/src/sql/Pg/000.english.pg91.fts-config.sql - for example,
instead of using the snowball stemming algorithm as a basis for the
full text search, just use the "simple" dictionary, which returns the
lowercase version of the incoming text:

    CREATE TEXT SEARCH DICTIONARY english_nostop (TEMPLATE=pg_catalog.simple);

Note, however, that this is likely to cause other problems for
searchers; in the default "concerto" sample set of records, for
example, people will have to search for "concertos" to get matches for
"concertos"; "concerto" won't result in a match (and vice versa).

> Also, is there a way to use another dictionary for
> the stemmer so that the stemming is somewhat less aggressive than is
> used by the snowball stemmer? Overall, we like the concept of
> stemming, particularly when it retrieves results for both singular
> and plural versions of a word, but we've had many examples where
> stemming seems to be throwing users off course.

ispell support was added in the last few versions of PostgreSQL, which
might be worth exploring. I plan to dig into the current state of
PostgreSQL full-text search over the next few weeks, so the timing of
your question is quite good!
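
To see the trade-off between the snowball stemmer and the "simple"
dictionary before reingesting anything, something like the following
could be run in a scratch database. This is only a rough sketch: it
uses PostgreSQL's built-in english_stem and simple dictionaries, and
the names demo_nostem_dict and demo_nostem are made up for
illustration (they are not part of Evergreen, and this is not the
contents of the Evergreen fts-config file).

```sql
-- Compare the lexemes each built-in dictionary produces for the
-- words mentioned in this thread.
SELECT word,
       ts_lexize('english_stem', word) AS snowball_lexemes,
       ts_lexize('simple', word)       AS simple_lexemes
FROM unnest(ARRAY['assist', 'assistance', 'assistive', 'assisted',
                  'concerto', 'concertos']) AS word;

-- Hypothetical non-stemming configuration for experimenting.
-- demo_nostem_dict and demo_nostem are illustrative names only.
CREATE TEXT SEARCH DICTIONARY demo_nostem_dict (TEMPLATE = pg_catalog.simple);
CREATE TEXT SEARCH CONFIGURATION demo_nostem (COPY = pg_catalog.english);
ALTER TEXT SEARCH CONFIGURATION demo_nostem
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH demo_nostem_dict;

-- Does a search for "assistance" match the title "The Assist"
-- under each configuration?
SELECT to_tsvector('english', 'The Assist')
           @@ plainto_tsquery('english', 'assistance')     AS stemmed_match,
       to_tsvector('demo_nostem', 'The Assist')
           @@ plainto_tsquery('demo_nostem', 'assistance') AS unstemmed_match;

-- Clean up the scratch objects.
DROP TEXT SEARCH CONFIGURATION demo_nostem;
DROP TEXT SEARCH DICTIONARY demo_nostem_dict;
```

Running the same comparison with 'concerto' and 'concertos' should
show the downside described above: without stemming, singular and
plural forms no longer match each other.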
