On Mon, 11 Nov 2002, Spencer, Dave wrote: > Re "reducing the set of question/answer pair to consider" below - I > would expect that using synonyms either in the index or in the > reformed query would (annoyingly) increase the number of potential > matches or is there something I'm missing.
Generally, you're right. More formally, in the information retrieval community, 'recall' is defined roughly as n/r, and 'precision' as n/p, where * n is the # of relevant articles returned in response to the query * r is the total # of articles relevant to the query * p is the total # of articles returned by the query. (These are analogous to "completeness" and "correctness" of formal systems.) So things that tend to increase recall tend to decrease precision, and vice versa. Not coincidentally, one area of my research is in investigating methods that increase recall (via query expansion) but do not significantly adversely affect precision (by assigning weights to terms added to the query according to their aggregated similarity to the query terms). No conclusive results yet, I'm afraid. > Interesting that this topic just came up as I wanted to experiment > w/ the same thing. My first stab at an public domain synonym > list, the "moby" list, didn't seem to have synonyms however. > I believe another poster mentioned WordNet so I'll try that. > > I'd really like it if it was possibly to automatically determine > synonyms - maybe something similar to Latent Semantic Analysis - but > such things seem kinda hard to code up... LSA does have some advantages, but it has problems as well (e.g., last I checked, it was rather computationally expensive). There are other mechanisms for determining synonym-like relationships, however, such as measuring term-term correlations in the corpus. Something you have to be careful of in this context is in assuming that synonyms are symmetric: the connection of 'hot' to 'radioactive' (or 'spicy', or 'attractive', or ...) is not nearly as strong as the connections going the other direction. You also can get problems with homonyms like 'minute' (time period) and 'minute' (very small); clearly these two demand different classes of related terms. [EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall It's that moment of dawning comprehension that I live for--Bill Watterson My opinions are too rational and insightful to be those of any organization. -- To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@;jakarta.apache.org> For additional commands, e-mail: <mailto:lucene-user-help@;jakarta.apache.org>
