Adding terms to posting lists is about the most space-efficient thing you can do in a search engine, so I would not worry too much about that.
wunder

On Jul 18, 2013, at 10:06 AM, Shai Erera wrote:

> Well, index time synonyms means you bloat the index with a lot of new postings,
> most of them are just duplicates of each other. And in my case, because for
> every synonym there's a weight, I cannot even consider postings
> deduplication...
>
> There's a tradeoff here (as usual). Both approaches have pros and cons. I
> think index time is better in the end because a larger index can be solved by
> throwing more hardware at it. But queries with thousands of terms are a real
> issue.
>
> One thing I can look at is if the synonym sets can be 'grouped' in a way that
> instead of all the terms I index a group ID or something and then during
> search I resolve a term to all the groups it may belong to... I'll need to
> think about it more.
>
> On Jul 18, 2013 7:49 PM, "Walter Underwood" <wun...@wunderwood.org> wrote:
> There are two serious issues with query-time synonyms, speed and correctness.
>
> 1. Expanding a term to 1000 synonyms at query time means 1000 term lookups.
> This will not be fast. Expanding the term at index time means 1000 posting
> list entries, but only one term lookup at query time.
>
> 2. Query time expansion will give higher scores to the more rare synonyms.
> This is almost never what you want. If I make "TV" and "television" synonyms,
> I want them both to score the same. But if TV is 10X more common than
> television, then documents with the rare term (television) will score better.
>
> wunder
>
> On Jul 18, 2013, at 5:54 AM, Shai Erera wrote:
>
>> The examples I've seen so far are single words. But I learned today
>> something new .. the number of "synonyms" returned for a word may be in the
>> range of hundreds, sometimes even thousands.
>> So I'm not sure query-time synonyms may work at all .. what do you think?
>>
>> Shai
>>
>>
>> On Thu, Jul 18, 2013 at 3:21 PM, Jack Krupansky <j...@basetechnology.com>
>> wrote:
>> Your best bet is to preprocess queries and expand synonyms in your own
>> application layer. The Lucene/Solr synonym implementation, design, and
>> architecture is fairly lightweight (although FST is a big improvement) and
>> not architected for large and dynamic synonym sets.
>>
>> Do you need multi-word phrase synonyms as well, or is this strictly
>> single-word synonyms?
>>
>> -- Jack Krupansky
>>
>> From: Shai Erera
>> Sent: Thursday, July 18, 2013 1:36 AM
>> To: dev@lucene.apache.org
>> Subject: Programmatic Synonyms Filter (Lucene and/or Solr)
>>
>> Hi
>>
>> I was asked to integrate with a system which provides synonyms for words
>> through API. I checked the existing synonym filters in Lucene and Solr and
>> they all seem to take a synonyms map up front.
>>
>> E.g. Lucene's SynonymFilter takes a SynonymMap which exposes an FST, so it's
>> not really programmatic in the sense that I can provide an impl which will
>> pull the synonyms through the other system's API.
>>
>> Solr SynonymFilterFactory just loads the synonyms from a file into a
>> SynonymMap, and then uses Lucene's SynonymFilter, so it doesn't look like I
>> can extend that one either.
>>
>> The problem is that the synonyms DB I should integrate with is HUGE and will
>> probably not fit in RAM (SynonymMap). Nor is it currently possible to pull
>> all available synonyms from it in one go. The API I have is something like
>> String[] getSynonyms(String word).
>>
>> So I have a few questions:
>>
>> 1) Did I miss a Filter which does take a programmatic syn-map which I can
>> provide my own impl to?
>>
>> 2) If not, would it make sense to modify SynonymMap to offer
>> getSynonyms(word) API (using BytesRef / CharsRef of course), with an
>> FSTSynonymMap default impl so that users can provide their own impl, e.g.
>> not requiring everything to be in RAM?
>>
>> 2.1) Side-effect benefit, I think, is that we won't require everyone to deal
>> with the FST API that way, though I'll admit I cannot think of many use cases
>> for not using SynonymFilter as-is ...
>>
>> 3) If the answer to (1) and (2) is NO, I guess my only option is to
>> implement my own SynonymFilter, copying most of the code from Lucene's ...
>> right?
>>
>> Shai
>>
>
> --
> Walter Underwood
> wun...@wunderwood.org
>
>

--
Walter Underwood
wun...@wunderwood.org
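
The scoring point in the thread (query-time expansion favoring the rarer synonym) can be checked with a little arithmetic. What follows is only a rough sketch, assuming a classic TF-IDF style formula, idf = 1 + ln(N / (df + 1)). The class name, the collection size, and the document frequencies are all made up for illustration, and the actual scores in a Lucene or Solr setup depend on the Similarity that is configured.

```java
/**
 * Sketch: why query-time synonym expansion scores the rarer synonym higher.
 * Assumes the classic formula idf = 1 + ln(N / (df + 1)).
 * All numbers below are hypothetical, not taken from a real index.
 */
final class SynonymIdfDemo {

    /** Classic inverse document frequency: rarer terms get a larger value. */
    static double idf(long numDocs, long docFreq) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        long numDocs = 1_000_000;     // hypothetical collection size
        long dfTv = 10_000;           // hypothetical: documents containing "TV"
        long dfTelevision = 1_000;    // hypothetical: "television" is 10X rarer

        // Query-time expansion: each synonym is looked up and scored
        // with its own document frequency, so the two weights differ.
        System.out.printf("query-time: idf(TV)=%.3f idf(television)=%.3f%n",
                idf(numDocs, dfTv), idf(numDocs, dfTelevision));

        // Index-time expansion: every document containing either term is
        // indexed under both, so both terms end up with the same document
        // frequency. The sum is an upper bound that assumes no overlap.
        long dfExpanded = dfTv + dfTelevision;
        System.out.printf("index-time: idf(TV)=idf(television)=%.3f%n",
                idf(numDocs, dfExpanded));
    }
}
```

With query-time expansion the rarer term gets the larger weight. With index-time expansion both terms share one document frequency, so they score the same, which is the behavior asked for in the "TV" / "television" example.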