Re: Programmatic Synonyms Filter (Lucene and/or Solr)

Walter Underwood Thu, 18 Jul 2013 10:14:36 -0700

Adding terms to posting lists is about the most space-efficient thing you can 
do in a search engine, so I would not worry too much about that.
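
One reason extra postings tend to be cheap: a posting list holds sorted document IDs, and these compress well when stored as gaps between consecutive IDs. Below is a minimal sketch, assuming plain delta plus variable-byte encoding. This is a classic textbook technique and not Lucene's actual postings codec, and the class and method names are made up for illustration. Per-posting weights, as in Shai's case, would add to the size shown here.

```java
import java.io.ByteArrayOutputStream;

// Illustrative only: delta + variable-byte encoding of a sorted doc-ID list.
// Hypothetical class; this is NOT Lucene's postings format.
final class PostingsSizeSketch {

    // Write a non-negative int using 7 data bits per byte.
    // The high bit of each byte signals that more bytes follow.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Encode increasing doc IDs as the gaps between consecutive IDs.
    static byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int docId : sortedDocIds) {
            writeVInt(out, docId - previous);
            previous = docId;
        }
        return out.toByteArray();
    }

    // Decode `count` gaps back into absolute doc IDs.
    static int[] decode(byte[] bytes, int count) {
        int[] docIds = new int[count];
        int pos = 0;
        int previous = 0;
        for (int i = 0; i < count; i++) {
            int value = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                value |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += value;
            docIds[i] = previous;
        }
        return docIds;
    }

    public static void main(String[] args) {
        // Hypothetical synonym term that occurs in every 3rd document.
        int[] docIds = new int[10_000];
        for (int i = 0; i < docIds.length; i++) {
            docIds[i] = i * 3;
        }
        byte[] encoded = encode(docIds);
        System.out.println(docIds.length * 4 + " raw bytes vs "
                + encoded.length + " encoded bytes");
    }
}
```

In this made-up example every gap fits in one byte, so 10,000 postings take 10,000 bytes instead of the 40,000 needed for raw 4-byte integers. The more documents a term appears in, the smaller the gaps become.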


wunder

On Jul 18, 2013, at 10:06 AM, Shai Erera wrote:

> Index-time synonyms mean you bloat the index with a lot of new postings, 
> most of which are just duplicates of each other. And in my case, because 
> every synonym has a weight, I cannot even consider postings 
> deduplication...
> 
> There's a tradeoff here (as usual). Both approaches have pros and cons. I 
> think index time is better in the end because a larger index can be handled 
> by throwing more hardware at it. But queries with thousands of terms are a 
> real issue.
> 
> One thing I can look at is whether the synonym sets can be 'grouped' so that, 
> instead of all the terms, I index a group ID or something, and then during 
> search I resolve a term to all the groups it may belong to... I'll need to 
> think about it more.
> 
> On Jul 18, 2013 7:49 PM, "Walter Underwood" <wun...@wunderwood.org> wrote:
> There are two serious issues with query-time synonyms, speed and correctness.
> 
> 1. Expanding a term to 1000 synonyms at query time means 1000 term lookups. 
> This will not be fast. Expanding the term at index time means 1000 posting 
> list entries, but only one term lookup at query time.
> 
> 2. Query-time expansion will give higher scores to the rarer synonyms, 
> because a rare term gets a higher IDF weight. This is almost never what you 
> want. If I make "TV" and "television" synonyms, I want them both to score 
> the same. But if TV is 10X more common than television, then documents with 
> the rare term (television) will score better.
> 
> wunder
> 
> On Jul 18, 2013, at 5:54 AM, Shai Erera wrote:
> 
>> The examples I've seen so far are single words. But I learned something 
>> new today: the number of "synonyms" returned for a word may be in the 
>> range of hundreds, sometimes even thousands.
>> So I'm not sure query-time synonyms will work at all. What do you think?
>> 
>> Shai
>> 
>> 
>> On Thu, Jul 18, 2013 at 3:21 PM, Jack Krupansky <j...@basetechnology.com> 
>> wrote:
>> Your best bet is to preprocess queries and expand synonyms in your own 
>> application layer. The Lucene/Solr synonym implementation, design, and 
>> architecture are fairly lightweight (although FST is a big improvement) and 
>> not architected for large and dynamic synonym sets.
>>  
>> Do you need multi-word phrase synonyms as well, or is this strictly 
>> single-word synonyms?
>> 
>> -- Jack Krupansky
>>  
>> From: Shai Erera
>> Sent: Thursday, July 18, 2013 1:36 AM
>> To: dev@lucene.apache.org
>> Subject: Programmatic Synonyms Filter (Lucene and/or Solr)
>>  
>> Hi
>> 
>> I was asked to integrate with a system which provides synonyms for words 
>> through an API. I checked the existing synonym filters in Lucene and Solr and 
>> they all seem to take a synonyms map up front. 
>> 
>> E.g. Lucene's SynonymFilter takes a SynonymMap which exposes an FST, so it's 
>> not really programmatic in the sense that I can provide an impl which will 
>> pull the synonyms through the other system's API.
>> 
>> Solr SynonymFilterFactory just loads the synonyms from a file into a 
>> SynonymMap, and then uses Lucene's SynonymFilter, so it doesn't look like I 
>> can extend that one either.
>> 
>> The problem is that the synonyms DB I should integrate with is HUGE and will 
>> probably not fit in RAM (SynonymMap). Nor is it currently possible to pull 
>> all available synonyms from it in one go. The API I have is something like 
>> String[] getSynonyms(String word).
>> 
>> So I have a few questions:
>> 
>> 1) Did I miss a Filter which does take a programmatic syn-map which I can 
>> provide my own impl to?
>> 
>> 2) If not, would it make sense to modify SynonymMap to offer a 
>> getSynonyms(word) API (using BytesRef / CharsRef of course), with an 
>> FSTSynonymMap default impl so that users can provide their own impl, e.g. 
>> not requiring everything to be in RAM?
>> 
>> 2.1) Side-effect benefit, I think, is that we won't require everyone to deal 
>> with the FST API that way, though I'll admit I cannot think of many use cases 
>> for not using SynonymFilter as-is ...
>>  
>> 3) If the answer to (1) and (2) is NO, I guess my only option is to 
>> implement my own SynonymFilter, copying most of the code from Lucene's ... 
>> right?
>> 
>> Shai
>> 
> 
> --
> Walter Underwood
> wun...@wunderwood.org
> 
> 
> 

--
Walter Underwood
wun...@wunderwood.org
