RE: Programmatic Synonyms Filter (Lucene and/or Solr)

Uwe Schindler Thu, 18 Jul 2013 10:18:42 -0700

I think Jack is implicitly referring to Solr. In the case of a pure Lucene 
application without Solr, or a custom query parser plugged into Solr that does 
the query-time expansion, the limit is not the URL length (which only applies 
to Solr, as the query is part of the URL), but rather the fact that Lucene 
refuses to run with more than 1024 BooleanQuery (BQ) clauses by default :-)
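To make the clause limit concrete, here is a minimal sketch of capping an expansion before it is turned into a query, assuming each term or synonym becomes one clause. ClauseBudget and capExpansion are hypothetical names, not Lucene or Solr API, and the code deliberately makes no Lucene calls; the constant only mirrors the 1024 default mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Lucene or Solr. */
final class ClauseBudget {
    /** Mirrors the default BooleanQuery clause limit mentioned above. */
    static final int DEFAULT_MAX_CLAUSES = 1024;

    private ClauseBudget() {}

    /**
     * Returns the original term followed by as many synonyms as fit,
     * so the result never has more than maxClauses entries.
     */
    static List<String> capExpansion(String term, List<String> synonyms, int maxClauses) {
        if (maxClauses < 1) {
            throw new IllegalArgumentException("maxClauses must be >= 1");
        }
        List<String> clauses = new ArrayList<>();
        clauses.add(term); // the original term always takes the first slot
        for (String synonym : synonyms) {
            if (clauses.size() >= maxClauses) {
                break; // budget exhausted; remaining synonyms are dropped
            }
            clauses.add(synonym);
        }
        return clauses;
    }
}
```

Dropping the tail is only one possible policy; which synonyms to keep would depend on whatever ranking the synonym source offers.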


 

Uwe

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: u...@thetaphi.de

 

From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Thursday, July 18, 2013 4:05 PM
To: dev@lucene.apache.org
Subject: Re: Programmatic Synonyms Filter (Lucene and/or Solr)

 

Container (e.g., Tomcat) limit. Configurable. I don’t recall the specifics.


-- Jack Krupansky

 

From: Shai Erera <mailto:ser...@gmail.com>  

Sent: Thursday, July 18, 2013 9:46 AM

To: dev@lucene.apache.org 

Subject: Re: Programmatic Synonyms Filter (Lucene and/or Solr)

 

Actually, after chatting w/ Mike about it, he made a good point against making 
SynonymMap expose an API like lookup(word): that doesn't work with multi-word 
synonyms (e.g. "wi fi" -> "wifi"). So I no longer think we should change 
SynonymFilter. Since in my case it's 1:1 (as far as I've learned so far), I 
should write my own TokenFilter.
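For the 1:1 case, the core of such a filter can be sketched in plain Java, without Lucene on the classpath. This is only an illustration under assumptions: SingleWordSynonymExpander and its Token class are hypothetical, and the lookup function stands in for the external String[] getSynonyms(String word) API. A real TokenFilter would do the same work inside incrementToken(), setting the position increment through the attribute API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch of 1:1 synonym injection; not a Lucene class. */
final class SingleWordSynonymExpander {

    /** A term plus its position increment (0 = same position as the previous token). */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    private SingleWordSynonymExpander() {}

    /**
     * Emits every input term with increment 1, followed by its synonyms
     * with increment 0 so they stack on the same position.
     */
    static List<Token> expand(List<String> terms, Function<String, String[]> lookup) {
        List<Token> out = new ArrayList<>();
        for (String term : terms) {
            out.add(new Token(term, 1));
            String[] synonyms = lookup.apply(term);
            if (synonyms == null) {
                continue; // treat null as "no synonyms"
            }
            for (String synonym : synonyms) {
                out.add(new Token(synonym, 0));
            }
        }
        return out;
    }
}
```

Because every synonym is a single token stacked on the original position, no lookahead or buffering across tokens is needed, which is what makes the 1:1 case so much simpler than the multi-word one.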

So now the question is whether to do it at indexing time or search time. Each 
has pros and cons. I'll need to learn more about the DB first, e.g. how many 
words have only tens of synonyms and how many have thousands. I suspect there's 
no single solution here, so I will need to experiment with both.

Jack, I didn't quite follow the 2048 common limit -- is it a Solr limit of some 
sort? If so, can you please elaborate?

Shai

 

On Thu, Jul 18, 2013 at 4:12 PM, Jack Krupansky <j...@basetechnology.com> wrote:

Maybe a custom search component would be in order, to “enrich” the incoming 
query. Again, preprocessing the query for synonym expansion before Solr parses 
it. It could call the external synonym API and cache synonyms as well.
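The caching part might look roughly like the sketch below: a small LRU cache in front of the external lookup, built on java.util.LinkedHashMap in access order. CachingSynonymLookup is a hypothetical name, and the backend function is an assumption standing in for the external synonym API; nothing here is Solr API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical LRU cache in front of an external synonym lookup. */
final class CachingSynonymLookup {
    private final Function<String, String[]> backend;
    private final Map<String, String[]> cache;

    CachingSynonymLookup(Function<String, String[]> backend, final int maxEntries) {
        this.backend = backend;
        // accessOrder = true makes iteration order least-recently-used first
        this.cache = new LinkedHashMap<String, String[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String[]> eldest) {
                return size() > maxEntries; // evict once over capacity
            }
        };
    }

    /** Returns cached synonyms, calling the backend only on a miss. */
    synchronized String[] getSynonyms(String word) {
        String[] synonyms = cache.get(word);
        if (synonyms == null) {
            synonyms = backend.apply(word);
            if (synonyms != null) {
                cache.put(word, synonyms);
            }
        }
        return synonyms;
    }
}
```

The single lock keeps the sketch short; a component serving many concurrent queries would probably want something less coarse, and some expiry policy if the synonym DB changes over time.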

 

But I’d still lean towards preprocessing in an application layer, although 
for hundreds or thousands of synonyms it would probably hit the common 
2048-character URL limit in some containers, which would need to be raised.
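A rough way to see when an expanded query would cross that limit is to measure the URL-encoded length before sending it. The sketch below assumes the expansion is sent as a single OR query in the URL; UrlBudget, the OR syntax, and the base URL used with it are illustrative assumptions, and the URLEncoder overload taking a Charset needs Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical helper estimating the URL length of an expanded query. */
final class UrlBudget {
    /** The common container limit mentioned above. */
    static final int COMMON_URL_LIMIT = 2048;

    private UrlBudget() {}

    /** Joins the term and its synonyms into "term OR syn1 OR syn2 ...". */
    static String orQuery(String term, List<String> synonyms) {
        StringBuilder sb = new StringBuilder(term);
        for (String synonym : synonyms) {
            sb.append(" OR ").append(synonym);
        }
        return sb.toString();
    }

    /** Length of the base URL plus the URL-encoded query string. */
    static int encodedLength(String baseUrl, String query) {
        return baseUrl.length() + URLEncoder.encode(query, StandardCharsets.UTF_8).length();
    }

    /** True if the full URL stays within the given limit. */
    static boolean fits(String baseUrl, String query, int limit) {
        return encodedLength(baseUrl, query) <= limit;
    }
}
```

With a few hundred synonyms per word the check fails quickly, since every extra synonym costs at least the encoded " OR " separator plus the word itself.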


-- Jack Krupansky

 

From: Shai Erera <mailto:ser...@gmail.com>  

Sent: Thursday, July 18, 2013 8:54 AM

To: dev@lucene.apache.org 

Subject: Re: Programmatic Synonyms Filter (Lucene and/or Solr)

 

The examples I've seen so far are single words. But I learned something new 
today: the number of "synonyms" returned for a word may be in the range of 
hundreds, sometimes even thousands.

So I'm not sure query-time synonyms will work at all. What do you think?

Shai

 

On Thu, Jul 18, 2013 at 3:21 PM, Jack Krupansky <j...@basetechnology.com> wrote:

Your best bet is to preprocess queries and expand synonyms in your own 
application layer. The Lucene/Solr synonym implementation, design, and 
architecture are fairly lightweight (although FST is a big improvement) and not 
architected for large and dynamic synonym sets.

 

Do you need multi-word phrase synonyms as well, or is this strictly single-word 
synonyms?


-- Jack Krupansky

 

From: Shai Erera <mailto:ser...@gmail.com>  

Sent: Thursday, July 18, 2013 1:36 AM

To: dev@lucene.apache.org 

Subject: Programmatic Synonyms Filter (Lucene and/or Solr)

 

Hi

I was asked to integrate with a system which provides synonyms for words 
through an API. I checked the existing synonym filters in Lucene and Solr and 
they all seem to take a synonyms map up front.

E.g. Lucene's SynonymFilter takes a SynonymMap which exposes an FST, so it's 
not really programmatic in the sense that I can provide an impl which will pull 
the synonyms through the other system's API.

Solr SynonymFilterFactory just loads the synonyms from a file into a 
SynonymMap, and then uses Lucene's SynonymFilter, so it doesn't look like I can 
extend that one either.

The problem is that the synonyms DB I should integrate with is HUGE and will 
probably not fit in RAM (SynonymMap). Nor is it currently possible to pull all 
available synonyms from it in one go. The API I have is something like String[] 
getSynonyms(String word).

So I have a few questions:

1) Did I miss a Filter which does take a programmatic syn-map which I can 
provide my own impl to?

2) If not, would it make sense to modify SynonymMap to offer a 
getSynonyms(word) API (using BytesRef / CharsRef of course), with an 
FSTSynonymMap default impl, so that users can provide their own impl, e.g. not 
requiring everything to be in RAM?

2.1) A side-effect benefit, I think, is that we won't require everyone to deal 
with the FST API that way, though I'll admit I cannot think of many use cases 
for not using SynonymFilter as-is ...

 

3) If the answer to (1) and (2) is NO, I guess my only option is to implement 
my own SynonymFilter, copying most of the code from Lucene's ... right?
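To illustrate the shape of the API asked about in (2), here is a minimal sketch of a pluggable lookup with an in-memory default. SynonymProvider and InMemorySynonymProvider are hypothetical names; the sketch uses String instead of BytesRef / CharsRef and a HashMap where the proposal has an FST-backed default, purely to stay self-contained.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical pluggable synonym lookup; not an existing Lucene interface. */
interface SynonymProvider {
    /** Returns the synonyms for a word, or an empty array if there are none. */
    String[] getSynonyms(String word);
}

/** Default impl holding everything in RAM, standing in for an FST-backed map. */
final class InMemorySynonymProvider implements SynonymProvider {
    private final Map<String, String[]> synonyms;

    InMemorySynonymProvider(Map<String, String[]> synonyms) {
        this.synonyms = new HashMap<>(synonyms); // defensive copy of the map
    }

    @Override
    public String[] getSynonyms(String word) {
        String[] found = synonyms.get(word);
        return found == null ? new String[0] : found.clone();
    }
}
```

A caller with a remote synonym service could then supply its own implementation, for example a lambda that wraps the external getSynonyms call, without anything having to fit in RAM. As written this only covers single-word lookups.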

Shai
