Hi everybody,

 

My name is Victor van der Wolf, and I recently started working for the Royal 
Library in the Netherlands. One of my first assignments here was to see whether 
I could implement a stemming algorithm for our websites. Our search engine is 
Solr/Lucene 3.4.

 

Basically I had 2 requirements to work with:

 

1) It should be possible to switch the stemming functionality on and off 
in the front end.

2) No extra storage should be required (no extra indexing).

 

I soon came to the conclusion that it would be practical to use the 
SynonymFilter for this. I got hold of a Dutch library and used a stemming 
algorithm to generate a synonym file from it.
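
To illustrate the idea, here is a minimal sketch of how such a synonym file could be generated. It is not the code I actually used: the class name, the method names and the toy stemmer (which only strips a trailing "en") are made up for this example, and a real Dutch stemmer would take its place. Words that share a stem end up on one comma-separated line, which is the format Solr's synonyms file accepts for equivalent terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical helper, for illustration only.
class StemSynonymSketch {

    // Toy stand-in for a real Dutch stemmer: strips a trailing "en" from longer words.
    static String toyStem(String word) {
        return word.length() > 4 && word.endsWith("en")
                ? word.substring(0, word.length() - 2)
                : word;
    }

    // Groups words by stem and returns one "a, b, c" line per group of 2 or more words.
    static List<String> synonymLines(Collection<String> words,
                                     Function<String, String> stemmer) {
        Map<String, SortedSet<String>> groups = new TreeMap<>();
        for (String w : words) {
            String word = w.toLowerCase(Locale.ROOT);
            groups.computeIfAbsent(stemmer.apply(word), k -> new TreeSet<>()).add(word);
        }
        List<String> lines = new ArrayList<>();
        for (SortedSet<String> group : groups.values()) {
            if (group.size() > 1) {          // a single word has no synonyms to list
                lines.add(String.join(", ", group));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("fiets", "fietsen", "boek", "boeken", "huis");
        for (String line : synonymLines(words, StemSynonymSketch::toyStem)) {
            System.out.println(line);
        }
        // prints:
        // boek, boeken
        // fiets, fietsen
    }
}
```

Because the stemming is done once, when the file is generated, nothing extra has to be indexed; the synonym lookup happens only at query time.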

 

Then I thought that I could maybe define 2 different query analyzers under the 
"field type" and then call one or the other depending on whether I want 
stemming or not, like this: q=<field>:<analyzer>:<search term>. Unfortunately 
this did not seem to be possible.

 

Then, after some discussions with Erick Erickson, it became clear that a good 
approach could be to write my own SynonymFilter and apply some kind of token 
marking to decide whether a given token should be "synonymized" or not. Well, I 
did just that and it works like a charm.

 

I would like to contribute this MarkedSynonymFilter class to the project.

 

I used the SynonymFilter class as a starting point and added some extra 
functionality to it. First of all, I added 3 new parameters called lookup, 
preMark and postmark. The preMark and postmark parameters contain a prefix and 
a suffix that mark a token, and a simple regex checks whether a token carries 
them. The lookup parameter then determines which tokens the 
MarkedSynonymFilter will "synonymize":

 

lookup=marked --> marked tokens will be synonymized

lookup=unmarked --> unmarked tokens will be synonymized

lookup=all --> all tokens should be synonymized

lookup=none --> none of the tokens should be synonymized
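
The decision logic behind these four modes can be sketched roughly as follows. This is a standalone illustration and not the actual MarkedSynonymFilter code: it does not depend on Lucene, and the class, enum and method names are made up for the example. I also assume here that the marks are literal strings (they are quoted before going into the regex) and that they are stripped from the token before the synonym lookup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the marking decision, independent of Lucene.
class MarkedLookupSketch {

    enum Lookup { MARKED, UNMARKED, ALL, NONE }

    private final Pattern markedPattern;
    private final Lookup lookup;

    MarkedLookupSketch(String preMark, String postMark, Lookup lookup) {
        // A token counts as marked when it starts with preMark and ends with postMark.
        this.markedPattern = Pattern.compile(
                "^" + Pattern.quote(preMark) + "(.*)" + Pattern.quote(postMark) + "$",
                Pattern.DOTALL);
        this.lookup = lookup;
    }

    boolean isMarked(String token) {
        return markedPattern.matcher(token).matches();
    }

    // Returns the token without its marks, or the token itself if it is unmarked.
    String stripMarks(String token) {
        Matcher m = markedPattern.matcher(token);
        return m.matches() ? m.group(1) : token;
    }

    // Decides whether the synonym lookup should be applied to this token.
    boolean shouldSynonymize(String token) {
        switch (lookup) {
            case MARKED:   return isMarked(token);
            case UNMARKED: return !isMarked(token);
            case ALL:      return true;
            default:       return false;   // NONE
        }
    }

    public static void main(String[] args) {
        MarkedLookupSketch filter = new MarkedLookupSketch("[", "]", Lookup.MARKED);
        for (String token : new String[] {"[fietsen]", "boeken"}) {
            System.out.println(filter.stripMarks(token) + " -> " + filter.shouldSynonymize(token));
        }
        // prints:
        // fietsen -> true
        // boeken -> false
    }
}
```

With lookup=marked, the front end only has to wrap a search term in the marks to switch stemming on for that term, which covers the first requirement.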

 

I started out writing this based on version 3.3; later I discovered that we 
were using 3.4 and I had to upgrade it. Unfortunately the whole SynonymFilter 
code has been revised, and for the moment there is a Slow and a Fast synonym 
filter, of which the Slow one is deprecated. My addition is based on the Slow 
version, I am afraid ...

 

Anyway, I am curious about your comments. Please let me know if I should go 
forward with this and create a JIRA issue with my code attached as a patch.

 

Cheers,

Victor van der Wolf

 
