[
https://issues.apache.org/jira/browse/SOLR-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Smiley updated SOLR-2866:
-------------------------------
Fix Version/s: 4.8 (was: 4.7)
> Marked synonym filter for selective token expansion
> ---------------------------------------------------
>
> Key: SOLR-2866
> URL: https://issues.apache.org/jira/browse/SOLR-2866
> Project: Solr
> Issue Type: Improvement
> Components: Schema and Analysis
> Environment: Solr 3.4
> Reporter: Victor van der Wolf
> Priority: Minor
> Labels: stemming, synonyms
> Fix For: 4.8
>
> Attachments: MarkedSynonymFilterFactory.java,
> SlowMarkedSynonymFilter.java, SlowMarkedSynonymFilterFactory.java
>
>
> Hi everybody,
> My name is Victor van der Wolf, and I recently started working for the Royal
> Library in the Netherlands. One of my first assignments here was to see
> whether I could implement a stemming algorithm for our websites. Our search
> engine is Solr/Lucene 3.4.
> Basically I had 2 requirements to work with:
> 1) It should be possible to switch the stemming functionality on and
> off in the front end
> 2) No extra storage should be required (no extra indexing).
> I soon came to the conclusion that it would be practical to use the
> SynonymFilter to do that. I got hold of a Dutch library and used a stemming
> algorithm to generate a synonym file from it.
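
A minimal sketch of how such a synonym file could be generated, assuming Java 8
or later. The class name, the method names and the toy stemmer are all
hypothetical; the attachments on this issue do not include the generator, and a
real Dutch stemmer would replace toyStem. Words that share a stem are written
as one comma-separated line, which is the format the synonym filter reads for
equivalent terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Function;

final class StemSynonymFile {

    // Groups words that share a stem and returns one comma-separated line
    // per group of two or more words. Groups are ordered by stem.
    static List<String> buildSynonymLines(Collection<String> words,
                                          Function<String, String> stemmer) {
        Map<String, SortedSet<String>> byStem = new TreeMap<>();
        for (String word : words) {
            String w = word.toLowerCase(Locale.ROOT);
            byStem.computeIfAbsent(stemmer.apply(w), k -> new TreeSet<>()).add(w);
        }
        List<String> lines = new ArrayList<>();
        for (SortedSet<String> group : byStem.values()) {
            if (group.size() > 1) {
                lines.add(String.join(", ", group));
            }
        }
        return lines;
    }

    // Hypothetical toy stemmer: strips a trailing "en".
    // This is NOT a real Dutch stemming algorithm.
    static String toyStem(String word) {
        return word.length() > 3 && word.endsWith("en")
                ? word.substring(0, word.length() - 2)
                : word;
    }

    public static void main(String[] args) {
        List<String> words =
                Arrays.asList("boek", "boeken", "fiets", "fietsen", "huis");
        for (String line : buildSynonymLines(words, StemSynonymFile::toyStem)) {
            System.out.println(line);
        }
    }
}
```

With the toy stemmer, the five sample words yield two lines, "boek, boeken" and
"fiets, fietsen"; "huis" has no partner and is left out.
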
> Then I thought that I could maybe use 2 different query analyzers under the
> "field type" and then call one or the other depending on whether I want
> stemming or not, like this: q=<field>:<analyzer>:<search term>. Unfortunately
> this did not seem possible.
> Then, after some discussions with Erick Erickson, it became clear that a good
> approach could be to write my own SynonymFilter and apply some kind of token
> marking to decide whether a token should be "synonymized" or not. Well, I did
> just that and it works like a charm.
> I would like to contribute this MarkedSynonymFilter class to the project.
> I used the SynonymFilter class as a starting point and added some extra
> functionality to it. First of all, I added 3 new parameters called lookup,
> preMark and postmark. The preMark and postmark parameters contain a prefix
> and a suffix that mark a token; a simple regex checks whether a token carries
> them. The lookup parameter then determines which tokens the
> MarkedSynonymFilter will "synonymize":
> lookup=marked - marked tokens will be synonymized
> lookup=unmarked - unmarked tokens will be synonymized
> lookup=all - all tokens should be synonymized
> lookup=none - none of the tokens should be synonymized
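
The four lookup modes above can be sketched as a small decision helper. This is
only an illustration, not the code from the attached files: the class
MarkedLookup, its methods and the enum are hypothetical names, and stripping
the marks before the synonym lookup is an assumption about how the filter
would treat a marked token.

```java
import java.util.regex.Pattern;

final class MarkedLookup {

    // Mirrors the four values of the lookup parameter described above.
    enum Lookup { MARKED, UNMARKED, ALL, NONE }

    private final Pattern markedPattern;
    private final Lookup lookup;
    private final int preLen;
    private final int postLen;

    MarkedLookup(String preMark, String postmark, Lookup lookup) {
        // Pattern.quote keeps marks such as "[" or "*" literal in the regex.
        this.markedPattern = Pattern.compile(
                Pattern.quote(preMark) + ".+" + Pattern.quote(postmark),
                Pattern.DOTALL);
        this.lookup = lookup;
        this.preLen = preMark.length();
        this.postLen = postmark.length();
    }

    // A token is marked when it is wrapped in preMark ... postmark.
    boolean isMarked(String token) {
        return markedPattern.matcher(token).matches();
    }

    // Decides whether the token should go through synonym expansion.
    boolean shouldExpand(String token) {
        switch (lookup) {
            case ALL:
                return true;
            case NONE:
                return false;
            case MARKED:
                return isMarked(token);
            default: // UNMARKED
                return !isMarked(token);
        }
    }

    // Assumption: marks are removed before the term is looked up.
    String strip(String token) {
        return isMarked(token)
                ? token.substring(preLen, token.length() - postLen)
                : token;
    }

    public static void main(String[] args) {
        MarkedLookup m = new MarkedLookup("[", "]", Lookup.MARKED);
        for (String token : new String[] {"[boek]", "fiets"}) {
            System.out.println(m.strip(token) + " expand=" + m.shouldExpand(token));
        }
    }
}
```

With preMark "[" and postmark "]" and lookup=marked, the token "[boek]" is
selected for expansion and "fiets" is passed through unchanged.
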
> I started out writing this based on version 3.3; later I discovered that we
> were using 3.4 and I had to upgrade it. Unfortunately the whole SynonymFilter
> code has been revised, and for the moment there are the Slow and the Fast
> synonym filters, where the Slow one is deprecated. My addition is based on
> the slow version, I am afraid.
> Anyway, I am curious about your comments. Please let me know if I should go
> forward with this and create a JIRA issue + my code as a patch.
> Cheers,
> Victor van der Wolf