[
https://issues.apache.org/jira/browse/SOLR-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201002#comment-13201002
]
Mike commented on SOLR-2866:
----------------------------
Hi. FYI, I've created a new issue, SOLR-3099, that is requesting that this
feature be supported in the index and the edismax parser. I don't *think* the
overlap is huge, but that seemed like a better approach to me, so I've created
a branch of the conversation over there.
> Marked synonym filter for selective token expansion
> ---------------------------------------------------
>
> Key: SOLR-2866
> URL: https://issues.apache.org/jira/browse/SOLR-2866
> Project: Solr
> Issue Type: Improvement
> Components: Schema and Analysis
> Environment: Solr 3.4
> Reporter: Victor van der Wolf
> Priority: Minor
> Labels: stemming, synonyms
> Fix For: 3.6
>
> Attachments: MarkedSynonymFilterFactory.java,
> SlowMarkedSynonymFilter.java, SlowMarkedSynonymFilterFactory.java
>
>
> Hi everybody,
> My name is Victor van der Wolf and I recently started working for the Royal
> Library in the Netherlands. One of my first assignments here was to see if I
> could implement a stemming algorithm for our websites. Our search engine
> is Solr/Lucene 3.4.
> Basically I had 2 requirements to work with:
> 1) It should be possible to switch the stemming functionality on and
> off in the front end
> 2) No extra storage should be required (no extra indexing).
> I soon came to the conclusion that it would be practical to use the
> SynonymFilter to do that. I got hold of a Dutch library and used a stemming
> algorithm to generate a synonym file from it.
> Then I thought that I could maybe use 2 different query analyzers under the
> "field type" and then call one or the other depending on whether I want
> stemming or not, like this: q=<field>:<analyzer>:<search term>. Unfortunately
> this did not seem possible.
> Then, after some discussions with Erick Erickson, it became clear that a good
> approach could be to write my own SynonymFilter and apply some kind of token
> marking to decide whether that token should be "synonymized" or not. Well, I
> did just that and it works like a charm.
> I would like to contribute this MarkedSynonymFilter class to the project.
> I used the SynonymFilter class as a starting point and added some extra
> functionality to that. First of all, I added 3 new parameters called lookup,
> preMark and postmark. The preMark and postmark parameters contain a prefix
> and a suffix used to recognize whether a token should be "synonymized" or
> not. A simple regex is used to determine this. Then the lookup parameter
> determines the behaviour of the MarkedSynonymFilter:
> lookup=marked - marked tokens will be synonymized
> lookup=unmarked - unmarked tokens will be synonymized
> lookup=all - all tokens should be synonymized
> lookup=none - none of the tokens should be synonymized
> I started out writing this based on version 3.3; later I discovered that we
> were using 3.4 and I had to upgrade it. Unfortunately the whole SynonymFilter
> code has been revised, and for the moment there are the Slow and the Fast
> synonym filters, where the Slow one is deprecated. My addition is based on
> the slow version, I am afraid.
> Anyway, I am curious about your comments. Please let me know if I should go
> forward with this and create a JIRA issue + my code as a patch.
> Cheers,
> Victor van der Wolf
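
For readers who have not opened the attachments, the four lookup modes in the quoted description can be illustrated with a small stand-alone sketch. This is not the attached MarkedSynonymFilter and does not use the Lucene TokenFilter API. The class name MarkedLookupSketch, the bracket marks, the in-memory synonym map, the sample Dutch word pair, and the choice to strip the marks from the emitted term are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: a plain-Java model of the lookup modes, not the attached filter. */
final class MarkedLookupSketch {

    /** The four lookup modes named in the issue description. */
    enum Lookup { MARKED, UNMARKED, ALL, NONE }

    private final Lookup lookup;
    private final Pattern marked;                     // preMark + term + postMark
    private final Map<String, List<String>> synonyms; // hypothetical in-memory synonym table

    MarkedLookupSketch(Lookup lookup, String preMark, String postMark,
                       Map<String, List<String>> synonyms) {
        this.lookup = lookup;
        // Quote the marks so characters such as '[' or '*' are matched literally.
        this.marked = Pattern.compile(
                Pattern.quote(preMark) + "(.+)" + Pattern.quote(postMark));
        this.synonyms = synonyms;
    }

    /** Returns the term without its marks, followed by its synonyms if the mode allows expansion. */
    List<String> expand(String token) {
        Matcher m = marked.matcher(token);
        boolean isMarked = m.matches();
        // Assumption: marks are stripped before the synonym lookup.
        String term = isMarked ? m.group(1) : token;

        boolean expand;
        switch (lookup) {
            case MARKED:   expand = isMarked;  break;
            case UNMARKED: expand = !isMarked; break;
            case ALL:      expand = true;      break;
            default:       expand = false;     break; // NONE
        }

        List<String> out = new ArrayList<>();
        out.add(term);
        if (expand) {
            out.addAll(synonyms.getOrDefault(term, List.of()));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical stem pair standing in for a generated synonym file.
        Map<String, List<String>> syn = Map.of("fietsen", List.of("fiets"));
        MarkedLookupSketch filter =
                new MarkedLookupSketch(Lookup.MARKED, "[", "]", syn);
        System.out.println(filter.expand("[fietsen]")); // marked token
        System.out.println(filter.expand("fietsen"));   // unmarked token
    }
}
```

With lookup=marked, a front end could switch stemming on for a query by wrapping the search terms in the marks, and leave them unwrapped to search without stemming, which would match the two requirements listed in the description.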