[ 
https://issues.apache.org/jira/browse/SOLR-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201002#comment-13201002
 ] 

Mike commented on SOLR-2866:
----------------------------

Hi. FYI, I've created a new issue, SOLR-3099, requesting that this feature be 
supported in the index and in the edismax parser. I don't *think* the overlap 
is huge, but that seemed like a better approach to me, so I've branched the 
conversation over there. 
                
> Marked synonym filter for selective token expansion
> ---------------------------------------------------
>
>                 Key: SOLR-2866
>                 URL: https://issues.apache.org/jira/browse/SOLR-2866
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>         Environment: Solr 3.4
>            Reporter: Victor van der Wolf
>            Priority: Minor
>              Labels: stemming, synonyms
>             Fix For: 3.6
>
>         Attachments: MarkedSynonymFilterFactory.java, 
> SlowMarkedSynonymFilter.java, SlowMarkedSynonymFilterFactory.java
>
>
> Hi everybody,
> My name is Victor van der Wolf, and I recently started working for the Royal 
> Library in the Netherlands. One of my first assignments here was to see 
> whether I could implement a stemming algorithm for our websites. Our search 
> engine is Solr/Lucene 3.4.
> Basically I had 2 requirements to work with:
> 1) It should be possible to switch the stemming functionality on and off in 
> the front end.
> 2) No extra storage should be required (no extra indexing).
> I soon came to the conclusion that it would be practical to use the 
> SynonymFilter to do that. I got hold of a Dutch library and used a stemming 
> algorithm to generate a synonym file from it.
> Then I thought that I could maybe use 2 different query analyzers under the 
> "field type" and then call one or the other depending on whether I want 
> stemming or not, like this: q=<field>:<analyzer>:<search term>. Unfortunately 
> this did not seem possible.
> Then, after some discussions with Erick Erickson, it became clear that a good 
> approach could be to write my own SynonymFilter and apply some kind of token 
> marking to decide whether that token should be "synonymized" or not. Well, I 
> did just that and it works like a charm.
> I would like to contribute this MarkedSynonymFilter class to the project.
> I used the SynonymFilter class as a starting point and added some extra 
> functionality to that. First of all, I added 3 new parameters called lookup, 
> preMark and postmark. The preMark and postmark parameters contain a prefix 
> and a suffix used to recognize whether a token should be "synonymized" or 
> not. A simple regex is used to determine this. Then the lookup parameter 
> determines the behaviour of the MarkedSynonymFilter:
> lookup=marked - marked tokens will be synonymized
> lookup=unmarked - unmarked tokens will be synonymized
> lookup=all - all tokens should be synonymized
> lookup=none - none of the tokens should be synonymized
> I started out writing this based on version 3.3; later I discovered that we 
> were using 3.4 and I had to upgrade it. Unfortunately the whole SynonymFilter 
> code has been revised, and for the moment there are a Slow and a Fast 
> synonym filter, of which the Slow one is deprecated. My addition is based on 
> the slow version, I am afraid.
> Anyway, I am curious about your comments. Please let me know if I should go 
> forward with this and create a JIRA issue + my code as a patch.
> Cheers,
> Victor van der Wolf
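
The four lookup modes described in the quoted message can be sketched roughly
as follows. This is only an illustration of the decision logic, not the code
from the attached MarkedSynonymFilterFactory.java: the class name, the enum,
the method names, and the "[" / "]" marks in the usage note are all
hypothetical, and the regex is one guess at what "a simple regex" might look
like. It does not touch the Lucene TokenFilter API at all.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: decides per token whether synonym lookup should apply.
final class MarkedLookupSketch {

    // Mirrors the four values of the lookup parameter from the description.
    enum Lookup { MARKED, UNMARKED, ALL, NONE }

    private final Lookup lookup;
    private final Pattern markedPattern;

    MarkedLookupSketch(Lookup lookup, String preMark, String postMark) {
        this.lookup = lookup;
        // Assumed form of the regex: literal prefix, non-empty body, literal suffix.
        this.markedPattern = Pattern.compile(
                Pattern.quote(preMark) + "(.+)" + Pattern.quote(postMark));
    }

    // True if the whole token is wrapped in preMark ... postMark.
    boolean isMarked(String token) {
        return markedPattern.matcher(token).matches();
    }

    // Returns the token without its marks, or the token unchanged if unmarked.
    String stripMarks(String token) {
        Matcher m = markedPattern.matcher(token);
        return m.matches() ? m.group(1) : token;
    }

    // True if this token should be passed to the synonym lookup.
    boolean shouldExpand(String token) {
        switch (lookup) {
            case MARKED:   return isMarked(token);
            case UNMARKED: return !isMarked(token);
            case ALL:      return true;
            case NONE:     return false;
            default:       throw new IllegalStateException("unknown mode: " + lookup);
        }
    }
}
```

With lookup=marked and the hypothetical marks "[" and "]", a query token such
as "[fietsen]" would be expanded while a plain "fietsen" would be left alone,
which is how the front end could switch stemming on and off per term without
any extra indexing.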

