[jira] [Commented] (SOLR-2866) Marked synonym filter for selective token expansion

Mike (Commented) (JIRA) Sun, 08 Jan 2012 17:16:03 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182338#comment-13182338
 ]


Mike commented on SOLR-2866:
----------------------------

This seems like a strange solution to me. The main problem I see is that this 
requires a huge synonym file, which has to be created before the filter can be 
used, and which will inevitably fail some of the time.

If I understand this correctly, I see three parts to the solution:
1. We need a new flag when indexing. Something like stems=True on the string 
type. When this flag is enabled, the index stores the stemmed versions of terms 
in addition to the unstemmed versions.
2. On the query side, a new operator is needed, as Robert Muir mentioned. 
In Sphinx search, they use the equals sign (=), so that queries like ="signing 
agreement" or =signing can be made. The query parser can then identify the 
operator and decide which word map to use, stemmed or not.
3. The "lookup" parameter makes sense to include as well, though I'd suggest we 
call it exactMatch instead, if possible. I don't see the value in the 
"unmarked" option though. How is this different from "all"? 
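
As a rough illustration of point 2, here is a minimal Java sketch of how a 
parser might split off the = operator and pick a word map per clause. The class 
name, the Clause type and the "_exact"/"_stem" field suffixes are all 
hypothetical; none of this is Solr's or Sphinx's actual query parser API.

```java
import java.util.ArrayList;
import java.util.List;

public class ExactOperatorSketch {

    /** One parsed clause: the text to search for and whether it was prefixed with '='. */
    public static final class Clause {
        public final String text;
        public final boolean exact;

        Clause(String text, boolean exact) {
            this.text = text;
            this.exact = exact;
        }

        /** Hypothetical naming: "<field>_exact" holds unstemmed terms, "<field>_stem" stemmed ones. */
        public String targetField(String field) {
            return field + (exact ? "_exact" : "_stem");
        }
    }

    /** Splits a query into bare words and "quoted phrases", each optionally prefixed with '='. */
    public static List<Clause> parse(String query) {
        List<Clause> clauses = new ArrayList<>();
        int i = 0;
        int n = query.length();
        while (i < n) {
            if (Character.isWhitespace(query.charAt(i))) {
                i++;
                continue;
            }
            boolean exact = query.charAt(i) == '=';
            if (exact) {
                i++; // consume the operator
            }
            int start;
            int end;
            if (i < n && query.charAt(i) == '"') {
                start = i + 1;
                end = query.indexOf('"', start);
                if (end < 0) {
                    end = n; // unterminated quote: take the rest of the query
                }
                i = Math.min(end + 1, n);
            } else {
                start = i;
                while (i < n && !Character.isWhitespace(query.charAt(i))) {
                    i++;
                }
                end = i;
            }
            if (end > start) {
                clauses.add(new Clause(query.substring(start, end), exact));
            }
        }
        return clauses;
    }
}
```

With this sketch, parsing ="signing agreement" contract yields two clauses: the 
phrase is flagged exact and would go to the unstemmed map, while contract 
would go to the stemmed one.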

This is probably a more complicated solution than the one proposed, and I'm 
fairly new to Solr, but I'd hate to see a solution that depends on long text 
files land and the correct solution be put off as a result (though I know 
this is code we have *now*).

A possibly-related issue is SOLR-1980, which is implementing "boundary match 
support". Almost seems like that feature could do double duty as exact match 
somehow (haven't thought that entirely through though).
                
> Marked synonym filter for selective token expansion
> ---------------------------------------------------
>
>                 Key: SOLR-2866
>                 URL: https://issues.apache.org/jira/browse/SOLR-2866
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>         Environment: Solr 3.4
>            Reporter: Victor van der Wolf
>            Priority: Minor
>              Labels: stemming, synonyms
>             Fix For: 3.6
>
>         Attachments: MarkedSynonymFilterFactory.java, 
> SlowMarkedSynonymFilter.java, SlowMarkedSynonymFilterFactory.java
>
>
> Hi everybody,
> My name is Victor van der Wolf, and I recently started working for the Royal 
> Library in the Netherlands. One of my first assignments here was to see 
> whether I could implement a stemming algorithm for our websites. Our search 
> engine is Solr/Lucene 3.4.
> Basically I had 2 requirements to work with:
> 1)       It should be possible to switch the stemming functionality on and 
> off in the front end
> 2)       No extra storage should be required (no extra indexing).
> I soon came to the conclusion that it would be practical to use the 
> SynonymFilter to do that. I got hold of a Dutch library and used a stemming 
> algorithm to generate a synonym file from it.
> Then I thought that I could maybe use 2 different query analyzers under the 
> "field type" and then call one or the other depending on whether I want 
> stemming or not, like this: q=<field>:<analyzer>:<search term>. Unfortunately 
> this did not seem possible.
> Then, after some discussions with Erick Erickson, it became clear that a good 
> approach could be to write my own SynonymFilter and apply some kind of token 
> marking to decide whether a token should be "synonymized" or not. Well, I did 
> just that and it works like a charm.
> I would like to contribute this MarkedSynonymFilter class to the project.
> I used the SynonymFilter class as a starting point and added some extra 
> functionality to it. First of all, I added 3 new parameters called lookup, 
> preMark and postmark. The preMark and postmark parameters contain a prefix 
> and a suffix used to recognize whether a token should be "synonymized" or 
> not. A simple regex is used to determine this. The lookup parameter then 
> determines the behaviour of the MarkedSynonymFilter:
> lookup=marked - marked tokens will be synonymized
> lookup=unmarked - unmarked tokens will be synonymized
> lookup=all - all tokens should be synonymized
> lookup=none - none of the tokens should be synonymized
> I started out writing this based on version 3.3; later I discovered that we 
> were using 3.4 and I had to upgrade it. Unfortunately the whole SynonymFilter 
> code has been revised, and for the moment there are the Slow and the Fast 
> synonym filters, where the Slow one is deprecated. My addition is based on 
> the slow version, I am afraid.
> Anyway, I am curious about your comments. Please let me know if I should go 
> forward with this and create a JIRA issue + my code as a patch.
> Cheers,
> Victor van der Wolf
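
The four lookup modes in the quoted description can be sketched as a small 
decision function. This is only an illustration in Java, not the attached 
MarkedSynonymFilter code: the class name, the enum and the stripMarks helper 
are hypothetical, and the sketch assumes a token counts as marked when it 
starts with preMark and ends with postMark around at least one character.

```java
import java.util.regex.Pattern;

public class MarkedLookupSketch {

    /** Mirrors the lookup values from the description: marked, unmarked, all, none. */
    public enum Lookup { MARKED, UNMARKED, ALL, NONE }

    private final Pattern markedPattern;
    private final Lookup lookup;
    private final int preLen;
    private final int postLen;

    public MarkedLookupSketch(String preMark, String postMark, Lookup lookup) {
        // Quote the marks so characters such as '*' or '.' are matched literally.
        this.markedPattern = Pattern.compile(
                Pattern.quote(preMark) + ".+" + Pattern.quote(postMark), Pattern.DOTALL);
        this.lookup = lookup;
        this.preLen = preMark.length();
        this.postLen = postMark.length();
    }

    /** True when the whole token is wrapped in the configured marks. */
    public boolean isMarked(String token) {
        return markedPattern.matcher(token).matches();
    }

    /** Decides whether synonym expansion should be applied to this token. */
    public boolean shouldExpand(String token) {
        switch (lookup) {
            case ALL:
                return true;
            case NONE:
                return false;
            case MARKED:
                return isMarked(token);
            default: // UNMARKED
                return !isMarked(token);
        }
    }

    /** Removes the marks from a marked token; unmarked tokens are returned unchanged. */
    public String stripMarks(String token) {
        return isMarked(token)
                ? token.substring(preLen, token.length() - postLen)
                : token;
    }
}
```

Under this reading, "unmarked" and "all" differ only for tokens that carry 
the marks: "all" expands them, "unmarked" leaves them alone.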

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
