[jira] [Commented] (SOLR-2866) Marked synonym filter for selective token expansion

Steven Rowe (Commented) (JIRA) Mon, 31 Oct 2011 09:29:57 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140274#comment-13140274
 ]


Steven Rowe commented on SOLR-2866:
-----------------------------------

This sounds interesting.

Some questions:

# Couldn’t you combine and generalize the Premark and Postmark options, to be a 
single regular expression option?  Full regex capability, including matching on 
things that are neither prefix nor postfix, would increase the usefulness of 
this filter.
# Under what circumstances would {{lookup=all}} and {{lookup=none}} option 
values be useful?  Seems like you could just use SynonymFilter (for 
{{lookup=all}}) or no filter at all (for {{lookup=none}}) instead?  Looking at 
your code, this question becomes: why do you need the {{marked}} parameter to 
the filter ctor?  If that were eliminated, the {{invert}} option would be 
sufficient to enable {{lookup=marked}} and {{lookup=unmarked}}.
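
To make the two questions above concrete, here is a minimal sketch of the per-token test they suggest: one regular expression plus an {{invert}} flag. The class name {{SelectiveMatch}}, the method {{eligible}} and the "~" marker are hypothetical and do not come from the attached files or from Lucene/Solr; a real filter would apply such a test to each token's term text.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the attached patch.
final class SelectiveMatch {
    private SelectiveMatch() {}

    /**
     * Builds the per-token test. A token is eligible for synonym lookup when
     * the regex is found in it, or when it is NOT found if invert is true.
     * A prefix mark is written as "^mark", a suffix mark as "mark$".
     */
    static Predicate<String> eligible(String regex, boolean invert) {
        Pattern pattern = Pattern.compile(regex);
        // find() rather than matches(), so the regex may match anywhere in the term.
        Predicate<String> found = term -> pattern.matcher(term).find();
        return invert ? found.negate() : found;
    }
}
```

Under these assumptions, {{eligible("^~", false)}} would behave like {{lookup=marked}} and {{eligible("^~", true)}} like {{lookup=unmarked}}, which leaves {{lookup=all}} and {{lookup=none}} to be covered by a plain SynonymFilter or by no filter at all.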

Also, a naming question: "Marked" to me implies a separate process that only 
adds a mark for later processing, but I think you mean something like 
"matching" instead?  My suggestion: SelectiveSynonymFilter.

One more naming issue: I don't think "Slow" should be part of the class names.
                
> Marked synonym filter for selective token expansion
> ---------------------------------------------------
>
>                 Key: SOLR-2866
>                 URL: https://issues.apache.org/jira/browse/SOLR-2866
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>         Environment: Solr 3.4
>            Reporter: Victor van der Wolf
>            Priority: Minor
>              Labels: stemming, synonyms
>             Fix For: 3.5
>
>         Attachments: MarkedSynonymFilterFactory.java, 
> SlowMarkedSynonymFilter.java, SlowMarkedSynonymFilterFactory.java
>
>
> Hi everybody,
> My name is Victor van der Wolf and I recently started working for the Royal 
> Library in the Netherlands. One of my first assignments here was to see if I 
> could implement a stemming algorithm for our websites. Our search engine 
> is Solr/Lucene 3.4.
> Basically I had 2 requirements to work with:
> 1)       It should be possible to switch the stemming functionality on and 
> off in the front end
> 2)       No extra storage should be required (no extra indexing).
> I soon came to the conclusion that it would be practical to use the 
> SynonymFilter to do that. I got hold of a Dutch library and used a stemming 
> algorithm to generate a synonym file from it.
> Then I thought that I could maybe use 2 different query analyzers under the 
> "field type" and then call one or the other depending on whether I want 
> stemming or not, like this: q=<field>:<analyzer>:<search term>. Unfortunately 
> this did not seem possible.
> Then, after some discussions with Erick Erickson, it became clear that a good 
> approach could be to write my own SynonymFilter and apply some kind of token 
> marking to decide whether that token should be "synonymized" or not. Well, I did 
> just that and it works like a charm.
> I would like to contribute this MarkedSynonymFilter class to the project.
> I used the SynonymFilter class as a starting point and added some extra 
> functionality to that. First of all, I added 3 new parameters called lookup, 
> preMark and postmark. The preMark and postmark parameters hold a prefix and 
> a suffix that identify whether a token should be "synonymized" or not. A 
> simple regex is used to determine this. Then the lookup parameter determines 
> the behaviour of the MarkedSynonymFilter:
> lookup=marked - marked tokens will be synonymized
> lookup=unmarked - unmarked tokens will be synonymized
> lookup=all - all tokens should be synonymized
> lookup=none - none of the tokens should be synonymized
> I started out writing this based on version 3.3; later I discovered that we 
> were using 3.4 and I had to upgrade it. Unfortunately the whole SynonymFilter 
> code has been revised, and for the moment there is the Slow and the Fast 
> synonym filter, where the Slow one is deprecated. My addition is based on the 
> slow version, I am afraid.
> Anyway, I am curious about your comments. Please let me know if I should go 
> forward with this and create a JIRA issue + my code as a patch.
> Cheers,
> Victor van der Wolf
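
For readers without the attachments, the four lookup values described in the quoted proposal can be sketched roughly as follows. This is an illustration under stated assumptions, not the attached code: the class name {{LookupSketch}}, the enum, the method {{shouldSynonymize}} and the way the regex is built from the two marks are all hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; names are illustrative and not taken from the attached files.
final class LookupSketch {
    enum Lookup { MARKED, UNMARKED, ALL, NONE }

    private final Pattern mark;
    private final Lookup lookup;

    LookupSketch(String preMark, String postMark, Lookup lookup) {
        // Quote both marks so characters such as '[' or '*' are matched literally.
        this.mark = Pattern.compile(
                Pattern.quote(preMark) + ".*" + Pattern.quote(postMark),
                Pattern.DOTALL);
        this.lookup = lookup;
    }

    /** Returns true when the given term text should go through synonym lookup. */
    boolean shouldSynonymize(String term) {
        // matches() requires the whole term to be prefix + anything + suffix.
        boolean marked = mark.matcher(term).matches();
        switch (lookup) {
            case MARKED:   return marked;
            case UNMARKED: return !marked;
            case ALL:      return true;
            default:       return false; // NONE
        }
    }
}
```

In this sketch a term counts as marked only when it starts with the prefix and ends with the suffix; whether the attached filter also strips the marks before the lookup is not stated in the description, so that step is left out.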

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


