[
https://issues.apache.org/jira/browse/SOLR-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140274#comment-13140274
]
Steven Rowe commented on SOLR-2866:
-----------------------------------
This sounds interesting.
Some questions:
# Couldn’t you combine and generalize the preMark and postmark options into a
single regular expression option? Full regex capability, including matching on
things that are neither prefix nor suffix, would increase the usefulness of
this filter.
# Under what circumstances would {{lookup=all}} and {{lookup=none}} option
values be useful? Seems like you could just use SynonymFilter (for
{{lookup=all}}) or no filter at all (for {{lookup=none}}) instead? Looking at
your code, this question becomes: why do you need the {{marked}} parameter to
the filter ctor? If that were eliminated, the {{invert}} option would be
sufficient to enable {{lookup=marked}} and {{lookup=unmarked}}.
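To make the two questions above concrete, here is a minimal sketch of the selection logic with a single regex option plus {{invert}}. It is not taken from the attached code: the class name TokenSelector, its methods, and the leading "~" marker are all hypothetical, and a real filter would operate on a Lucene TokenStream rather than on plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/**
 * Hypothetical helper, not part of Solr or Lucene: decides which tokens a
 * selective synonym filter would hand to synonym lookup.
 */
final class TokenSelector {
    private final Pattern pattern; // one regex replaces preMark/postmark
    private final boolean invert;  // false: expand matching tokens; true: expand the rest

    TokenSelector(String regex, boolean invert) {
        this.pattern = Pattern.compile(regex);
        this.invert = invert;
    }

    /** True when the token should be handed to synonym lookup. */
    boolean shouldExpand(String token) {
        // find() also matches inside a token, not only at its start or end.
        boolean matches = pattern.matcher(token).find();
        return matches != invert;
    }

    /** Returns the tokens that would be expanded, in input order. */
    List<String> select(List<String> tokens) {
        List<String> selected = new ArrayList<>();
        for (String token : tokens) {
            if (shouldExpand(token)) {
                selected.add(token);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // A leading "~" is an assumed marker, chosen only for illustration.
        List<String> tokens = Arrays.asList("~fiets", "boek", "~lopen");
        System.out.println(new TokenSelector("^~", false).select(tokens)); // like lookup=marked
        System.out.println(new TokenSelector("^~", true).select(tokens));  // like lookup=unmarked
    }
}
```

With this shape, invert=false behaves like {{lookup=marked}} and invert=true like {{lookup=unmarked}}, so no separate {{marked}} parameter is needed; the {{all}} and {{none}} cases would be covered by a plain SynonymFilter or by no filter.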
Also, a naming question: "Marked" to me implies a separate process that only
adds a mark for later processing, but I think you mean something like
"matching" instead? My suggestion: SelectiveSynonymFilter.
One more naming issue: I don't think "Slow" should be part of the class names.
> Marked synonym filter for selective token expansion
> ---------------------------------------------------
>
> Key: SOLR-2866
> URL: https://issues.apache.org/jira/browse/SOLR-2866
> Project: Solr
> Issue Type: Improvement
> Components: Schema and Analysis
> Environment: Solr 3.4
> Reporter: Victor van der Wolf
> Priority: Minor
> Labels: stemming, synonyms
> Fix For: 3.5
>
> Attachments: MarkedSynonymFilterFactory.java,
> SlowMarkedSynonymFilter.java, SlowMarkedSynonymFilterFactory.java
>
>
> Hi everybody,
> My name is Victor van der Wolf, and I recently started working for the Royal
> Library in the Netherlands. One of my first assignments here was to see if I
> could implement some stemming algorithm for our websites. Our search engine
> is Solr/Lucene 3.4.
> Basically I had 2 requirements to work with:
> 1) It should be possible to switch the stemming functionality on and
> off in the front end
> 2) No extra storage should be required (no extra indexing).
> I soon came to the conclusion that it would be practical to use the
> SynonymFilter to do that. I got hold of a Dutch library and used a stemming
> algorithm to generate a synonym file from it.
> Then I thought that I could maybe use 2 different query analyzers under the
> "field type" and then call one or the other depending on whether I want
> stemming or not, like this: q=<field>:<analyzer>:<search term>. Unfortunately
> this did not seem possible.
> Then, after some discussions with Erick Erickson, it became clear that a good
> approach could be to write my own SynonymFilter and apply some kind of token
> marking to decide if that token should be "synonymized" or not. Well, I did
> just that and it works like a charm.
> I would like to contribute this MarkedSynonymFilter class to the project.
> I used the SynonymFilter class as a starting point and added some extra
> functionality to that. First of all, I added 3 new parameters called lookup,
> preMark and postmark. The preMark and postmark parameters contain a prefix
> and a suffix used to recognize whether a token should be "synonymized" or
> not. A simple regex is used to determine this. Then the lookup parameter determines
> the behaviour of the MarkedSynonymFilter:
> lookup=marked - marked tokens will be synonymized
> lookup=unmarked - unmarked tokens will be synonymized
> lookup=all - all tokens should be synonymized
> lookup=none - none of the tokens should be synonymized
> I started out writing this based on version 3.3; later I discovered that we
> were using 3.4 and I had to upgrade it. Unfortunately the whole SynonymFilter
> code has been revised, and for the moment there are the Slow and the Fast
> synonym filters, where the Slow one is deprecated. My addition is based on the
> slow version, I am afraid.
> Anyway, I am curious about your comments. Please let me know if I should go
> forward with this and create a JIRA issue + my code as a patch.
> Cheers,
> Victor van der Wolf