[ https://issues.apache.org/jira/browse/LUCENE-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13540773#comment-13540773 ]
Chris A. Mattmann commented on LUCENE-3413:
-------------------------------------------
Thanks for the comments, Robert. I'll take a pass at updating the patch per your
comments. Lance, I *think* I get what you're saying. This is now in production
at a fairly large company that I was consulting for, and it is working fine
for their titles, etc., so I think it's still pretty useful.
> CombiningFilter to recombine tokens into a single token for sorting
> -------------------------------------------------------------------
>
> Key: LUCENE-3413
> URL: https://issues.apache.org/jira/browse/LUCENE-3413
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Affects Versions: 2.9.3
> Reporter: Chris A. Mattmann
> Priority: Minor
> Attachments: LUCENE-3413.Mattmann.090311.patch.txt,
> LUCENE-3413.Mattmann.090511.patch.txt
>
>
> I whipped up this CombiningFilter for the following use case:
> I've got a bunch of titles (e.g., of books), such as:
> The Grapes of Wrath
> Tommy Tommerson saves the World
> Top of the World
> The Tales of Beedle the Bard
> Born Free
> etc.
> I want to sort these titles using a String field that includes stopword
> analysis (e.g., to remove "The"), and synonym filtering (e.g., for grouping),
> etc. I created an analysis chain in Solr for this that was based off of
> *alphaOnlySort*, which looks like this:
> {code:xml}
> <fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true"
> omitNorms="true">
> <analyzer>
> <!-- KeywordTokenizer does no actual tokenizing, so the entire
> input string is preserved as a single token
> -->
> <tokenizer class="solr.KeywordTokenizerFactory"/>
> <!-- The LowerCase TokenFilter does what you expect, which can be useful
> when you want your sorting to be case insensitive
> -->
> <filter class="solr.LowerCaseFilterFactory" />
> <!-- The TrimFilter removes any leading or trailing whitespace -->
> <filter class="solr.TrimFilterFactory" />
> <!-- The PatternReplaceFilter gives you the flexibility to use
> Java Regular expression to replace any sequence of characters
> matching a pattern with an arbitrary replacement string,
> which may include back references to portions of the original
> string matched by the pattern.
>
> See the Java Regular Expression documentation for more
> information on pattern and replacement string syntax.
>
>
> http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/package-summary.html
> -->
> <filter class="solr.PatternReplaceFilterFactory"
> pattern="([^a-z])" replacement="" replace="all"
> />
> </analyzer>
> </fieldType>
> {code}
> The issue with alphaOnlySort is that it doesn't support stopword removal or
> synonyms, because those filters operate on individual tokens rather than on
> the full string produced by the KeywordTokenizer (which does no
> tokenization). I needed a filter that would let me change alphaOnlySort's
> analysis chain from KeywordTokenizer to WhitespaceTokenizer, plus a way to
> recombine the tokens at the end. So, take "The Grapes of Wrath". I needed a
> way for it to get turned into:
> {noformat}
> grapes of wrath
> {noformat}
> And then to combine those tokens into a single token:
> {noformat}
> grapesofwrath
> {noformat}
> The attached CombiningFilter takes care of that. I'm guessing it isn't very
> efficient (I used a StringBuffer), but I'm open to suggestions on how to
> make it better.
> One other thing: this analyzer works fine for analysis (i.e., it produces
> the desired tokens); however, when sorting in Solr I'm getting null sort
> tokens. I need to figure out why.
> Here ya go!
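For readers without the patch at hand, the recombination described in the issue can be sketched on plain strings. This is only an illustration under stated assumptions, not the attached CombiningFilter: it does not use the Lucene TokenStream API, the class and method names (`CombiningSketch`, `analyze`, `combine`, `sortKey`) are hypothetical, and the stopword set contains only "the" because that is the only stopword the issue mentions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-string sketch of the chain described in the issue; all names are hypothetical.
class CombiningSketch {
    // Assumed stopword list: the issue only mentions removing "The".
    static final Set<String> STOPWORDS = Set.of("the");

    // Stands in for whitespace tokenization + lowercasing + stopword removal.
    static List<String> analyze(String title) {
        List<String> tokens = new ArrayList<>();
        for (String raw : title.trim().split("\\s+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // The recombination step: concatenate the surviving tokens into one string.
    static String combine(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            sb.append(token);
        }
        return sb.toString();
    }

    // Full sort key: analyze, combine, then drop anything outside a-z,
    // mirroring the pattern="([^a-z])" replacement in alphaOnlySort.
    static String sortKey(String title) {
        return combine(analyze(title)).replaceAll("[^a-z]", "");
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Grapes of Wrath")); // [grapes, of, wrath]
        System.out.println(sortKey("The Grapes of Wrath")); // grapesofwrath
    }
}
```

The sketch uses a StringBuilder rather than a StringBuffer: the two expose the same append API, but StringBuilder is unsynchronized, which is usually sufficient when the buffer is only touched by one thread.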