[jira] [Commented] (LUCENE-3413) CombiningFilter to recombine tokens into a single token for sorting

Chris A. Mattmann (JIRA) Fri, 28 Dec 2012 21:22:15 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13540773#comment-13540773
 ]


Chris A. Mattmann commented on LUCENE-3413:
-------------------------------------------

Thanks for the comments, Robert. I'll take a pass at updating the patch per your 
comments. Lance, I *think* I get what you're saying. This is now in production 
at a fairly large company that I was consulting for, and it is working fine 
for their titles, etc., so I think it's still pretty useful.
                
> CombiningFilter to recombine tokens into a single token for sorting
> -------------------------------------------------------------------
>
>                 Key: LUCENE-3413
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3413
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 2.9.3
>            Reporter: Chris A. Mattmann
>            Priority: Minor
>         Attachments: LUCENE-3413.Mattmann.090311.patch.txt, 
> LUCENE-3413.Mattmann.090511.patch.txt
>
>
> I whipped up this CombiningFilter for the following use case:
> I've got a bunch of titles (e.g., of books), such as:
> The Grapes of Wrath
> Tommy Tommerson saves the World
> Top of the World
> The Tales of Beedle the Bard
> Born Free
> etc.
> I want to sort these titles using a String field that includes stopword 
> analysis (e.g., to remove "The"), and synonym filtering (e.g., for grouping), 
> etc. I created an analysis chain in Solr for this that was based off of 
> *alphaOnlySort*, which looks like this:
> {code:xml}
> <fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" 
> omitNorms="true">
>    <analyzer>
>         <!-- KeywordTokenizer does no actual tokenizing, so the entire
>              input string is preserved as a single token
>           -->
>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>         <!-- The LowerCase TokenFilter does what you expect, which can be
>              useful when you want your sorting to be case insensitive
>           -->
>         <filter class="solr.LowerCaseFilterFactory" />
>         <!-- The TrimFilter removes any leading or trailing whitespace -->
>         <filter class="solr.TrimFilterFactory" />
>         <!-- The PatternReplaceFilter gives you the flexibility to use
>              Java Regular expression to replace any sequence of characters
>              matching a pattern with an arbitrary replacement string, 
>              which may include back references to portions of the original
>              string matched by the pattern.
>              
>              See the Java Regular Expression documentation for more
>              information on pattern and replacement string syntax.
>              
>              
> http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/package-summary.html
>           -->
>         <filter class="solr.PatternReplaceFilterFactory"
>                 pattern="([^a-z])" replacement="" replace="all"
>         /> 
>    </analyzer>
> </fieldType>
> {code}
> The issue with alphaOnlySort is that it doesn't support stopword removal or 
> synonyms, because those filters operate on individual tokens rather than on 
> the full string produced by the KeywordTokenizer (which does not do 
> tokenization). I needed a filter that would allow me to change alphaOnlySort 
> and its analysis chain from using KeywordTokenizer to using 
> WhitespaceTokenizer, and then a way to recombine the tokens at the end. So, 
> take "The Grapes of Wrath". I needed a way for it to get turned into:
> {noformat}
> grapes of wrath
> {noformat}
> And then to combine those tokens into a single token:
> {noformat}
> grapesofwrath
> {noformat}
> The attached CombiningFilter takes care of that. I'm guessing it doesn't do 
> it very efficiently (since I used a StringBuffer), but I'm open to 
> suggestions on how to make it better. 
> One other thing: this analyzer apparently works fine for analysis 
> (e.g., it produces the desired tokens); for sorting in Solr, however, I'm 
> getting null sort tokens. I need to figure out why. 
> Here ya go!
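
For readers without the patch at hand, the intended transformation can be sketched in plain Java. This is not the attached CombiningFilter and it does not use the Lucene TokenStream API. It only imitates the chain described above (whitespace tokenization, lowercasing, stripping non-letters, stopword removal, then recombination). The class name CombineSketch, its method names, and the three-word stopword list are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class CombineSketch {
    // Hypothetical stopword list, for illustration only. "of" is deliberately
    // absent so the result matches the "grapes of wrath" example in the issue.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "a", "an"));

    // Imitates the per-token part of the chain: split on whitespace,
    // lowercase, strip everything outside a-z, then drop stopwords.
    static List<String> analyze(String title) {
        List<String> tokens = new ArrayList<>();
        for (String raw : title.trim().split("\\s+")) {
            String token = raw.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // The recombination step: concatenate all surviving tokens into one string.
    // StringBuilder is the unsynchronized counterpart of StringBuffer.
    static String combine(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            sb.append(token);
        }
        return sb.toString();
    }

    // Convenience wrapper: title in, single sort key out.
    static String sortKey(String title) {
        return combine(analyze(title));
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Grapes of Wrath")); // [grapes, of, wrath]
        System.out.println(sortKey("The Grapes of Wrath")); // grapesofwrath
    }
}
```

In a real analysis chain the recombination would have to happen inside a TokenFilter that consumes the whole upstream stream before emitting its single token. The sketch sidesteps that by working on strings, so it says nothing about the null sort tokens mentioned above.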
