[jira] [Commented] (LUCENE-3413) CombiningFilter to recombine tokens into a single token for sorting

Chris A. Mattmann (JIRA) Sat, 03 Sep 2011 23:58:58 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096829#comment-13096829
 ]


Chris A. Mattmann commented on LUCENE-3413:
-------------------------------------------

Thanks guys! Your updates fixed it! It's not sorting correctly!

I'll prepare two patches. One for Lucene that implements your suggestions. And 
another for Solr (containing the super trivial factory to instantiate this).

Thanks, again!

> CombiningFilter to recombine tokens into a single token for sorting
> -------------------------------------------------------------------
>
>                 Key: LUCENE-3413
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3413
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 2.9.3
>            Reporter: Chris A. Mattmann
>            Priority: Minor
>         Attachments: LUCENE-3413.Mattmann.090311.2.patch, 
> LUCENE-3413.Mattmann.090311.patch.txt
>
>
> I whipped up this CombiningFilter for the following use case:
> I've got a bunch of titles of e.g., Books, such as:
> The Grapes of Wrath
> Tommy Tommerson saves the World
> Top of the World
> The Tales of Beedle the Bard
> Born Free
> etc.
> I want to sort these titles using a String field that includes stopword 
> analysis (e.g., to remove "The"), and synonym filtering (e.g., for grouping), 
> etc. I created an analysis chain in Solr for this that was based off of 
> *alphaOnlySort*, which looks like this:
> {code:xml}
> <fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" 
> omitNorms="true">
>    <analyzer>
>         <!-- KeywordTokenizer does no actual tokenizing, so the entire
>              input string is preserved as a single token
>           -->
>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>         <!-- The LowerCase TokenFilter does what you expect, which can be
>              when you want your sorting to be case insensitive
>           -->
>         <filter class="solr.LowerCaseFilterFactory" />
>         <!-- The TrimFilter removes any leading or trailing whitespace -->
>         <filter class="solr.TrimFilterFactory" />
>         <!-- The PatternReplaceFilter gives you the flexibility to use
>              Java Regular expression to replace any sequence of characters
>              matching a pattern with an arbitrary replacement string, 
>              which may include back references to portions of the original
>              string matched by the pattern.
>              
>              See the Java Regular Expression documentation for more
>              information on pattern and replacement string syntax.
>              
>              
> http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/package-summary.html
>           -->
>         <filter class="solr.PatternReplaceFilterFactory"
>                 pattern="([^a-z])" replacement="" replace="all"
>         /> 
>     </analyzer>       
>     </fieldType>
> {code}
> The issue with alphaOnlySort is that it doesn't support stopword remove or 
> synonyms because those are based on the original token level instead of the 
> full strings produced by the KeywordTokenizer (which does not do 
> tokenization). I needed a filter that would allow me to change alphaOnlySort 
> and its analysis chain from using KeywordTokenizer to using 
> WhitespaceTokenizer, and then a way to recombine the tokens at the end. So, 
> take "The Grapes of Wrath". I needed a way for it to get turned into:
> {noformat}
> grapes of wrath
> {noformat}
> And then to combine those tokens into a single token:
> {noformat}
> grapesofwrath
> {noformat}
> The attached CombiningFilter takes care of that. It doesn't do it super 
> efficiently I'm guessing (since I used a StringBuffer), but I'm open to 
> suggestions on how to make it better. 
> One other thing is that apparently this analyzer works fine for analysis 
> (e.g., it produces the desired tokens), however, for sorting in Solr I'm 
> getting null sort tokens. Need to figure out why. 
> Here ya go!

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-3413) CombiningFilter to recombine tokens into a single token for sorting

Reply via email to