[jira] [Commented] (LUCENE-3413) CombiningFilter to recombine tokens into a single token for sorting

Chris A. Mattmann (JIRA) Sat, 03 Sep 2011 23:01:02 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096819#comment-13096819 ]


Chris A. Mattmann commented on LUCENE-3413:
-------------------------------------------

Hmmm, maybe #reset is getting called somewhere. I wrote another unit test that 
calls reset and then calls incrementToken again. As it turns out, it fails: 
calling input.reset in CombiningFilter calls, e.g., LowerCaseFilter.reset, which 
in turn calls KeywordTokenizer.reset. That call to KeywordTokenizer.reset does 
*nothing*; it just uses the stub method in TokenStream, even though 
KeywordTokenizer has a #reset method that takes a Reader input. 

I wonder if the lack of a working reset method is messing stuff up. What tells 
me that's probably not the cause, though, is that LowerCaseFilter just uses the 
default parent class #reset (which just calls its input.reset), so I don't 
think that's an issue. Sigh.
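
To make the overload-versus-override point concrete, here is a minimal sketch in plain Java. The classes SketchStream, SketchKeywordSource, and SketchLowerCase are hypothetical stand-ins written for this illustration; they are not Lucene's TokenStream, KeywordTokenizer, or LowerCaseFilter, and they only mimic the behavior described above: a no-argument reset that is a stub, plus a separate reset that takes a Reader.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;

// Hypothetical stand-ins for illustration only; these are NOT Lucene's classes.
abstract class SketchStream {
    /** Returns the next token, or null when the stream is exhausted. */
    abstract String next();

    /** Stub reset: does nothing, like the default described in the comment above. */
    void reset() { }
}

/** Emits its whole input as a single token, in the style of a keyword tokenizer. */
class SketchKeywordSource extends SketchStream {
    private Reader input;
    private boolean done = false;

    SketchKeywordSource(Reader input) { this.input = input; }

    @Override
    String next() {
        if (done) return null;
        done = true;
        StringBuilder sb = new StringBuilder();
        try {
            for (int c = input.read(); c != -1; c = input.read()) sb.append((char) c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // An overload, not an override: a plain reset() call still hits the stub.
    void reset(Reader newInput) {
        this.input = newInput;
        this.done = false;
    }
}

/** Lowercases tokens; its reset() only delegates to the wrapped stream. */
class SketchLowerCase extends SketchStream {
    private final SketchStream in;

    SketchLowerCase(SketchStream in) { this.in = in; }

    @Override
    String next() {
        String t = in.next();
        return t == null ? null : t.toLowerCase(Locale.ROOT);
    }

    @Override
    void reset() { in.reset(); }
}

final class ResetSketch {
    public static void main(String[] args) {
        SketchKeywordSource source = new SketchKeywordSource(new StringReader("Born Free"));
        SketchStream chain = new SketchLowerCase(source);
        System.out.println(chain.next()); // prints "born free"
        chain.reset();                    // reaches the stub; nothing is rewound
        System.out.println(chain.next()); // prints "null"
        source.reset(new StringReader("Born Free")); // only the Reader overload rewinds
        System.out.println(chain.next()); // prints "born free"
    }
}
```

In this sketch the second pass yields nothing after the no-argument reset, which matches the failing unit test described above. Whether that is what actually produces the null sort tokens is still an open question.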


> CombiningFilter to recombine tokens into a single token for sorting
> -------------------------------------------------------------------
>
>                 Key: LUCENE-3413
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3413
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 2.9.3
>            Reporter: Chris A. Mattmann
>            Priority: Minor
>         Attachments: LUCENE-3413.Mattmann.090311.2.patch, LUCENE-3413.Mattmann.090311.patch.txt
>
>
> I whipped up this CombiningFilter for the following use case:
> I've got a bunch of titles of e.g., Books, such as:
> The Grapes of Wrath
> Tommy Tommerson saves the World
> Top of the World
> The Tales of Beedle the Bard
> Born Free
> etc.
> I want to sort these titles using a String field that includes stopword 
> analysis (e.g., to remove "The"), and synonym filtering (e.g., for grouping), 
> etc. I created an analysis chain in Solr for this that was based off of 
> *alphaOnlySort*, which looks like this:
> {code:xml}
> <fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
>    <analyzer>
>         <!-- KeywordTokenizer does no actual tokenizing, so the entire
>              input string is preserved as a single token
>           -->
>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>         <!-- The LowerCase TokenFilter does what you expect, which can be
>              useful when you want your sorting to be case insensitive
>           -->
>         <filter class="solr.LowerCaseFilterFactory" />
>         <!-- The TrimFilter removes any leading or trailing whitespace -->
>         <filter class="solr.TrimFilterFactory" />
>         <!-- The PatternReplaceFilter gives you the flexibility to use
>              Java regular expressions to replace any sequence of characters
>              matching a pattern with an arbitrary replacement string, 
>              which may include back references to portions of the original
>              string matched by the pattern.
>              
>              See the Java Regular Expression documentation for more
>              information on pattern and replacement string syntax.
>              
>              http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/package-summary.html
>           -->
>         <filter class="solr.PatternReplaceFilterFactory"
>                 pattern="([^a-z])" replacement="" replace="all"
>         /> 
>    </analyzer>
> </fieldType>
> {code}
> The issue with alphaOnlySort is that it doesn't support stopword removal or 
> synonyms, because those filters operate on individual tokens rather than on 
> the full string produced by the KeywordTokenizer (which does not do 
> tokenization). I needed a filter that would allow me to change alphaOnlySort 
> and its analysis chain from using KeywordTokenizer to using 
> WhitespaceTokenizer, and then a way to recombine the tokens at the end. So, 
> take "The Grapes of Wrath". I needed a way for it to get turned into:
> {noformat}
> grapes of wrath
> {noformat}
> And then to combine those tokens into a single token:
> {noformat}
> grapesofwrath
> {noformat}
> The attached CombiningFilter takes care of that. I'm guessing it doesn't do it 
> super efficiently (since I used a StringBuffer), but I'm open to suggestions 
> on how to make it better. 
> One other thing: this analyzer apparently works fine for analysis (i.e., it 
> produces the desired tokens); however, for sorting in Solr I'm getting null 
> sort tokens. Need to figure out why. 
> Here ya go!
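
The transformation described in the issue above (split on whitespace, lowercase, drop stopwords, then recombine into one sort key) can be sketched in plain Java without Lucene. This is only an illustration of the intended result, not the attached CombiningFilter. The class SortKeySketch, its method names, and the one-word stopword set are assumptions made for the example; the stopword set contains only "the" so that "of" survives, as in the "grapes of wrath" example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration; not the CombiningFilter from the attached patch.
final class SortKeySketch {
    // Assumed stopword set, chosen so that "of" is kept as in the example.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the"));

    /** Whitespace-tokenizes, lowercases, and removes stopwords. */
    static List<String> analyze(String title) {
        return Arrays.stream(title.trim().split("\\s+"))
                .map(t -> t.toLowerCase(Locale.ROOT))
                .filter(t -> !t.isEmpty() && !STOPWORDS.contains(t))
                .collect(Collectors.toList());
    }

    /** Recombines tokens into one token, keeping only a-z as alphaOnlySort does. */
    static String combine(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (String t : tokens) {
            sb.append(t.replaceAll("[^a-z]", ""));
        }
        return sb.toString();
    }

    static String sortKey(String title) {
        return combine(analyze(title));
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Grapes of Wrath")); // prints [grapes, of, wrath]
        System.out.println(sortKey("The Grapes of Wrath")); // prints grapesofwrath
    }
}
```

The sketch uses StringBuilder, the unsynchronized counterpart of StringBuffer, which is one possible answer to the efficiency concern raised in the description.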

