[ https://issues.apache.org/jira/browse/LUCENE-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13540503#comment-13540503 ]
Robert Muir commented on LUCENE-3413:
-------------------------------------

A few comments:

* TestCombiningFilter should extend BaseTokenStreamTestCase, build an Analyzer with MockTokenizer + this filter, and use the BaseTokenStreamTestCase asserts (see http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/analysis/kuromoji/src/test/org/apache/lucene/analysis/ja/TestJapaneseKatakanaStemFilter.java?view=markup for a good example of an analysis unit test).
* \@author tags should be removed.
* Indentation should be 2 spaces, not tabs.
* Instead of throwing away the return value of addAttribute(TermAttribute.class) in the ctor, just initialize it as an instance variable:
{code}
public class CombiningFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
{code}
This way you don't have to constantly look it up from the attribute map for each token; instead you just access "termAtt".
* Once the code is updated to CharTermAttribute, the various String creations can be eliminated, since it implements Appendable and CharSequence. So instead of
{code}
builder.append(ta.term());
{code}
just do:
{code}
builder.append(termAtt);
{code}
And the same at the end: instead of
{code}
ta.setTermBuffer(builder.toString());
{code}
just do:
{code}
termAtt.setEmpty().append(builder);
{code}
* In reset(), I would just call super.reset() instead of "this.input.reset()". This is a little cleaner and accomplishes the same thing (it's how the other TokenFilters do it).

> CombiningFilter to recombine tokens into a single token for sorting
> -------------------------------------------------------------------
>
>                 Key: LUCENE-3413
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3413
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 2.9.3
>            Reporter: Chris A. Mattmann
>            Priority: Minor
>         Attachments: LUCENE-3413.Mattmann.090311.patch.txt, LUCENE-3413.Mattmann.090511.patch.txt
>
>
> I whipped up this CombiningFilter for the following use case:
> I've got a bunch of titles of, e.g., books, such as:
> The Grapes of Wrath
> Tommy Tommerson saves the World
> Top of the World
> The Tales of Beedle the Bard
> Born Free
> etc.
> I want to sort these titles using a String field that includes stopword
> analysis (e.g., to remove "The") and synonym filtering (e.g., for grouping),
> etc. I created an analysis chain in Solr for this that was based off of
> *alphaOnlySort*, which looks like this:
> {code:xml}
> <fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true"
>            omitNorms="true">
>   <analyzer>
>     <!-- KeywordTokenizer does no actual tokenizing, so the entire
>          input string is preserved as a single token
>     -->
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <!-- The LowerCase TokenFilter does what you expect, which can be
>          useful when you want your sorting to be case insensitive
>     -->
>     <filter class="solr.LowerCaseFilterFactory" />
>     <!-- The TrimFilter removes any leading or trailing whitespace -->
>     <filter class="solr.TrimFilterFactory" />
>     <!-- The PatternReplaceFilter gives you the flexibility to use
>          Java Regular expressions to replace any sequence of characters
>          matching a pattern with an arbitrary replacement string,
>          which may include back references to portions of the original
>          string matched by the pattern.
>
>          See the Java Regular Expression documentation for more
>          information on pattern and replacement string syntax.
>
>          http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/package-summary.html
>     -->
>     <filter class="solr.PatternReplaceFilterFactory"
>             pattern="([^a-z])" replacement="" replace="all"
>     />
>   </analyzer>
> </fieldType>
> {code}
> The issue with alphaOnlySort is that it doesn't support stopword removal or
> synonyms, because those operate on individual tokens rather than on the
> full strings produced by the KeywordTokenizer (which does no
> tokenization). I needed a filter that would allow me to change alphaOnlySort
> and its analysis chain from using KeywordTokenizer to using
> WhitespaceTokenizer, and then a way to recombine the tokens at the end. So,
> take "The Grapes of Wrath". I needed a way for it to get turned into:
> {noformat}
> grapes of wrath
> {noformat}
> and then to combine those tokens into a single token:
> {noformat}
> grapesofwrath
> {noformat}
> The attached CombiningFilter takes care of that. I'm guessing it doesn't do
> it super efficiently (since I used a StringBuffer), but I'm open to
> suggestions on how to make it better.
> One other thing: apparently this analyzer works fine for analysis
> (e.g., it produces the desired tokens); however, for sorting in Solr I'm
> getting null sort tokens. Need to figure out why.
> Here ya go!
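For reference, the end-to-end transformation the issue describes (whitespace-tokenize, lowercase, drop stopwords, strip non-letters, recombine into one sort token) can be approximated in plain Java, without the Lucene/Solr dependencies. This is a minimal sketch, not the attached patch: the class name `SortKeyDemo`, the `sortKey` method, and the one-word stopword set are assumptions chosen to match the "The Grapes of Wrath" → "grapesofwrath" example above (a real chain would use StopFilter with a full stopword list).

```java
import java.util.Locale;
import java.util.Set;

public class SortKeyDemo {
  // Assumption: only "the" is a stopword, matching the example in the issue
  // (note "of" is kept, so "The Grapes of Wrath" -> "grapesofwrath").
  private static final Set<String> STOPWORDS = Set.of("the");

  // Plain-Java approximation of the analysis chain: whitespace tokenize,
  // lowercase, remove stopwords, strip non-letters, then recombine the
  // surviving tokens into a single sort key.
  static String sortKey(String title) {
    StringBuilder builder = new StringBuilder();
    for (String tok : title.toLowerCase(Locale.ROOT).split("\\s+")) {
      if (STOPWORDS.contains(tok)) {
        continue; // stopword removal on the token level
      }
      builder.append(tok.replaceAll("[^a-z]", "")); // PatternReplaceFilter analog
    }
    return builder.toString();
  }

  public static void main(String[] args) {
    System.out.println(sortKey("The Grapes of Wrath")); // grapesofwrath
  }
}
```

Sorting on keys like this gives the desired order ("grapesofwrath" before "topworld", ignoring the leading "The"), which is exactly what the CombiningFilter step enables once the chain is switched from KeywordTokenizer to WhitespaceTokenizer.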