[ https://issues.apache.org/jira/browse/LUCENE-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13540503#comment-13540503 ]
Robert Muir commented on LUCENE-3413:
-------------------------------------

A few comments:

* TestCombiningFilter should extend BaseTokenStreamTestCase, build an Analyzer with MockTokenizer + this filter, and use the BaseTokenStreamTestCase asserts (see http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/analysis/kuromoji/src/test/org/apache/lucene/analysis/ja/TestJapaneseKatakanaStemFilter.java?view=markup for a good example of an analysis unit test).
* \@author tags should be removed.
* Indentation should be 2 spaces, not tabs.
* Instead of throwing away the return value of addAttribute(TermAttribute.class) in the ctor, just initialize it as an instance variable:
{code}
public class CombiningFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
{code}
This way you don't have to constantly look it up from the attribute map for each token; instead you just access "termAtt".
* Once the code is updated to CharTermAttribute, the various String creations can be eliminated, since it implements Appendable and CharSequence. So instead of
{code}
builder.append(ta.term());
{code}
just do:
{code}
builder.append(termAtt);
{code}
And the same at the end: instead of
{code}
ta.setTermBuffer(builder.toString());
{code}
just do:
{code}
termAtt.setEmpty().append(builder);
{code}
* In reset(), I would just call super.reset() instead of "this.input.reset()". This is a little cleaner and accomplishes the same thing (it's how the other TokenFilters do it).

> CombiningFilter to recombine tokens into a single token for sorting
> -------------------------------------------------------------------
>
>                 Key: LUCENE-3413
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3413
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 2.9.3
>            Reporter: Chris A. Mattmann
>            Priority: Minor
>         Attachments: LUCENE-3413.Mattmann.090311.patch.txt, LUCENE-3413.Mattmann.090511.patch.txt
>
>
> I whipped up this CombiningFilter for the following use case:
> I've got a bunch of titles of, e.g., books, such as:
> The Grapes of Wrath
> Tommy Tommerson saves the World
> Top of the World
> The Tales of Beedle the Bard
> Born Free
> etc.
> I want to sort these titles using a String field that includes stopword
> analysis (e.g., to remove "The") and synonym filtering (e.g., for grouping),
> etc. I created an analysis chain in Solr for this that was based off of
> *alphaOnlySort*, which looks like this:
> {code:xml}
> <fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true"
>            omitNorms="true">
>   <analyzer>
>     <!-- KeywordTokenizer does no actual tokenizing, so the entire
>          input string is preserved as a single token
>     -->
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <!-- The LowerCase TokenFilter does what you expect, which can be
>          useful when you want your sorting to be case insensitive
>     -->
>     <filter class="solr.LowerCaseFilterFactory" />
>     <!-- The TrimFilter removes any leading or trailing whitespace -->
>     <filter class="solr.TrimFilterFactory" />
>     <!-- The PatternReplaceFilter gives you the flexibility to use
>          Java Regular expressions to replace any sequence of characters
>          matching a pattern with an arbitrary replacement string,
>          which may include back references to portions of the original
>          string matched by the pattern.
>
>          See the Java Regular Expression documentation for more
>          information on pattern and replacement string syntax.
>
>          http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/package-summary.html
>     -->
>     <filter class="solr.PatternReplaceFilterFactory"
>             pattern="([^a-z])" replacement="" replace="all"
>     />
>   </analyzer>
> </fieldType>
> {code}
> The issue with alphaOnlySort is that it doesn't support stopword removal or
> synonyms, because those operate on individual tokens rather than on the
> full strings produced by the KeywordTokenizer (which does no
> tokenization). I needed a filter that would allow me to change alphaOnlySort
> and its analysis chain from using KeywordTokenizer to using
> WhitespaceTokenizer, and then a way to recombine the tokens at the end. So,
> take "The Grapes of Wrath". I needed a way for it to get turned into:
> {noformat}
> grapes of wrath
> {noformat}
> and then to combine those tokens into a single token:
> {noformat}
> grapesofwrath
> {noformat}
> The attached CombiningFilter takes care of that. I'm guessing it doesn't do
> it super efficiently (since I used a StringBuffer), but I'm open to
> suggestions on how to make it better.
> One other thing: apparently this analyzer works fine for analysis
> (e.g., it produces the desired tokens); however, for sorting in Solr I'm
> getting null sort tokens. Need to figure out why.
> Here ya go!
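For reference, the end-to-end transformation the issue describes (whitespace-tokenize, lowercase, drop stopwords, strip non-letters, recombine into one sort token) can be approximated in plain Java, without the Lucene/Solr dependencies. This is a minimal sketch, not the attached patch: the class name `SortKeyDemo`, the `sortKey` method, and the one-word stopword set are assumptions chosen to match the "The Grapes of Wrath" → "grapesofwrath" example above (a real chain would use StopFilter with a full stopword list).

```java
import java.util.Locale;
import java.util.Set;

public class SortKeyDemo {
  // Assumption: only "the" is a stopword, matching the example in the issue
  // (note "of" is kept, so "The Grapes of Wrath" -> "grapesofwrath").
  private static final Set<String> STOPWORDS = Set.of("the");

  // Plain-Java approximation of the analysis chain: whitespace tokenize,
  // lowercase, remove stopwords, strip non-letters, then recombine the
  // surviving tokens into a single sort key.
  static String sortKey(String title) {
    StringBuilder builder = new StringBuilder();
    for (String tok : title.toLowerCase(Locale.ROOT).split("\\s+")) {
      if (STOPWORDS.contains(tok)) {
        continue; // stopword removal on the token level
      }
      builder.append(tok.replaceAll("[^a-z]", "")); // PatternReplaceFilter analog
    }
    return builder.toString();
  }

  public static void main(String[] args) {
    System.out.println(sortKey("The Grapes of Wrath")); // grapesofwrath
  }
}
```

Sorting on keys like this gives the desired order ("grapesofwrath" before "topworld", ignoring the leading "The"), which is exactly what the CombiningFilter step enables once the chain is switched from KeywordTokenizer to WhitespaceTokenizer.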