[ https://issues.apache.org/jira/browse/LUCENE-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547555#comment-13547555 ]
Alexandre Rafalovitch commented on LUCENE-3413:
-----------------------------------------------

Any chance this filter could take an optional 'connector' parameter to put between tokens when joining them? That way one could use '_' for sorting and (my need) a ' ' for recreating the original string after stripping some token types.

> CombiningFilter to recombine tokens into a single token for sorting
> -------------------------------------------------------------------
>
>                 Key: LUCENE-3413
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3413
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 2.9.3
>            Reporter: Chris A. Mattmann
>            Priority: Minor
>         Attachments: LUCENE-3413.Mattmann.090311.patch.txt,
>                      LUCENE-3413.Mattmann.090511.patch.txt
>
>
> I whipped up this CombiningFilter for the following use case:
> I've got a bunch of titles of, e.g., books, such as:
> The Grapes of Wrath
> Tommy Tommerson saves the World
> Top of the World
> The Tales of Beedle the Bard
> Born Free
> etc.
> I want to sort these titles using a String field that includes stopword
> analysis (e.g., to remove "The"), and synonym filtering (e.g., for grouping),
> etc.
> I created an analysis chain in Solr for this that was based off of
> *alphaOnlySort*, which looks like this:
> {code:xml}
> <fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true"
>            omitNorms="true">
>   <analyzer>
>     <!-- KeywordTokenizer does no actual tokenizing, so the entire
>          input string is preserved as a single token
>       -->
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <!-- The LowerCase TokenFilter does what you expect, which can be
>          useful when you want your sorting to be case insensitive
>       -->
>     <filter class="solr.LowerCaseFilterFactory" />
>     <!-- The TrimFilter removes any leading or trailing whitespace -->
>     <filter class="solr.TrimFilterFactory" />
>     <!-- The PatternReplaceFilter gives you the flexibility to use
>          Java Regular expression to replace any sequence of characters
>          matching a pattern with an arbitrary replacement string,
>          which may include back references to portions of the original
>          string matched by the pattern.
>
>          See the Java Regular Expression documentation for more
>          information on pattern and replacement string syntax.
>
>          http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/package-summary.html
>       -->
>     <filter class="solr.PatternReplaceFilterFactory"
>             pattern="([^a-z])" replacement="" replace="all"
>     />
>   </analyzer>
> </fieldType>
> {code}
> The issue with alphaOnlySort is that it doesn't support stopword removal or
> synonyms, because those operate on individual tokens rather than on the
> full string produced by the KeywordTokenizer (which does not do
> tokenization). I needed a filter that would allow me to change alphaOnlySort
> and its analysis chain from using KeywordTokenizer to using
> WhitespaceTokenizer, and then a way to recombine the tokens at the end. So,
> take "The Grapes of Wrath".
> I needed a way for it to get turned into:
> {noformat}
> grapes of wrath
> {noformat}
> And then to combine those tokens into a single token:
> {noformat}
> grapesofwrath
> {noformat}
> The attached CombiningFilter takes care of that. It doesn't do it super
> efficiently I'm guessing (since I used a StringBuffer), but I'm open to
> suggestions on how to make it better.
> One other thing is that apparently this analyzer works fine for analysis
> (i.e., it produces the desired tokens); however, for sorting in Solr I'm
> getting null sort tokens. Need to figure out why.
> Here ya go!
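
To make the 'connector' request above concrete, here is a rough sketch of the joining step in plain Java, with no Lucene dependency. It is not the CombiningFilter from the attached patches: the class name `TokenJoiner`, the methods `analyze` and `join`, and the one-word stopword set are all hypothetical, and `analyze` only imitates what a whitespace tokenizer, lowercase filter and stop filter would do. It uses a StringBuilder, which is the unsynchronized counterpart of StringBuffer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not the CombiningFilter from the patch.
final class TokenJoiner {

    // Assumed stopword list, chosen only to match the "The Grapes of Wrath" example.
    private static final Set<String> STOPWORDS = Collections.singleton("the");

    // Imitates whitespace tokenization, lowercasing and stopword removal.
    static List<String> analyze(String title) {
        List<String> tokens = new ArrayList<>();
        for (String raw : title.trim().split("\\s+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Joins the tokens into a single string, placing the connector between them.
    static String join(List<String> tokens, String connector) {
        StringBuilder combined = new StringBuilder();
        boolean first = true;
        for (String token : tokens) {
            if (!first) {
                combined.append(connector);
            }
            combined.append(token);
            first = false;
        }
        return combined.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = analyze("The Grapes of Wrath");
        System.out.println(join(tokens, ""));   // prints "grapesofwrath"
        System.out.println(join(tokens, "_"));  // prints "grapes_of_wrath"
        System.out.println(join(tokens, " "));  // prints "grapes of wrath"
    }
}
```

An empty connector gives the behaviour described in the issue; '_' and ' ' cover the two cases from the comment. In a real filter the connector would presumably be a constructor or factory argument, but that is an assumption about a design that has not been written yet.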