Hi Simon, I'd love to see a ConcatFilter and factory find a permanent home as part of the stable of standard filters. But perhaps for the Automaton function it'd need to be packaged differently?
--
Mark Bennett / New Idea Engineering, Inc. / [email protected]
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513

On Thu, Nov 1, 2012 at 12:33 PM, Simon Willnauer <[email protected]> wrote:

> I used "combine" filters before too. I think there is a use case for
> this stuff; we do similar things in suggesters with
> TokenStreamToAutomaton and finite strings. That is really the same
> kind of thing though. Maybe we can wrap it in a TokenStream and emit
> the finite paths as synonyms, i.e. on the same position?
>
> simon
>
> On Thu, Nov 1, 2012 at 8:16 PM, Uwe Schindler <[email protected]> wrote:
> > Hi Otis,
> >
> > One use case I had for a similar filter for a customer was some ngramming
> > approach. The tokenization before was there to create “normalized” tokens,
> > which were then glued together (with or w/o whitespace) and ngrammed
> > (meaning several ngram tokens were created from the glued-together thingie).
> >
> > Uwe
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: [email protected]
> >
> > From: Otis Gospodnetic [mailto:[email protected]]
> > Sent: Thursday, November 01, 2012 8:01 PM
> > To: [email protected]
> > Subject: Re: Posting updated ConcatFilter code, using 4.0.0 compatible classes
> >
> > Hi Mark,
> >
> > Out of curiosity, what was your use case?
> >
> > Thanks,
> > Otis
> >
> > --
> > Search Analytics - http://sematext.com/search-analytics/index.html
> > Performance Monitoring - http://sematext.com/spm/index.html
> >
> > On Wed, Oct 31, 2012 at 10:56 PM, Mark Bennett <[email protected]> wrote:
> >
> > This filter lets you "glue" tokens back together. This has been discussed
> > and posted on the list before, but this updated version uses all the
> > preferred 4.x classes.
> >
> > Normally you wouldn't want to stick tokens back together, but if you've
> > found this post, you probably have some atypical need for it (as I did).
> > As an example you could:
> > * Let the tokenizer break up text on whitespace
> > * Then lowercase
> > * Then remove stop words
> > * ***Then concatenate all the words back together into one string***
> >
> > You'll need:
> > * ConcatFilter.java (for lucene, below)
> > * ConcatFilterFactory.java (for solr, below)
> > * entry in your schema
> >
> > schema.xml entry
> > ----------
> > ...
> > <fieldType ...>
> >   <analyzer>
> >     ...
> >     <filter class="solr.ConcatFilterFactory" />
> >     ...
> >   </analyzer>
> > </fieldType>
> > ...
> >
> > ConcatFilter.java
> > -----------------
> > package org.apache.lucene.analysis;
> >
> > import java.io.IOException;
> > import org.apache.lucene.analysis.TokenFilter;
> > import org.apache.lucene.analysis.TokenStream;
> > import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> >
> > public class ConcatFilter extends TokenFilter {
> >   protected CharTermAttribute charTermAttr;
> >   // True once the upstream tokens have been consumed and glued together
> >   private boolean done = false;
> >
> >   public ConcatFilter(TokenStream input) {
> >     super(input);
> >     charTermAttr = addAttribute( CharTermAttribute.class );
> >   }
> >
> >   @Override
> >   public final boolean incrementToken() throws IOException {
> >     // Emit at most one token; don't call input.incrementToken() again
> >     // after it has already returned false
> >     if ( done ) {
> >       return false;
> >     }
> >     done = true;
> >     StringBuilder buffer = new StringBuilder();
> >     while( input.incrementToken() ) {
> >       buffer.append( charTermAttr );
> >     }
> >     // We need to clear it either way
> >     charTermAttr.setEmpty();
> >     if ( buffer.length() > 0 ) {
> >       charTermAttr.append( buffer );
> >       return true;
> >     }
> >     else {
> >       return false;
> >     }
> >   }
> >
> >   @Override
> >   public void reset() throws IOException {
> >     super.reset();
> >     done = false;
> >   }
> > }
> >
> > ConcatFilterFactory.java
> > ------------------------
> > package org.apache.solr.analysis;
> >
> > // ConcatFilter lives in a different package, so it must be imported
> > import org.apache.lucene.analysis.ConcatFilter;
> > import org.apache.lucene.analysis.TokenStream;
> > import org.apache.lucene.analysis.util.TokenFilterFactory;
> >
> > public class ConcatFilterFactory extends TokenFilterFactory {
> >   @Override
> >   public TokenStream create(TokenStream stream) {
> >     return new ConcatFilter(stream);
> >   }
> > }
> >
> > --
> > Mark Bennett / New Idea Engineering, Inc. / [email protected]
> > Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
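
For anyone who wants to see what the steps discussed in this thread produce without setting up Lucene or Solr, here is a minimal plain-Java sketch. It is not the ConcatFilter itself and uses no Lucene classes. The class name ConcatSketch, the stop-word list, the glue parameter and the ngrams helper are all hypothetical, made up for illustration. It only mimics the idea: split on whitespace, lowercase, drop stop words, glue the rest together, and optionally cut the glued string into character ngrams as in Uwe's use case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class ConcatSketch {
  // Hypothetical stop-word list, for illustration only
  static final Set<String> STOP_WORDS =
      new HashSet<>(Arrays.asList("the", "a", "of"));

  // Split on whitespace, lowercase, drop stop words, then join what is left.
  // Pass "" as glue to concatenate without a separator, or " " to keep spaces.
  static String concat(String text, String glue) {
    StringBuilder out = new StringBuilder();
    for (String raw : text.trim().split("\\s+")) {
      String token = raw.toLowerCase(Locale.ROOT);
      if (token.isEmpty() || STOP_WORDS.contains(token)) {
        continue;
      }
      if (out.length() > 0) {
        out.append(glue);
      }
      out.append(token);
    }
    return out.toString();
  }

  // Character ngrams of length n over the glued string
  static List<String> ngrams(String s, int n) {
    List<String> grams = new ArrayList<>();
    for (int i = 0; i + n <= s.length(); i++) {
      grams.add(s.substring(i, i + n));
    }
    return grams;
  }

  public static void main(String[] args) {
    String glued = concat("The Quick Brown Fox", "");
    System.out.println(glued);            // quickbrownfox
    System.out.println(ngrams(glued, 3)); // starts with qui, ends with fox
  }
}
```

In a real analysis chain each of these steps would be its own tokenizer or filter, with the concatenation done last, as in the schema.xml entry quoted above.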
