This filter lets you "glue" tokens back together. This has been discussed
and posted on the list before, but this updated version uses all the
preferred 4.x classes.
Normally you wouldn't want to stick tokens back together, but if you've
found this post, you probably have some atypical need for it (as I did).
As an example, you could:
* Let the tokenizer break up text on whitespace
* Then lowercase
* Then remove stop words
* ***then concatenate all the words back together into one string***
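To make the end result of that chain concrete, here is a rough plain-Java sketch of the four steps. It does not use any Lucene classes; the class name ConcatSketch, the method analyze and the tiny stop-word list are made up for illustration, and a real Solr analysis chain may differ in details such as how stop words are configured.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration only: no Lucene classes are used here.
final class ConcatSketch {

    // Hypothetical stop-word list, chosen just for this example.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of"));

    static String analyze(String text) {
        StringBuilder buffer = new StringBuilder();
        // 1. Break the text up on whitespace.
        for (String token : text.trim().split("\\s+")) {
            // 2. Lowercase each token.
            String lower = token.toLowerCase(Locale.ROOT);
            // 3. Drop stop words (and the empty token produced by blank input).
            if (lower.isEmpty() || STOP_WORDS.contains(lower)) {
                continue;
            }
            // 4. Glue the surviving tokens together with no separator.
            buffer.append(lower);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        // prints "quickbrownfox"
        System.out.println(analyze("The Quick Brown Fox"));
    }
}
```

The point to notice is that the whole input ends up as a single term, so the field behaves like one normalized string rather than a bag of words.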
You'll need:
* ConcatFilter.java (for Lucene, below)
* ConcatFilterFactory.java (for Solr, below)
* an entry in your schema.xml
schema.xml entry
----------------
...
<fieldType ...>
  <analyzer>
    ...
    <filter class="solr.ConcatFilterFactory" />
    ...
  </analyzer>
</fieldType>
...
ConcatFilter.java
-----------------
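package org.apache.lucene.analysis;

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/** Concatenates every token from the upstream stream into a single token. */
public class ConcatFilter extends TokenFilter {
  protected CharTermAttribute charTermAttr;
  // True once the upstream stream has been drained, so that we never call
  // input.incrementToken() again after it has already returned false.
  private boolean done = false;

  public ConcatFilter(TokenStream input) {
    super(input);
    charTermAttr = addAttribute( CharTermAttribute.class );
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    done = false;  // allow the filter to be reused for the next field value
  }

  @Override
  public boolean incrementToken() throws IOException {
    if ( done ) {
      return false;
    }
    done = true;
    // Drain the upstream tokens, appending each term to the buffer
    StringBuilder buffer = new StringBuilder();
    while( input.incrementToken() ) {
      buffer.append( charTermAttr );
    }
    // We need to clear it either way
    charTermAttr.setEmpty();
    if ( buffer.length() > 0 ) {
      charTermAttr.append( buffer );
      return true;
    }
    else {
      return false;
    }
  }
}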
ConcatFilterFactory.java
------------------------
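package org.apache.solr.analysis;

// ConcatFilter lives in a different package, so it has to be imported
import org.apache.lucene.analysis.ConcatFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.TokenFilterFactory;

public class ConcatFilterFactory extends TokenFilterFactory {
  // Note (check against your version): as far as I know, from Lucene/Solr 4.4
  // on, factories also need a constructor taking Map<String,String> args
  // that calls super(args).
  @Override
  public TokenStream create(TokenStream stream) {
    return new ConcatFilter(stream);
  }
}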
--
Mark Bennett / New Idea Engineering, Inc. / [email protected]
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513