Hi ayyanar,

I should have mentioned in my previous email that the [email protected] mailing list has very few subscribers - you'll get a much better response on the [email protected] mailing list.
On 01/05/2009 at 3:07 PM, ayyanar wrote:

> My objective is to retain the keyword (input stream) as is a token like
> a keyword tokenizer does and also split the keyword by whitespace and
> maintain that tokens as a white space tokenizer does

Right, ShingleFilter won't do this for you. The following, if used to filter WhitespaceTokenizer's output, is similar to what you want (note: untested; also note that this assumes you're using Lucene v2.4.0, and not a recent trunk version, which includes the new TokenStream API introduced with LUCENE-1422: <https://issues.apache.org/jira/browse/LUCENE-1422>):

-----
import java.io.IOException;

import org.apache.lucene.analysis.CachingTokenFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

/**
 * Extends CachingTokenFilter to output a space-separated-
 * concatenated-all-input-stream-terms token, followed by
 * all of the original input stream tokens.
 * One for all and (then) all for one!
 */
public class ThreeMusketeersFilter extends CachingTokenFilter {

  private boolean concatenatedTokenOutput = false;

  public ThreeMusketeersFilter(TokenStream input) {
    super(input);
  }

  public Token next(final Token reusableToken) throws IOException {
    assert reusableToken != null;
    if (concatenatedTokenOutput) {
      return super.next(reusableToken);
    } else {
      concatenatedTokenOutput = true;
      Token firstToken = super.next(reusableToken);
      if (firstToken == null) {
        return null;
      }
      StringBuffer buffer = new StringBuffer();
      // Append only termLength() chars - the term buffer's backing
      // array may be longer than the term itself
      buffer.append(firstToken.termBuffer(), 0, firstToken.termLength());
      int start = firstToken.startOffset();
      int end = firstToken.endOffset();
      for (Token nextToken = super.next(reusableToken)
           ; nextToken != null
           ; nextToken = super.next(reusableToken)) {
        end = nextToken.endOffset();
        buffer.append(' '); // add a space between terms
        buffer.append(nextToken.termBuffer(), 0, nextToken.termLength());
      }
      reusableToken.clear();
      reusableToken.resizeTermBuffer(buffer.length());
      reusableToken.setTermLength(buffer.length());
      buffer.getChars(0, buffer.length(), reusableToken.termBuffer(), 0);
      reusableToken.setStartOffset(start);
      reusableToken.setEndOffset(end);
      super.reset(); // Rewind input stream to get the individual tokens
      return reusableToken;
    }
  }

  public void reset() throws IOException {
    super.reset();
    concatenatedTokenOutput = false;
  }
}
-----
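For completeness, here's a sketch (untested, and assuming the same Lucene v2.4.0 API) of how you might wire the filter into an Analyzer - the class name ThreeMusketeersAnalyzer is just made up for illustration:

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

/**
 * Hypothetical wiring: ThreeMusketeersFilter over WhitespaceTokenizer,
 * so each field value yields the concatenated "keyword" token first,
 * then the individual whitespace-separated tokens.
 */
public class ThreeMusketeersAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new ThreeMusketeersFilter(new WhitespaceTokenizer(reader));
  }
}
```

If it works as intended, analyzing "foo bar baz" should produce the token "foo bar baz" followed by "foo", "bar", and "baz" - the keyword-tokenizer-style output and the whitespace-tokenizer-style output you described.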
