You could do a while loop over a StringTokenizer: new StringTokenizer(varToTokenize, " "); this gives you the tokens of the string, split on spaces.
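In case it helps, a minimal sketch of that suggestion (class, method, and variable names here are mine, not from the thread):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class Tokenize {
    // Collect the space-separated words of a string using a while loop
    // over java.util.StringTokenizer, as suggested above.
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(input, " ");
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("quick brown fox")); // [quick, brown, fox]
    }
}
```

Note that StringTokenizer skips empty tokens between consecutive delimiters, which is usually what you want for whitespace splitting.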
jOliveira

On Mon, Jan 5, 2009 at 7:58 PM, Steven A Rowe <[email protected]> wrote:
> Hi ayyanar,
>
> I should have mentioned in my previous email that the
> [email protected] mailing list has very few subscribers - you'll
> get much better response on the [email protected] mailing list.
>
> On 01/05/2009 at 3:07 PM, ayyanar wrote:
> > My objective is to retain the keyword (input stream) as is a token like
> > a keyword tokenizer does and also split the keyword by whitespace and
> > maintain that tokens as a white space tokenizer does
>
> Right, ShingleFilter won't do this for you.
>
> The following, if used to filter WhitespaceTokenizer's output, is similar
> to what you want (note: untested, and also note that this assumes you're
> using Lucene v2.4.0, and not a recent trunk version, which includes the new
> TokenStream API introduced with LUCENE-1422:
> <https://issues.apache.org/jira/browse/LUCENE-1422>):
>
> -----
>
> /**
>  * Extends CachingTokenFilter to output a space-separated-
>  * concatenated-all-input-stream-terms token, followed by
>  * all of the original input stream tokens.
>  * One for all and (then) all for one!
>  */
> public class ThreeMusketeersFilter extends CachingTokenFilter {
>
>   private boolean concatenatedTokenOutput = false;
>
>   public ThreeMusketeersFilter(TokenStream input) {
>     super(input);
>   }
>
>   public Token next(final Token reusableToken) throws IOException {
>     assert reusableToken != null;
>     if (concatenatedTokenOutput) {
>       return super.next(reusableToken);
>     } else {
>       concatenatedTokenOutput = true;
>       Token firstToken = super.next(reusableToken);
>       if (firstToken == null) {
>         return null;
>       }
>       StringBuffer buffer = new StringBuffer();
>       // append only termLength() chars - termBuffer() may be oversized
>       buffer.append(firstToken.termBuffer(), 0, firstToken.termLength());
>       int start = firstToken.startOffset();
>       int end = firstToken.endOffset();
>       for (Token nextToken = super.next(reusableToken) ;
>            nextToken != null ;
>            nextToken = super.next(reusableToken)) {
>         end = nextToken.endOffset();
>         buffer.append(' '); // add a space between terms
>         buffer.append(nextToken.termBuffer(), 0, nextToken.termLength());
>       }
>       reusableToken.clear();
>       reusableToken.resizeTermBuffer(buffer.length());
>       reusableToken.setTermLength(buffer.length());
>       buffer.getChars(0, buffer.length(), reusableToken.termBuffer(), 0);
>       reusableToken.setStartOffset(start);
>       reusableToken.setEndOffset(end);
>       super.reset(); // Rewind input stream to get the individual tokens
>       return reusableToken;
>     }
>   }
>
>   public void reset() throws IOException {
>     super.reset();
>     concatenatedTokenOutput = false;
>   }
> }

--
Saludos

Julio Oliveira - Buenos Aires
[email protected]

http://www.linkedin.com/in/juliomoliveira
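For reference, the intended output shape of the filter quoted above (the concatenated token first, then each original token) can be illustrated without any Lucene dependency. This plain-Java sketch is mine, not part of the thread:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ThreeMusketeersSketch {
    // Given the whitespace tokens of an input, emit the space-joined
    // concatenation first ("one for all"), then every original token
    // ("all for one") - the same token sequence the filter above produces.
    public static List<String> allThenEach(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        if (tokens.isEmpty()) {
            return out; // no input tokens -> no output tokens
        }
        out.add(String.join(" ", tokens));
        out.addAll(tokens);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(allThenEach(Arrays.asList("quick", "brown", "fox")));
        // [quick brown fox, quick, brown, fox]
    }
}
```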
