Just an idea about what Leo said yesterday.
> what about bi/tri-grams + some sort of hit filtering? It will do the
> job. I just saw some ineffective implementation of 1-grams for CJK on
> [EMAIL PROTECTED] It could be a good starting point for full n-gram
> support... Just a thought.
A change in the AliasFilter seemed to work:
    private void addAliasesToStack(Token token, Stack aliasStack) {
        if (token == null) return;
        String tokenString = token.termText();
        String tokenSubString = "";
        // --- from here ---
        // Slide a 3-character window over the term and collect each
        // tri-gram, separated by a space. Terms shorter than three
        // characters produce no tri-grams at all.
        int x = 0;
        while (tokenString.length() > x + 2) {
            tokenSubString += tokenString.substring(x, x + 3);
            tokenSubString += " ";
            x++;
        }
        // --- to here ---
        //System.out.println("SUBSTRING ELEMENTS: " + tokenSubString);
        // Split the collected tri-grams again and push each one onto the
        // alias stack as its own token.
        StringTokenizer tokenizer = new StringTokenizer(tokenSubString, " ");
        while (tokenizer.hasMoreElements()) {
            String nextAlias = tokenizer.nextToken();
            Token nextTokenAlias = new Token(nextAlias, 0, nextAlias.length());
            aliasStack.push(nextTokenAlias);
        }
    }
This snippet creates overlapping tri-grams: for example, "lucene" becomes
"luc", "uce", "cen" and "ene". I don't know whether this is of any use,
though; it is only a notion.
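
In case it helps to try the idea outside of Lucene, here is a minimal
standalone sketch of the same sliding-window logic. The class name
NGramSketch and the method ngrams are made up for illustration and are not
part of Lucene or of the AliasFilter; the window size is a parameter, so
the same loop gives bi-grams with n = 2.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: overlapping character n-grams.
public class NGramSketch {

    // Returns every substring of length n, moving one character at a time.
    // A null term, a non-positive n, or a term shorter than n gives an
    // empty list.
    static List<String> ngrams(String term, int n) {
        List<String> grams = new ArrayList<>();
        if (term == null || n <= 0) {
            return grams;
        }
        for (int i = 0; i + n <= term.length(); i++) {
            grams.add(term.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("search", 3)); // prints [sea, ear, arc, rch]
        System.out.println(ngrams("search", 2)); // prints [se, ea, ar, rc, ch]
    }
}
```

Collecting the grams in a list avoids the detour through a space-separated
string and a StringTokenizer, which would also split terms that already
contain spaces.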
Best regards,
René Hackl