On 02/11/2011 17:15, Uwe Schindler wrote:
Hi Paul,
There is WordDelimiterFilter, which does exactly what you want. In 3.x it's
unfortunately only shipped in the Solr JAR file, but in 4.0 it's in the
analyzers-common module.
Okay, so I found it and it looks very interesting, but it is really overly
complex for what I want to do and doesn't handle my specific case. Could
anyone possibly give a code example of how I create two tokens from one?
Assume I already know how to split it (I can't work that bit out).
    public final boolean incrementToken() throws java.io.IOException {
        if (!input.incrementToken()) {
            return false;
        }
        char[] buffer = termAtt.buffer();
        final int bufferLength = termAtt.length();
        final String type = typeAtt.type();
        // equals() rather than ==, so the check does not depend on both
        // strings being the same instance
        if (ALPHANUMANDPUNCTUATION.equals(type)) {
            int upto = 0;
            for (int i = 0; i < bufferLength; i++) {
                char c = buffer[i];
                if (!Character.isLetterOrDigit(c)) {
                    //TODO PUT ALL CHARS AFTER THIS INTO A NEW TOKEN
                } else {
                    buffer[upto++] = c;
                }
            }
            termAtt.setLength(upto);
        }
        return true;
    }
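
One way to fill in the TODO above might be to buffer the extra pieces and
hand them out on later calls to incrementToken(). The sketch below shows only
that pattern, in plain Java with no Lucene types, so it is an illustration
under assumptions rather than a drop-in filter: SplittingFilterSketch, term()
and splitInto() are hypothetical names, an Iterator<String> stands in for the
upstream TokenStream, and a String field stands in for termAtt. It does not
model token types; instead it assumes that a token with no letters or digits
at all should be passed through unchanged.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;

/** Hypothetical stand-in for a TokenFilter; uses no Lucene classes. */
class SplittingFilterSketch {
    private final Iterator<String> input;                      // stands in for the upstream TokenStream
    private final Deque<String> pending = new ArrayDeque<>();  // pieces not yet emitted
    private String term;                                       // stands in for termAtt

    SplittingFilterSketch(Iterator<String> input) {
        this.input = input;
    }

    /** The current token text, valid after incrementToken() returned true. */
    String term() {
        return term;
    }

    /** Same contract as in the filter above: true if a token is available. */
    boolean incrementToken() {
        // Emit pieces left over from an earlier input token first.
        if (!pending.isEmpty()) {
            term = pending.poll();
            return true;
        }
        if (!input.hasNext()) {
            return false;
        }
        String token = input.next();
        splitInto(token, pending);
        // Assumption: a token with no letters or digits is kept as it is.
        term = pending.isEmpty() ? token : pending.poll();
        return true;
    }

    /** Appends each maximal run of letters/digits in token to out. */
    private static void splitInto(String token, Deque<String> out) {
        StringBuilder piece = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                piece.append(c);
            } else if (piece.length() > 0) {
                out.add(piece.toString());
                piece.setLength(0);
            }
        }
        if (piece.length() > 0) {
            out.add(piece.toString());
        }
    }

    public static void main(String[] args) {
        SplittingFilterSketch filter = new SplittingFilterSketch(
                Arrays.asList("this-stuff", "!!!", "abc").iterator());
        while (filter.incrementToken()) {
            System.out.println(filter.term());
        }
    }
}
```

In a real TokenFilter the queue would be a field of the filter, and each
buffered piece would presumably be written into the term attribute with
termAtt.setEmpty().append(piece). Offsets and position increments for the
extra tokens are not handled in this sketch.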
-----Original Message-----
From: Paul Taylor [mailto:paul_t...@fastmail.fm]
Sent: Wednesday, November 02, 2011 5:12 PM
To: java-user@lucene.apache.org
Subject: Creating additional tokens from input in a token filter
I have a token filter that takes tokens and drops any non-alphanumeric
characters,
e.g. 'this-stuff' becomes 'thisstuff',
but what I actually want it to do is split the one token into multiple
tokens, using the non-alphanumeric characters as word boundaries,
e.g. 'this-stuff' becomes 'this stuff'.
How do I do this?
Thanks, Paul
(You may be wondering why I didn't just filter out these characters at the
tokenizer stage, but I had to keep them in to solve another problem: they
needed to be kept for 'words' that consist only of non-alphanumeric
characters.)
This is my existing class:
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class MusicbrainzTokenizerFilter extends TokenFilter {

    private static final String ALPHANUMANDPUNCTUATION =
        MusicbrainzTokenizer.TOKEN_TYPES[MusicbrainzTokenizer.ALPHANUMANDPUNCTUATION];

    // this filter uses the type attribute to decide which tokens to rewrite
    private final TypeAttribute typeAtt;
    private final CharTermAttribute termAtt;

    /**
     * Constructs a filter over the token stream <i>in</i>.
     */
    public MusicbrainzTokenizerFilter(TokenStream in) {
        super(in);
        termAtt = addAttribute(CharTermAttribute.class);
        typeAtt = addAttribute(TypeAttribute.class);
    }

    /**
     * Advances to the next token in the stream; returns false at EOS.
     * <p>For tokens of type ALPHANUMANDPUNCTUATION, drops every character
     * that is not a letter or digit.
     */
    @Override
    public final boolean incrementToken() throws java.io.IOException {
        if (!input.incrementToken()) {
            return false;
        }
        char[] buffer = termAtt.buffer();
        final int bufferLength = termAtt.length();
        final String type = typeAtt.type();
        // remove non-alphanumerics; equals() rather than ==, so the check
        // does not depend on both strings being the same instance
        if (ALPHANUMANDPUNCTUATION.equals(type)) {
            int upto = 0;
            for (int i = 0; i < bufferLength; i++) {
                char c = buffer[i];
                if (Character.isLetterOrDigit(c)) {
                    // keep the character, compacting the buffer in place
                    buffer[upto++] = c;
                }
                // otherwise do nothing (drop the character)
            }
            termAtt.setLength(upto);
        }
        return true;
    }
}
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org