While it is possible to alter the StandardAnalyzer, depending on more details of your source text, it may be better to use a different analyzer or make your own. The StandardAnalyzer is quite slow if you do not need all of its features, and modifying it will make it harder to keep up with bug fixes or improvements.

That said, StandardAnalyzer does split on commas, so you might want to check into whats really going on.

I suspect that 'word1,word2,word3,word4,word5' is being recognized as a NUM by StandardAnalzyer. A NUM match will keep a comma deliminated list intact as long as every other word contains a digit.

You might alter the <#P regular expression in StandardAnalyzer.jj by taking out the ','. This will take out certain matches (like the match your getting <g>), but will stop screwing up your matches.

- Mark

Jeff wrote:
I have documents with lots of text. Part of the text is in the following
format:

word1,word2,word3,word4,word5

I am currently using the StandardAnalyzer and everything is working great
with the other data, except I can't query for 'word3' as a ',' isn't a token
seperator. Is there an easy way to add ',' as a token seperator?

Thanks,

-Jeff


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to