On Nov 30, 2004, at 2:29 PM, Ricardo Lopes wrote:
> My guess is that your analyzer is what did the splitting

> After looking more closely at the code, I found that the tokenStream method in BrazilianAnalyzer calls StandardTokenizer, and it is this that splits the search string. Is there a simple way to subclass the tokenizer to avoid splitting on those characters, or do I have to make a custom implementation of that class?

You can verify this by using the AnalysisDemo referenced here:

        http://wiki.apache.org/jakarta-lucene/AnalysisParalysis

Or use Luke - http://www.getopt.org/luke/ - which has a nice plugin page that can do this type of analysis inspection (you'll have to add the sandbox analyzer JAR to the classpath when launching Luke).

As for subclassing StandardTokenizer - no, you won't have much luck there. StandardTokenizer is a JavaCC-based tokenizer and is not designed for subclassing to control this sort of thing.
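If you do need different token boundaries, the usual route is a custom tokenizer. As a rough standalone sketch of the idea behind a CharTokenizer-style implementation (plain Java, not the actual Lucene API; the class name and the hyphen rule here are illustrative assumptions), a tokenizer that treats '-' as a token character would keep a hyphenated word like "guarda-chuva" as one token:

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of a CharTokenizer-style tokenizer: characters for
// which isTokenChar() returns true are accumulated into tokens; any other
// character is a token boundary. Keeping '-' as a token character stops
// hyphenated words from being split.
public class KeepHyphenTokenizer {
    static boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c) || c == '-';
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (isTokenChar(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // hit a boundary character: emit the token built so far
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("o guarda-chuva azul"));
        // [o, guarda-chuva, azul]
    }
}
```

Remember that if you change the tokenizer, the same one must be used for both indexing and searching, or query terms will no longer line up with indexed terms.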

> This only happens when I run a search (during indexing, the splitting of those characters doesn't happen).

Are you sure that splitting is not happening during indexing? If the AnalysisDemo (or Luke) run on your string splits then it is splitting at indexing time too. Keep in mind that looking at a field's value is showing you the stored *original* value, not the tokenized values.


> I thought it had to do with the QueryParser, but it seems the problem is with StandardTokenizer.

I'm not sure - I haven't tried that string with the analyzer you provided. If it was with StandardTokenizer and you're using the same analyzer for indexing and searching, you'd have the values split in both places - which is actually fine as searches would match what was indexed :)
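To see why matching splits at both ends are harmless: whatever the tokenizer does, applying the same analysis to the indexed text and to the query string yields the same token sequence, so the split query terms still match the split index terms. A rough illustration (plain Java, not the Lucene classes; the lowercase-and-split-on-non-letters rule is only a crude stand-in for what StandardTokenizer does on simple text):

```java
import java.util.Arrays;
import java.util.List;

// Sketch: the same analysis applied at index time and at query time.
// If "guarda-chuva" is split into [guarda, chuva] when indexing, the
// query "guarda-chuva" is split the same way, so the terms still match.
public class SameAnalyzerBothSides {
    // Crude stand-in for an analyzer: lowercase, then split on any run
    // of non-letter characters (\p{L} = Unicode letter).
    static List<String> analyze(String text) {
        return Arrays.asList(text.toLowerCase().split("[^\\p{L}]+"));
    }

    public static void main(String[] args) {
        List<String> indexed = analyze("guarda-chuva");
        List<String> query   = analyze("guarda-chuva");
        System.out.println(indexed.equals(query)); // same tokens both sides
    }
}
```

Problems only appear when index-time and query-time analysis disagree, which is why using one analyzer for both is the standard advice.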


        Erik




