I'm having problems searching for an exact match with a phrase. Essentially, I think my problem is that the tokenizer is tossing the double quotes around the phrase, tokenizing each word and so I end up with the document hit I want plus several more I don't (the latter having some of the words, but not exact matches). Here's the specifics.
First, I'm using the CJKTokenizer from WebLucene which I believe is a modified version of the stopword tokenizer enhanced to handle asian characters (that's according to the header; I don't think the asian characters have anything to do with my problem). The documents I need to search, for reasons related to the application, often end up with hyphenated words in critical places. For example, the original text to be indexed might be something like "this is Bill-Fred". When this is tokenized initially, I end up with two tokens "bill" and "fred" (the tokenizer converts to lower case; "this" and "is" are removed as stop words; the hyphen is removed by the tokenizer). So far so good. I pass the phrase I want an exact match on to a QueryParser in quotes (so "Bill-Fred" is the search string; quotes included). I watched the output of the tokenizer from the query parser and it is clearly tossing the double quotes and tokenizing each word separately. It passes the words "bill" and "fred" as separate entities back to the QueryParser. Looking at the tokenizer code, I understand why. Obviously, that's why I end up with documents that contain the words even if they are not exact matches. Here's the question. I can modify the CJKTokenizer so that when it sees "Fred-Bill" it creates a single token that looks like "fred bill". Would this now work? Is this the right thing to do? I realize this means that I'd hit on "Fred-Bill" and "Fred Bill", but I can probably live with that. However, it also seems like I now have a problem if the original text contains a quotation from someone that happens to be part of the document (i.e., the original text has double quotes in it). It seems like I need to ignore quotes for the initial index, but use them to build phrases when I'm tokenizing a search string in the QueryParser. Do I need two tokenizers? Does any of this make any sense? I'm not quite sure what the QueryParser wants to see to properly do a phrase match. Is QueryParser the wrong thing to be using here? Suggestions or comments? Scott --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
