Actually, I found your "QueryParser Rules" article the most useful. It explained a number of things that I had puzzled about. Query.toString() helped also.
So, obvious in hindsight, an exact phrase match still goes through the tokenizer. If there are stop words, stemming, or similar analysis steps, you need to tokenize the phrase before trying to get an exact match. Clearly, that has implications for what "exact phrase match" means. The toString() output told me that the quotes are handled by the QueryParser. The WebLucene CJK tokenizer works just fine with it, and I didn't make any changes to it.

The "bad" news is that after going through all of this, the code just started to work as expected. I'm not sure what I did to fix it.

There is a minor issue I found that I think works as documented, but I wonder why it's that way. If you enter a search string that's a hyphenated word such as "fred-bill" (without the quotes), the QueryParser generates a query that finds all documents with fred but without bill. I believe this is expected behavior based on the javadocs. The effect is that a hyphenated word gives unexpected results unless it is surrounded by quotes. Perhaps the syntax should have been "fred -bill" (space before the hyphen required) to indicate that you didn't want bill and that it's not a hyphenated word. Seems a tad more general.

It's an issue for me because my application deals with hyphenated words a lot, and I don't think my users would ever understand when quotes should be used and when they should not (most of them won't figure out how to use the "not" syntax). I can solve it by requiring the user to enter a space before the hyphen if they mean "not" and then having the search code automatically add the quotes for hyphenated words. It's just a little painful. Just a thought for 1.4.
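For what it's worth, here is a minimal sketch of that workaround, assuming the rewrite is applied to the raw user string before it is handed to the QueryParser. The class and method names (HyphenQuoter, quoteHyphenated) are made up for illustration, and the code does not call Lucene at all; it only adds quotes around bare hyphenated words and leaves everything else, including "fred -bill", untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: pre-processes a user's search string.
public class HyphenQuoter {
    // A chunk is either an already-quoted phrase or a run of non-whitespace characters.
    private static final Pattern CHUNK = Pattern.compile("\"[^\"]*\"|\\S+");

    // Letters/digits joined by single internal hyphens, e.g. fred-bill or a-b-c.
    // A leading hyphen ("-bill") does not match, so the "not" syntax is preserved.
    private static final Pattern HYPHENATED =
            Pattern.compile("[\\p{L}\\p{N}]+(?:-[\\p{L}\\p{N}]+)+");

    public static String quoteHyphenated(String userQuery) {
        Matcher m = CHUNK.matcher(userQuery);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String chunk = m.group();
            String replacement = chunk;
            // Only bare hyphenated words are wrapped; quoted phrases and
            // fielded terms such as title:fred-bill are left as they are.
            if (!chunk.startsWith("\"") && HYPHENATED.matcher(chunk).matches()) {
                replacement = "\"" + chunk + "\"";
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(quoteHyphenated("fred-bill"));   // hyphenated word gets quoted
        System.out.println(quoteHyphenated("fred -bill"));  // "not bill" is left alone
    }
}
```

The result would then be passed to the QueryParser with the same analyzer used at index time. Fielded terms and anything more exotic in the query syntax are deliberately not handled here.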
;-)

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 03, 2004 8:26 PM
To: Lucene Users List
Subject: Re: Newbie Phrase Query question

The best suggestion I have is to look at the code in my first java.net article (Intro Lucene) and borrow the Analyzer utility code to see what happens to a sample string as it is analyzed. Then pass that same string to QueryParser (along with the same analyzer) and see what the Query.toString(<default field name>) returns. This should shed light on the issue more clearly.

Erik

On Feb 3, 2004, at 10:01 PM, Scott Smith wrote:

> I'm having problems searching for an exact match with a phrase.
> Essentially, I think my problem is that the tokenizer is tossing the
> double quotes around the phrase, tokenizing each word and so I end up
> with the document hit I want plus several more I don't (the latter
> having some of the words, but not exact matches). Here's the
> specifics.
>
> First, I'm using the CJKTokenizer from WebLucene which I believe is a
> modified version of the stopword tokenizer enhanced to handle asian
> characters (that's according to the header; I don't think the asian
> characters have anything to do with my problem).
>
> The documents I need to search, for reasons related to the
> application, often end up with hyphenated words in critical places.
> For example, the original text to be indexed might be something like
> "this is Bill-Fred".
>
> When this is tokenized initially, I end up with two tokens "bill" and
> "fred" (the tokenizer converts to lower case; "this" and "is" are
> removed as stop words; the hyphen is removed by the tokenizer). So
> far so good.
>
> I pass the phrase I want an exact match on to a QueryParser in quotes
> (so "Bill-Fred" is the search string; quotes included). I watched the
> output of the tokenizer from the query parser and it is clearly
> tossing the double quotes and tokenizing each word separately. It
> passes the words "bill" and "fred" as separate entities back to the
> QueryParser. Looking at the tokenizer code, I understand why.
> Obviously, that's why I end up with documents that contain the words
> even if they are not exact matches.
>
> Here's the question. I can modify the CJKTokenizer so that when it
> sees "Fred-Bill" it creates a single token that looks like "fred bill".
> Would this now work? Is this the right thing to do? I realize this
> means that I'd hit on "Fred-Bill" and "Fred Bill", but I can probably
> live with that.
>
> However, it also seems like I now have a problem if the original text
> contains a quotation from someone that happens to be part of the
> document (i.e., the original text has double quotes in it). It seems
> like I need to ignore quotes for the initial index, but use them to
> build phrases when I'm tokenizing a search string in the QueryParser.
> Do I need two tokenizers?
>
> Does any of this make any sense? I'm not quite sure what the
> QueryParser wants to see to properly do a phrase match. Is
> QueryParser the wrong thing to be using here? Suggestions or
> comments?
>
> Scott
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
