I tend to agree with Mark. I tried a query like so...

    TermQuery query = new TermQuery(new Term("keywordField", "phrase test"));
    IndexSearcher searcher = new IndexSearcher(activeIdx);
    Hits hits = searcher.search(query);

And this produced the expected results. When building the index, I did NOT enclose the keywords in quotes -- just added them as UN_TOKENIZED.

Philip


Mark Miller-5 wrote:
>
> I think if he wants to use the queryparser to parse his search strings
> that he has no choice but to modify it. It will eat any pair of quotes
> going through it no matter what analyzer is used.
>
> - Mark
>
>> Well, you're flying blind. Is the behavior rooted in the indexing or
>> querying? Since you can't answer that, you're reduced to trying random
>> things hoping that one of them works. A little like voodoo. I've wasted
>> faaaaarrrrrr too much time trying to solve what I was *sure* was the
>> problem only to find it was somewhere else (the last place I look, of
>> course) <G>...
>>
>> Using Luke on a RAMDir. No, I don't know how to, but it should be a
>> simple thing to write the index to an FSDir at the same time you create
>> your RAMDir and use Luke then. This is debugging, after all.
>>
>> I'd be really, really, really reluctant to modify the query parser and/or
>> the tokenizer, since whenever I've been tempted it's usually because I
>> don't understand the tools already provided. Then I have to maintain my
>> custom code. Which sucks. Although it sure feels more productive to hack
>> a bunch of code and get something that works 90% of the time, then spend
>> weeks making the other 10% work than taking two days to find the 3 lines
>> you *really* need <G>.
>>
>> Have you thought of a PatternAnalyzer?
>> It takes a regular expression as the tokenizer and (from the Javadoc)
>> <<< Efficient Lucene analyzer/tokenizer that preferably operates on a
>> String rather than a
>> Reader<http://java.sun.com/j2se/1.4/docs/api/java/io/Reader.html>,
>> that can flexibly separate text into terms via a regular expression
>> Pattern<http://java.sun.com/j2se/1.4/docs/api/java/util/regex/Pattern.html>
>> (with behaviour identical to
>> String.split(String)<http://java.sun.com/j2se/1.4/docs/api/java/lang/String.html#split%28java.lang.String%29>),
>> and that combines the functionality of LetterTokenizer,
>> LowerCaseTokenizer, WhitespaceTokenizer, StopFilter into a single
>> efficient multi-purpose class.>>>
>>
>> One word of caution, the regular expression consists of expressions that
>> *break* tokens, not expressions that *form* words, which threw me at
>> first. Just like the doc says, like String.split <G>.... This is in 2.0,
>> although I *believe* it's also in the contrib section of 1.9 (or is in
>> the regular API, I forget).
>>
>> Best
>> Erick
>>
>> On 9/1/06, Philip Brown <[EMAIL PROTECTED]> wrote:
>>>
>>> No, I've never used Luke. Is there an easy way to examine my
>>> RAMDirectory index? I can create the index with no quoted keywords, and
>>> when I search for a keyword, I get back the expected results (just can't
>>> search for a phrase that has whitespace in it). If I create the index
>>> with phrases in quotes, then when I search for anything in double
>>> quotes, I get back nothing.
>>> If I create the index with everything in quotes, then when I
>>> search for anything by the keyword field, I get nothing, regardless of
>>> whether I use quotes in the query string or not. (I can get results
>>> back by searching on other fields.) What do you think?
>>>
>>> Philip
>>>
>>>
>>> Erick Erickson wrote:
>>> >
>>> > OK, I've gotta ask. Have you examined your index with Luke to see if
>>> > what you *think* is in the index actually *is*???
>>> >
>>> > Erick
>>> >
>>> > On 9/1/06, Philip Brown <[EMAIL PROTECTED]> wrote:
>>> >>
>>> >> Interesting...just ran a test where I put double quotes around
>>> >> everything (including single keywords) of source text and then ran
>>> >> searches for a known keyword with and without double quotes --
>>> >> doesn't find it either time.
>>> >>
>>> >>
>>> >> Mark Miller-5 wrote:
>>> >> >
>>> >> > Sorry to hear you're having trouble. You indeed need the double
>>> >> > quotes in the source text. You will also need them in the query
>>> >> > string. Make sure they are in both places. My machine is hosed
>>> >> > right now or I would do it for you real quick. My guess is that I
>>> >> > forgot to mention...not only do you need to add the <QUOTED>
>>> >> > definition to the TOKEN section, but below that you will find the
>>> >> > grammar...you need to add <QUOTED> to the grammar. If you look at
>>> >> > how <NUM> and <APOSTROPHE> are done you will prob see what you
>>> >> > should do. If not, my machine should be back up tomorrow...
>>> >> >
>>> >> > - Mark
>>> >> >
>>> >> > On 9/1/06, Philip Brown <[EMAIL PROTECTED]> wrote:
>>> >> >>
>>> >> >> Well, I tried that, and it doesn't seem to work still. I would be
>>> >> >> happy to zip up the new files, so you can see what I'm using --
>>> >> >> maybe you can get it to work.
>>> >> >> The first time, I tried building the documents without quotes
>>> >> >> surrounding each phrase. Then, I retried by enclosing every phrase
>>> >> >> within double quotes. Neither seemed to work. When constructing
>>> >> >> the query string for the search, I always added the double quotes
>>> >> >> (otherwise, it'd think it was multiple terms). (I didn't even test
>>> >> >> the underscore and hyphenated terms.) I thought Lucene was (sort
>>> >> >> of by default) set up to search quoted phrases. From
>>> >> >> http://lucene.apache.org/java/docs/api/index.html --> A Phrase is
>>> >> >> a group of words surrounded by double quotes such as "hello
>>> >> >> dolly". So, this should be easy, right? I must be missing
>>> >> >> something stupid.
>>> >> >>
>>> >> >> Thanks,
>>> >> >>
>>> >> >> Philip
>>> >> >>
>>> >> >>
>>> >> >> Mark Miller-5 wrote:
>>> >> >> >
>>> >> >> > So this will recognize anything in quotes as a single token and
>>> >> >> > '_' and '-' will not break up words. There may be some
>>> >> >> > repercussions for the NUM token but nothing I'd worry about.
>>> >> >> > maybe you want to use Unicode for '-' and '_' as well...I
>>> >> >> > wouldn't worry about it myself.
>>> >> >> >
>>> >> >> > - Mark
>>> >> >> >
>>> >> >> >
>>> >> >> > TOKEN : { // token patterns
>>> >> >> >
>>> >> >> >   // basic word: a sequence of digits & letters
>>> >> >> >   <ALPHANUM: (<LETTER>|<DIGIT>|<KOREAN>)+ >
>>> >> >> >
>>> >> >> > | <QUOTED: "\"" (~["\""])+ "\"">
>>> >> >> >
>>> >> >> >   // internal apostrophes: O'Reilly, you're, O'Reilly's
>>> >> >> >   // use a post-filter to remove possesives
>>> >> >> > | <APOSTROPHE: <ALPHA> ("'" <ALPHA>)+ >
>>> >> >> >
>>> >> >> >   // acronyms: U.S.A., I.B.M., etc.
>>> >> >> >   // use a post-filter to remove dots
>>> >> >> > | <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
>>> >> >> >
>>> >> >> >   // company names like AT&T and [EMAIL PROTECTED]
>>> >> >> > | <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >
>>> >> >> >
>>> >> >> >   // email addresses
>>> >> >> > | <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM>
>>> >> >> >           (("."|"-") <ALPHANUM>)+ >
>>> >> >> >
>>> >> >> >   // hostname
>>> >> >> > | <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
>>> >> >> >
>>> >> >> >   // floating point, serial, model numbers, ip addresses, etc.
>>> >> >> >   // every other segment must have at least one digit
>>> >> >> > | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
>>> >> >> >        | <HAS_DIGIT> <P> <ALPHANUM>
>>> >> >> >        | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>>> >> >> >        | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>>> >> >> >        | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>>> >> >> >        | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>>> >> >> >         )
>>> >> >> >   >
>>> >> >> > | <#P: ("_"|"-"|"/"|"."|",") >
>>> >> >> > | <#HAS_DIGIT: // at least one digit
>>> >> >> >     (<LETTER>|<DIGIT>)*
>>> >> >> >     <DIGIT>
>>> >> >> >     (<LETTER>|<DIGIT>)*
>>> >> >> >   >
>>> >> >> >
>>> >> >> > | < #ALPHA: (<LETTER>)+>
>>> >> >> > | < #LETTER: // unicode letters
>>> >> >> >     [
>>> >> >> >       "\u0041"-"\u005a",
>>> >> >> >       "\u0061"-"\u007a",
>>> >> >> >       "\u00c0"-"\u00d6",
>>> >> >> >       "\u00d8"-"\u00f6",
>>> >> >> >       "\u00f8"-"\u00ff",
>>> >> >> >       "\u0100"-"\u1fff",
>>> >> >> >       "-", "_"
>>> >> >> >     ]
>>> >> >> >   >
>>> >> >> >
>>> >> >> > ---------------------------------------------------------------------
>>> >> >> > To unsubscribe, e-mail: [EMAIL PROTECTED]
>>> >> >> > For additional commands, e-mail: [EMAIL PROTECTED]
>>> >> >> >
>>> >> >>
>>> >> >> --
>>> >> >> View this message in context:
>>> >> >> http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6106920
>>> >> >> Sent from the Lucene - Java Users forum at Nabble.com.

--
View this message in context: http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6115360
Sent from the Lucene - Java Users forum at Nabble.com.
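
P.S. for anyone reading this in the archive: Erick's caution about PatternAnalyzer (the regular expression describes what *breaks* tokens, not what *forms* them, as with String.split) can be seen with plain java.util.regex. This is only a rough sketch of the split semantics, not PatternAnalyzer itself and not Lucene code. The class name SplitSemanticsDemo, the helper splitOn, and the comma-separated keyword string are all made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class: shows split-style tokenizing with java.util.regex only.
class SplitSemanticsDemo {

    // Hypothetical helper: treats the regex as a SEPARATOR, like String.split does.
    static List<String> splitOn(String separatorRegex, String text) {
        return Arrays.asList(Pattern.compile(separatorRegex).split(text));
    }

    public static void main(String[] args) {
        // Assumed sample data: keywords separated by commas.
        String keywords = "phrase test,hello_dolly,foo-bar";

        // Regex names the separator: each keyword survives whole,
        // including the one with whitespace in it.
        System.out.println(splitOn(",", keywords));
        // prints [phrase test, hello_dolly, foo-bar]

        // Regex names the words themselves: the words are consumed as
        // separators, so only the text between them is left over.
        System.out.println(splitOn("\\w+", keywords));
    }
}
```

With "," as the pattern, "phrase test" comes back as one piece, which is the behavior wanted for the keyword field. With "\\w+" as the pattern, the words are thrown away and only the whitespace and punctuation between them remain, which is the mistake Erick describes making at first.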