FYI: I opened a jira issue for this bug here: https://issues.apache.org/jira/browse/LUCENE-2458
On Thu, Apr 29, 2010 at 7:01 PM, Wei Ho <we...@princeton.edu> wrote: > I think I've figured out what the problem is. Given the inputs, > > Input1: C1C2,C3C4,C5C6,C7,C8C9C10 > Input2: C1C2 C3C4 C5C6 C7 C8C9C10 > > Input1 gets parsed as > Query1: (text: "C1C2 C3C4 C5C6 C7 C8C9C10") > whereas Input2 gets parsed as > Query2: (text: "C1C2") (text: "C3C4") (text: "C5C6") (text: "C7") (text: > "C8C9C10") > > That is, Lucene constructs the query and then pass the query text through > the analyzer. Is there any way to > force QueryParser to pass the input string through the analyzer before > creating the query? That is, force Lucene > to create Query2 for both Input1 and Input2. > > Thanks, > Wei > > > -------- Original Message -------- > Subject: Re: Lucene QueryParser and Analyzer > From: Sudarsan, Sithu D. <sithu.sudar...@fda.hhs.gov> > To: java-user@lucene.apache.org > Date: 4/29/2010 4:54 PM >> >> -------sample code------------- >> >>>> >>>> Analyzer analyzer = new LingPipeAnalyzer(); >>>> Searcher searcher = new IndexSearcher(directory); >>>> QueryParser qParser = new MultiFieldQueryParser(Version.LUCENE_30, >>>> SEARCH_FIELDS, analyzer); >>>> Query query = qParser.parse(queryLine[1]); >>>> ScoreDoc[] results = searcher.search(query, TOP_N).scoreDocs; >>>> >> >> qParser will use the analyzer LingPipeAnalyzer() before forming the >> query. >> >> >> Sincerely, >> Sithu D Sudarsan >> >> >> -----Original Message----- >> From: Wei Ho [mailto:we...@princeton.edu] >> Sent: Thursday, April 29, 2010 4:44 PM >> To: java-user@lucene.apache.org >> Subject: Re: Lucene QueryParser and Analyzer >> >> Sorry, I guess "discarding the punctuation" was a bit misleading. >> I meant that given the two input strings, >> >> Input1: C1C2,C3C4,C5C6,C7,C8C9C10 >> Input2: C1C2 C3C4 C5C6 C7 C8C9C10 >> >> The analyzer I implemented tokenizes both Input1 and Input2 as "C1C2", >> "C3C4", "C5C6", "C7", "C8C9C10" - that is, it doesn't include the >> punctuation in the tokenization. I'm assuming that QueryParser is simply >> >> passing the entire input string to the analyzer and taking the tokens, >> in which case Input1 and Input2 should be considered identifcal. Does >> QueryParser doing any sort of pre-processing or filtering beforehand? If >> >> so, how can I turn it off? >> >> Aside from stopping tokens at punctuations, my analyzer is also doing >> Chinese word segmentation, so I'd like to be sure that QueryParser is >> using the analyzer the way I expect it to. >> >> Thanks, >> Wei >> >> >> >> -------- Original Message -------- >> Subject: Re: Lucene QueryParser and Analyzer >> From: Sudarsan, Sithu D.<sithu.sudar...@fda.hhs.gov> >> To: java-user@lucene.apache.org >> Date: 4/29/2010 4:08 PM >> >>> >>> If so, >>> >>> Input1: c1c2c3c4c5c6c7.... >>> Input2: c1c2 c3c4 ... >>> >>> I guess, they are different! Add a whitespace after commas and see if >>> that works... >>> >>> Sincerely, >>> Sithu D Sudarsan >>> >>> >>> -----Original Message----- >>> From: Wei Ho [mailto:we...@princeton.edu] >>> Sent: Thursday, April 29, 2010 4:04 PM >>> To: java-user@lucene.apache.org >>> Subject: Re: Lucene QueryParser and Analyzer >>> >>> No, there is no whitespace after the comma in Input1 >>> >>> Input1: C1C2,C3C4,C5C6,C7,C8C9C10 >>> Input2: C1C2 C3C4 C5C6 C7 C8C9C10 >>> >>> Input1 is basically one big long word with commas and Chinese >>> >> >> characters >> >>> >>> one after the other. Input2 is where I manually separated the string >>> into the component terms by replacing the comma with whitespace. My >>> confusion stems from the fact that I thought it should not matter >>> >> >> since >> >>> >>> the analyzer should be discarding the punctuation anyway? So the >>> tokenization process should be the same for both Input1 and Input2? If >>> that is not the case, what do I need to change? >>> >>> Thanks, >>> Wei Ho >>> >>> -------- Original Message -------- >>> Subject: Re: Lucene QueryParser and Analyzer >>> From: Sudarsan, Sithu D.<sithu.sudar...@fda.hhs.gov> >>> To: java-user@lucene.apache.org >>> Date: 4/29/2010 3:54 PM >>> >>> >>>> >>>> Hi, >>>> >>>> Is there a whitespace after the comma? >>>> >>>> >>>> Sincerely, >>>> Sithu D Sudarsan >>>> >>>> >>>> -----Original Message----- >>>> From: Wei Ho [mailto:we...@princeton.edu] >>>> Sent: Thursday, April 29, 2010 3:51 PM >>>> To: java-user@lucene.apache.org >>>> Subject: Lucene QueryParser and Analyzer >>>> >>>> Hello, >>>> >>>> I'm using Lucene to index and search through a collection of Chinese >>>> documents. However, I'm noticing an odd behavior in query >>>> parsing/searching. >>>> >>>> Given the two queries below: >>>> >>>> (Ci refers to Chinese character i) >>>> Input1: C1C2,C3C4,C5C6,C7,C8C9C10 >>>> Input2: C1C2 C3C4 C5C6 C7 C8C9C10 >>>> >>>> Input1 returns absolutely nothing, while Input2 (replacing the commas >>>> with spaces) works as expected. I'm a bit confused why this would be >>>> happening - it seems that QueryParser uses the Analyzer passed to it >>>> >>>> >>> >>> to >>> >>> >>>> >>>> tokenize the input query string, so if the Analyzer ignores the >>>> punctuations, it seems that Input1 and Input2 should return identical >>>> results. Is there some pre-Analyzer filtering or whatever that >>>> QueryParser does? I've tried this with the StandardAnalyzer, >>>> SmartChineseAnalyzer, and an analyzer that I implemented which >>>> explicitly skips over punctuations and whitespaces in tokenizing the >>>> query string, but to no avail. >>>> >>>> >>>> ----------------------------------- >>>> >>>> I'm probably just doing something dumb, but any help would be greatly >>>> appreciated! >>>> >>>> Thanks, >>>> Wei Ho >>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>> >>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>> >>>> >>>> >>>> >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>> >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>> >>> >>> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > -- Robert Muir rcm...@gmail.com --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org