Re: Lucene QueryParser and Analyzer

Robert Muir Tue, 11 May 2010 16:56:50 -0700

FYI: I opened a JIRA issue for this bug here:
https://issues.apache.org/jira/browse/LUCENE-2458
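
For anyone reading the archive later, here is a rough sketch of the behavior Wei describes in the quoted message below. It is a toy model in plain Java and not Lucene code: the class QueryParseModel and its methods analyze, parseLikeQueryParser and analyzeThenBuild are hypothetical names made up for illustration, and the "analyzer" is only a stand-in that drops whitespace and punctuation. It assumes, as reported in the thread, that the parser splits the input on whitespace first and only then analyzes each chunk, turning a chunk that yields several tokens into one phrase clause.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the behavior reported in this thread; it does not use Lucene. */
final class QueryParseModel {

    /** Stand-in for the custom analyzer: splits on whitespace and punctuation, dropping both. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[\\s\\p{P}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Models the reported order: split on whitespace first, then analyze each chunk. */
    static List<String> parseLikeQueryParser(String input) {
        List<String> clauses = new ArrayList<>();
        for (String chunk : input.trim().split("\\s+")) {
            List<String> tokens = analyze(chunk);
            if (tokens.isEmpty()) {
                continue;
            }
            // All tokens produced from one chunk end up in a single phrase clause.
            clauses.add("text:\"" + String.join(" ", tokens) + "\"");
        }
        return clauses;
    }

    /** The order Wei asks for: analyze the whole input first, then emit one clause per token. */
    static List<String> analyzeThenBuild(String input) {
        List<String> clauses = new ArrayList<>();
        for (String token : analyze(input)) {
            clauses.add("text:\"" + token + "\"");
        }
        return clauses;
    }

    public static void main(String[] args) {
        String input1 = "C1C2,C3C4,C5C6,C7,C8C9C10";
        String input2 = "C1C2  C3C4  C5C6  C7  C8C9C10";
        System.out.println(parseLikeQueryParser(input1)); // one phrase clause
        System.out.println(parseLikeQueryParser(input2)); // five separate clauses
        System.out.println(analyzeThenBuild(input1));     // five separate clauses
    }
}
```

In this model Input1 collapses into a single phrase clause while Input2 yields five clauses, which matches Query1 and Query2 in the quoted message. Running the analysis over the whole string before building clauses makes both inputs come out the same; a real version of that would have to be written against Lucene's own analyzer and query classes, which are not shown here.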


On Thu, Apr 29, 2010 at 7:01 PM, Wei Ho <we...@princeton.edu> wrote:
> I think I've figured out what the problem is. Given the inputs,
>
> Input1: C1C2,C3C4,C5C6,C7,C8C9C10
> Input2: C1C2  C3C4  C5C6  C7  C8C9C10
>
> Input1 gets parsed as
> Query1: (text: "C1C2  C3C4  C5C6  C7  C8C9C10")
> whereas Input2 gets parsed as
> Query2: (text: "C1C2") (text: "C3C4") (text: "C5C6") (text: "C7") (text: "C8C9C10")
>
> That is, Lucene constructs the query first and only then passes the query
> text through the analyzer. Is there any way to force QueryParser to pass the
> input string through the analyzer before creating the query? In other words,
> I want Lucene to create Query2 for both Input1 and Input2.
>
> Thanks,
> Wei
>
>
> -------- Original Message  --------
> Subject: Re: Lucene QueryParser and Analyzer
> From: Sudarsan, Sithu D. <sithu.sudar...@fda.hhs.gov>
> To: java-user@lucene.apache.org
> Date: 4/29/2010 4:54 PM
>>
>> -------sample code-------------
>>
>>>>
>>>> Analyzer analyzer = new LingPipeAnalyzer();
>>>> Searcher searcher = new IndexSearcher(directory);
>>>> QueryParser qParser = new MultiFieldQueryParser(Version.LUCENE_30,
>>>> SEARCH_FIELDS, analyzer);
>>>> Query query = qParser.parse(queryLine[1]);
>>>> ScoreDoc[] results = searcher.search(query, TOP_N).scoreDocs;
>>>>
>>
>> qParser will use the analyzer LingPipeAnalyzer() before forming the
>> query.
>>
>>
>> Sincerely,
>> Sithu D Sudarsan
>>
>>
>> -----Original Message-----
>> From: Wei Ho [mailto:we...@princeton.edu]
>> Sent: Thursday, April 29, 2010 4:44 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Lucene QueryParser and Analyzer
>>
>> Sorry, I guess "discarding the punctuation" was a bit misleading.
>> I meant that given the two input strings,
>>
>> Input1: C1C2,C3C4,C5C6,C7,C8C9C10
>> Input2: C1C2  C3C4  C5C6  C7  C8C9C10
>>
>> The analyzer I implemented tokenizes both Input1 and Input2 as "C1C2",
>> "C3C4", "C5C6", "C7", "C8C9C10" - that is, it doesn't include the
>> punctuation in the tokenization. I'm assuming that QueryParser simply
>> passes the entire input string to the analyzer and takes the tokens,
>> in which case Input1 and Input2 should be considered identical. Is
>> QueryParser doing any sort of pre-processing or filtering beforehand? If
>> so, how can I turn it off?
>>
>> Aside from ending tokens at punctuation, my analyzer is also doing
>> Chinese word segmentation, so I'd like to be sure that QueryParser is
>> using the analyzer the way I expect it to.
>>
>> Thanks,
>> Wei
>>
>>
>>
>> -------- Original Message  --------
>> Subject: Re: Lucene QueryParser and Analyzer
>> From: Sudarsan, Sithu D.<sithu.sudar...@fda.hhs.gov>
>> To: java-user@lucene.apache.org
>> Date: 4/29/2010 4:08 PM
>>
>>>
>>> If so,
>>>
>>> Input1:  c1c2c3c4c5c6c7....
>>> Input2: c1c2 c3c4 ...
>>>
>>> I guess, they are different! Add a whitespace after commas and see if
>>> that works...
>>>
>>> Sincerely,
>>> Sithu D Sudarsan
>>>
>>>
>>> -----Original Message-----
>>> From: Wei Ho [mailto:we...@princeton.edu]
>>> Sent: Thursday, April 29, 2010 4:04 PM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: Lucene QueryParser and Analyzer
>>>
>>> No, there is no whitespace after the comma in Input1
>>>
>>> Input1: C1C2,C3C4,C5C6,C7,C8C9C10
>>> Input2: C1C2  C3C4  C5C6  C7  C8C9C10
>>>
>>> Input1 is basically one big long word with commas and Chinese characters
>>> one after the other. Input2 is where I manually separated the string
>>> into the component terms by replacing the commas with whitespace. My
>>> confusion stems from the fact that I thought it should not matter, since
>>> the analyzer should be discarding the punctuation anyway. So the
>>> tokenization process should be the same for both Input1 and Input2? If
>>> that is not the case, what do I need to change?
>>>
>>> Thanks,
>>> Wei Ho
>>>
>>> -------- Original Message  --------
>>> Subject: Re: Lucene QueryParser and Analyzer
>>> From: Sudarsan, Sithu D.<sithu.sudar...@fda.hhs.gov>
>>> To: java-user@lucene.apache.org
>>> Date: 4/29/2010 3:54 PM
>>>
>>>
>>>>
>>>> Hi,
>>>>
>>>> Is there a whitespace after the comma?
>>>>
>>>>
>>>> Sincerely,
>>>> Sithu D Sudarsan
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Wei Ho [mailto:we...@princeton.edu]
>>>> Sent: Thursday, April 29, 2010 3:51 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Lucene QueryParser and Analyzer
>>>>
>>>> Hello,
>>>>
>>>> I'm using Lucene to index and search through a collection of Chinese
>>>> documents. However, I'm noticing an odd behavior in query
>>>> parsing/searching.
>>>>
>>>> Given the two queries below:
>>>>
>>>> (Ci refers to Chinese character i)
>>>> Input1: C1C2,C3C4,C5C6,C7,C8C9C10
>>>> Input2: C1C2  C3C4  C5C6  C7  C8C9C10
>>>>
>>>> Input1 returns absolutely nothing, while Input2 (replacing the commas
>>>> with spaces) works as expected. I'm a bit confused why this would be
>>>> happening - it seems that QueryParser uses the Analyzer passed to it to
>>>> tokenize the input query string, so if the Analyzer ignores the
>>>> punctuation, it seems that Input1 and Input2 should return identical
>>>> results. Is there some pre-Analyzer filtering or whatever that
>>>> QueryParser does? I've tried this with the StandardAnalyzer,
>>>> SmartChineseAnalyzer, and an analyzer that I implemented which
>>>> explicitly skips over punctuation and whitespace in tokenizing the
>>>> query string, but to no avail.
>>>>
>>>>
>>>> -----------------------------------
>>>>
>>>> I'm probably just doing something dumb, but any help would be greatly
>>>> appreciated!
>>>>
>>>> Thanks,
>>>> Wei Ho
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org



-- 
Robert Muir
rcm...@gmail.com
