[jira] [Commented] (LUCENE-2605) queryparser parses on whitespace

John Berryman (JIRA) Tue, 12 Jun 2012 09:08:44 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293722#comment-13293722
 ]


John Berryman commented on LUCENE-2605:
---------------------------------------

There is a partial workaround for this with defType=lucene: escape every 
whitespace character with *{{\}}*. So instead of *{{new dress shoes}}*, search 
for *{{new\ dress\ shoes}}*. Of course, you lose the ability to use normal 
Lucene syntax.
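
As a rough illustration of the escaping step, here is a minimal client-side 
sketch. The helper name {{escape_whitespace}} is hypothetical and is not part 
of Lucene or Solr; it only prefixes whitespace with a backslash and does not 
escape any other query syntax characters.

```python
import re


def escape_whitespace(query: str) -> str:
    """Prefix every whitespace character in the query with a backslash."""
    # Hypothetical helper for illustration; only whitespace is escaped.
    return re.sub(r"\s", lambda m: "\\" + m.group(0), query)


if __name__ == "__main__":
    # Build the escaped form of the example query before sending it.
    print(escape_whitespace("new dress shoes"))
```

The escaped string would then be sent as the query with defType=lucene.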

I was hoping that this workaround would also work for defType=dismax, but with 
or without the escaped whitespace, queries get interpreted the same, incorrect 
way. For instance, assume I have the following line in my synonyms.txt: 
*{{dress shoes => dress_shoes}}*. Further assume that I have a field 
*{{experiment}}* that gets analysed with synonyms. A search for *{{new dress 
shoes}}* (with or without escaped spaces) will be interpreted as 

*{{+((experiment:new)~0.01 (experiment:dress)~0.01 (experiment:shoes)~0.01) 
(experiment:"new dress_shoes"~3)~0.01}}*

The first clause is mandatory and contains independently analysed tokens, so 
it only matches documents whose indexed terms include "new", "dress", or 
"shoes". It never matches a document containing "dress shoes", because at 
index time the synonym filter works as expected and indexes that phrase as 
"dress_shoes".
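
The mismatch can be reproduced with a toy model. This is only a sketch under 
stated assumptions: it is not Lucene or Solr code, the function names are 
made up for illustration, and the "analyzer" is just a whitespace tokenizer 
plus the single synonym rule from the example above.

```python
# Toy model only; none of these names exist in Lucene or Solr.
SYNONYMS = {("dress", "shoes"): "dress_shoes"}  # dress shoes => dress_shoes


def analyze(text):
    """Whitespace-tokenize, then collapse two-word synonyms left to right."""
    words = text.lower().split()
    out, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in SYNONYMS:
            out.append(SYNONYMS[pair])
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out


def parse_then_analyze(query):
    """Mimic the reported behaviour: split on whitespace, analyse each piece alone."""
    tokens = []
    for piece in query.split():
        # Each piece is a single word, so the synonym rule can never fire.
        tokens.extend(analyze(piece))
    return tokens


if __name__ == "__main__":
    indexed = analyze("new dress shoes")             # index-time view of the text
    queried = parse_then_analyze("new dress shoes")  # query-time view of the text
    print("indexed terms:", indexed)
    print("query terms:  ", queried)
    print("overlap:      ", [t for t in queried if t in indexed])
```

In this model the index side sees the whole text and emits "dress_shoes", 
while the query side analyses one word at a time and never does, so the 
query terms "dress" and "shoes" have nothing to match.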
                
> queryparser parses on whitespace
> --------------------------------
>
>                 Key: LUCENE-2605
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2605
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/queryparser
>            Reporter: Robert Muir
>             Fix For: 4.1
>
>
> The queryparser parses input on whitespace, and sends each 
> whitespace-separated term to its own independent token stream.
> This breaks the following at query-time, because they can't see across 
> whitespace boundaries:
> * n-gram analysis
> * shingles 
> * synonyms (especially multi-word for whitespace-separated languages)
> * languages where a 'word' can contain whitespace (e.g. Vietnamese)
> It's also rather unexpected, as users think their 
> charfilters/tokenizers/tokenfilters will do the same thing at index and 
> query time, but in many cases they can't. Instead, preferably the 
> queryparser would parse around only real 'operators'.


        
