[jira] [Commented] (LUCENE-2605) queryparser parses on whitespace

Jack Krupansky (JIRA) Tue, 12 Jun 2012 09:34:44 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293748#comment-13293748 ]


Jack Krupansky commented on LUCENE-2605:
----------------------------------------

My thought on the original issue is that most query parsers should accumulate 
adjacent terms without intervening operators as a "term list" (quoted phrases 
would be a second level of term list) and that there needs to be a "list" 
interface for query term analysis.

Rather than simply present a raw text stream for the sequence/list of terms, 
each term would be fed into the token stream with an attribute that indicates 
which source term it belongs to.
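
As a rough illustration of the "term list" idea, here is a minimal sketch in plain Java. It does not use any Lucene API: the class TermListSketch, the Token type, the sourceIndex field, and the operator set (AND, OR, NOT) are all hypothetical names chosen for this example. It groups adjacent terms that have no operator between them and tags each term with the index of the source term it came from.

```java
import java.util.ArrayList;
import java.util.List;

final class TermListSketch {
    /** A query term tagged with the index of the source term it belongs to (hypothetical type). */
    static final class Token {
        final String text;
        final int sourceIndex;

        Token(String text, int sourceIndex) {
            this.text = text;
            this.sourceIndex = sourceIndex;
        }

        @Override
        public String toString() {
            return text + "@" + sourceIndex;
        }
    }

    /** Assumed operator set for this sketch; a real parser has a richer grammar. */
    static boolean isOperator(String s) {
        return s.equals("AND") || s.equals("OR") || s.equals("NOT");
    }

    /** Groups adjacent non-operator terms into term lists; an operator closes the current list. */
    static List<List<Token>> termLists(String query) {
        List<List<Token>> lists = new ArrayList<>();
        List<Token> current = new ArrayList<>();
        int sourceIndex = 0;
        for (String piece : query.trim().split("\\s+")) {
            if (piece.isEmpty()) {
                continue; // empty query
            }
            if (isOperator(piece)) {
                if (!current.isEmpty()) {
                    lists.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(new Token(piece, sourceIndex++));
            }
        }
        if (!current.isEmpty()) {
            lists.add(current);
        }
        return lists;
    }

    public static void main(String[] args) {
        // prints [[new@0, york@1, city@2], [pizza@3]]
        System.out.println(termLists("new york city AND pizza"));
    }
}
```

Each inner list could then be handed to the analyzer as one unit, so that filters such as synonyms or shingles see "new york city" together rather than three isolated terms.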

The synonym processor would see a clean flow of terms and do its processing, 
but would also need to associate an id with each term of a multi-term synonym 
phrase so that multiple multi-word synonym choices for the same input term(s) 
don't get mixed up (i.e., multiple tokens at the same position with no 
indication of which original synonym phrase they came from).

By having those IDs for each multi-term synonym phrase, the caller of the list 
analyzer could then reconstruct the tree of "OR" expressions for the various 
multi-term synonym phrases.
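
The reconstruction step might look something like the following sketch. It is plain Java with no Lucene dependency, and everything in it is an assumption made for illustration: the Token type, the phraseId field, and the textual "OR" output stand in for whatever attribute and query objects a real implementation would use. Tokens that share a phraseId are regrouped into one alternative, so interleaved tokens at the same position do not get mixed up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SynonymTreeSketch {
    /** A token with its position and the id of the synonym phrase it belongs to (hypothetical). */
    static final class Token {
        final String text;
        final int position;
        final int phraseId;

        Token(String text, int position, int phraseId) {
            this.text = text;
            this.position = position;
            this.phraseId = phraseId;
        }
    }

    /** Example analyzer output for "nyc" with two multi-word synonyms, interleaved by position. */
    static List<Token> exampleTokens() {
        return Arrays.asList(
            new Token("nyc", 0, 0),
            new Token("new", 0, 1),
            new Token("big", 0, 2),
            new Token("york", 1, 1),
            new Token("apple", 1, 2));
    }

    /** Regroups tokens by phraseId and renders the alternatives as one OR expression. */
    static String orExpression(List<Token> tokens) {
        Map<Integer, List<Token>> byPhrase = new LinkedHashMap<>();
        for (Token t : tokens) {
            byPhrase.computeIfAbsent(t.phraseId, k -> new ArrayList<>()).add(t);
        }
        List<String> alternatives = new ArrayList<>();
        for (List<Token> group : byPhrase.values()) {
            // Order the words of each phrase by their position in the stream.
            group.sort(Comparator.comparingInt((Token t) -> t.position));
            List<String> words = new ArrayList<>();
            for (Token t : group) {
                words.add(t.text);
            }
            String joined = String.join(" ", words);
            alternatives.add(words.size() > 1 ? "\"" + joined + "\"" : joined);
        }
        return "(" + String.join(" OR ", alternatives) + ")";
    }

    public static void main(String[] args) {
        // prints (nyc OR "new york" OR "big apple")
        System.out.println(orExpression(exampleTokens()));
    }
}
```

Without the phraseId, the stream above only says that "new" and "big" share position 0 and "york" and "apple" share position 1, which would also admit the unintended phrases "new apple" and "big york".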

                
> queryparser parses on whitespace
> --------------------------------
>
>                 Key: LUCENE-2605
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2605
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/queryparser
>            Reporter: Robert Muir
>             Fix For: 4.1
>
>
> The queryparser parses input on whitespace, and sends each whitespace 
> separated term to its own independent token stream.
> This breaks the following at query-time, because they can't see across 
> whitespace boundaries:
> * n-gram analysis
> * shingles 
> * synonyms (especially multi-word for whitespace-separated languages)
> * languages where a 'word' can contain whitespace (e.g. Vietnamese)
> It's also rather unexpected, as users think their 
> charfilters/tokenizers/tokenfilters will do the same thing at index and 
> query time, but in many cases they can't. Instead, preferably the queryparser 
> would parse around only real 'operators'.
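
To make the problem in the issue description concrete, here is a small sketch of why pre-splitting on whitespace defeats shingles. It is a simplified stand-in written in plain Java, not Lucene's actual shingle filter: the class and method names are hypothetical, and it emits only two-word shingles. Analyzing the whole text yields shingles, while analyzing each whitespace-separated chunk on its own yields none, because no chunk ever contains two tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class WhitespaceSplitSketch {
    /** Emits every run of `size` adjacent tokens, joined by a space. */
    static List<String> shingles(List<String> tokens, int size) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + size)));
        }
        return out;
    }

    /** The analysis chain sees the full text, as it would at index time. */
    static List<String> analyzeWhole(String text) {
        return shingles(Arrays.asList(text.trim().split("\\s+")), 2);
    }

    /** Each whitespace-separated chunk goes to its own independent analysis run. */
    static List<String> analyzePerChunk(String text) {
        List<String> out = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            out.addAll(analyzeWhole(chunk));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyzeWhole("new york city"));    // prints [new york, york city]
        System.out.println(analyzePerChunk("new york city")); // prints []
    }
}
```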
