[
https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436082#comment-13436082
]
Jack Krupansky commented on SOLR-3589:
--------------------------------------
Be careful not to confuse dismax and edismax. They are two different query
parsers, with different goals.
One of edismax's goals was to support "fielded queries" (e.g., "title:abc AND
date:123") and the full Lucene query syntax. No typical analyzer will be able
to tell you that title and date are field names.
Not "English-centric", but European/Latin-centric for sure. The edismax and
classic Lucene query parsers share that heritage, based on whitespace, but the
dismax query parser doesn't "suffer" from that same need to parse field names
and operators.
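To make the fielded-query point concrete, here is a toy sketch (not Solr's actual parser code; `parse_fielded` and `known_fields` are invented for illustration) of why a parser like edismax must split on whitespace and inspect tokens *before* analysis: only the parser can know that "title" and "date" are field names rather than search text.

```python
# Toy illustration of whitespace-first parsing for fielded queries.
# This is NOT Solr's implementation; it only shows why field names and
# operators must be recognized before any per-field analysis runs.

def parse_fielded(query, known_fields):
    """Split on whitespace, then classify each source term."""
    clauses = []
    for token in query.split():
        if token in ("AND", "OR", "NOT"):
            clauses.append(("operator", token))
            continue
        field, sep, text = token.partition(":")
        if sep and field in known_fields:
            # "title:abc" is a fielded clause, not literal text.
            clauses.append(("fielded", field, text))
        else:
            clauses.append(("term", token))
    return clauses

print(parse_fielded("title:abc AND date:123", {"title", "date"}))
# [('fielded', 'title', 'abc'), ('operator', 'AND'), ('fielded', 'date', '123')]
```

No field-type analyzer sees the query until after this split, which is exactly the heritage shared by the classic and edismax parsers described above.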
There is no question that better query parser support is needed for
non-European/Latin languages, but that requires careful, high-level, overall
design, which is a tall order for a fast-paced open source community where
features tend to be looked at in isolation.
One clarification...
bq. assumes that a term is a whitespace-delimited string
Yes and no. We need to be careful about distinguishing a "source term" - what
the parser recognizes, which is whitespace delimited, from "analyzed terms"
which are recognized and output by the field type analyzers. There is no
requirement that the output terms be whitespace-delimited or that the input to
an analyzer be whitespace-delimited. So, the theory has been that even a
whitespace-centric complex-structure query parser can also handle, for example,
Chinese text. Obviously that hasn't worked out as cleanly as desired and more
work is needed.
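A small sketch of that source-term/analyzed-term distinction (purely illustrative; the `analyze` function is a stand-in for a hyphen-splitting filter, not Solr internals): the parser sees one whitespace-delimited source term, "fire-fly", but the analyzer emits two terms, and which count mm=100% is applied to changes what the query requires.

```python
# Illustrative sketch of "source terms" (what the parser sees) versus
# "analyzed terms" (what the field analyzer outputs). NOT Solr code.

def analyze(source_term):
    """Stand-in for an analyzer that splits on hyphens."""
    return source_term.replace("-", " ").split()

query = "fire-fly"
source_terms = query.split()      # ['fire-fly'] -> one source term
analyzed = [t for s in source_terms for t in analyze(s)]  # ['fire', 'fly']

# mm=100% over source terms: 1 of 1 clause required.
# mm=100% over analyzed terms: 2 of 2 clauses required.
print(len(source_terms), len(analyzed))  # 1 2
```

The bug reported below arises in the gap between those two counts: the split terms end up as independent optional clauses, producing the effect of "fire OR fly" even at mm=100%.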
> Edismax parser does not honor mm parameter if analyzer splits a token
> ---------------------------------------------------------------------
>
> Key: SOLR-3589
> URL: https://issues.apache.org/jira/browse/SOLR-3589
> Project: Solr
> Issue Type: Bug
> Components: search
> Affects Versions: 3.6
> Reporter: Tom Burton-West
>
> With edismax mm set to 100%, if one of the tokens is split into two tokens by
> the analyzer chain (e.g., "fire-fly" => fire fly), the mm parameter is
> ignored and the equivalent of an OR query ("fire OR fly") is produced.
> This is particularly a problem for languages that do not use white space to
> separate words, such as Chinese or Japanese.
> See these messages for more discussion:
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-hypenated-words-WDF-splitting-etc-tc3991911.html
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-i-e-CJK-tc3991438.html
> http://lucene.472066.n3.nabble.com/Why-won-t-dismax-create-multiple-DisjunctionMaxQueries-when-autoGeneratePhraseQueries-is-false-tc3992109.html
--
This message is automatically generated by JIRA.