[
https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436082#comment-13436082
]
Jack Krupansky commented on SOLR-3589:
--------------------------------------
Be careful not to confuse dismax and edismax. They are two different query
parsers, with different goals.
One of edismax's goals was to support "fielded queries" (e.g., "title:abc AND
date:123") and the full Lucene query syntax. No typical analyzer will be able
to tell you that title and date are field names.
Not "English-centric", but European/Latin-centric for sure. The edismax and
classic Lucene query parsers share that heritage, based on whitespace, but the
dismax query parser doesn't "suffer" from that same need to parse field names
and operators.
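To make the fielded-query point concrete, here is a toy sketch (not Solr's actual parser code; `parse_fielded` and `known_fields` are invented for illustration) of why a parser like edismax must split on whitespace and inspect tokens *before* analysis: only the parser can know that "title" and "date" are field names rather than search text.

```python
# Toy illustration of whitespace-first parsing for fielded queries.
# This is NOT Solr's implementation; it only shows why field names and
# operators must be recognized before any per-field analysis runs.

def parse_fielded(query, known_fields):
    """Split on whitespace, then classify each source term."""
    clauses = []
    for token in query.split():
        if token in ("AND", "OR", "NOT"):
            clauses.append(("operator", token))
            continue
        field, sep, text = token.partition(":")
        if sep and field in known_fields:
            # "title:abc" is a fielded clause, not literal text.
            clauses.append(("fielded", field, text))
        else:
            clauses.append(("term", token))
    return clauses

print(parse_fielded("title:abc AND date:123", {"title", "date"}))
# [('fielded', 'title', 'abc'), ('operator', 'AND'), ('fielded', 'date', '123')]
```

No field-type analyzer sees the query until after this split, which is exactly the heritage shared by the classic and edismax parsers described above.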
There is no question that better query parser support is needed for
non-European/Latin languages, but that requires careful, high-level, overall
design, which is a tall order for a fast-paced open source community where
features tend to be looked at in isolation.
One clarification...
bq. assumes that a term is a whitespace-delimited string
Yes and no. We need to be careful about distinguishing a "source term" - what
the parser recognizes, which is whitespace delimited, from "analyzed terms"
which are recognized and output by the field type analyzers. There is no
requirement that the output terms be whitespace-delimited or that the input to
an analyzer be whitespace-delimited. So, the theory has been that even a
whitespace-centric complex-structure query parser can also handle, for example,
Chinese text. Obviously that hasn't worked out as cleanly as desired and more
work is needed.
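A small sketch of that source-term/analyzed-term distinction (purely illustrative; the `analyze` function is a stand-in for a hyphen-splitting filter, not Solr internals): the parser sees one whitespace-delimited source term, "fire-fly", but the analyzer emits two terms, and which count mm=100% is applied to changes what the query requires.

```python
# Illustrative sketch of "source terms" (what the parser sees) versus
# "analyzed terms" (what the field analyzer outputs). NOT Solr code.

def analyze(source_term):
    """Stand-in for an analyzer that splits on hyphens."""
    return source_term.replace("-", " ").split()

query = "fire-fly"
source_terms = query.split()      # ['fire-fly'] -> one source term
analyzed = [t for s in source_terms for t in analyze(s)]  # ['fire', 'fly']

# mm=100% over source terms: 1 of 1 clause required.
# mm=100% over analyzed terms: 2 of 2 clauses required.
print(len(source_terms), len(analyzed))  # 1 2
```

The bug reported below arises in the gap between those two counts: the split terms end up as independent optional clauses, producing the effect of "fire OR fly" even at mm=100%.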
> Edismax parser does not honor mm parameter if analyzer splits a token
> ---------------------------------------------------------------------
>
> Key: SOLR-3589
> URL: https://issues.apache.org/jira/browse/SOLR-3589
> Project: Solr
> Issue Type: Bug
> Components: search
> Affects Versions: 3.6
> Reporter: Tom Burton-West
>
> With edismax mm set to 100%, if one of the tokens is split into two tokens by
> the analyzer chain (e.g., "fire-fly" => fire fly), the mm parameter is
> ignored and the equivalent of an OR query ("fire OR fly") is produced.
> This is particularly a problem for languages that do not use white space to
> separate words, such as Chinese or Japanese.
> See these messages for more discussion:
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-hypenated-words-WDF-splitting-etc-tc3991911.html
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-i-e-CJK-tc3991438.html
> http://lucene.472066.n3.nabble.com/Why-won-t-dismax-create-multiple-DisjunctionMaxQueries-when-autoGeneratePhraseQueries-is-false-tc3992109.html
--
This message is automatically generated by JIRA.