[
https://issues.apache.org/jira/browse/SOLR-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333833#comment-14333833
]
Ted Sullivan edited comment on SOLR-7136 at 2/23/15 9:29 PM:
-------------------------------------------------------------
Yes Ahmet - that is correct: this patch includes a QParserPlugin
(AutophrasingQParserPlugin) as a workaround for LUCENE-2605, which is also
mentioned in SOLR-5379. The query parser solution published by Nolan Lawson
and submitted as SOLR-4381 is a good solution too. Note, however, that the
AutoPhrasing parser first solves the problem of tokenizing phrases that
represent single entities as single tokens - making the Lucene docID lookup
cleaner. Solutions like SOLR-5379 solve this indirectly and may have different
edge cases, because not all phrases are meant to represent single entities. For
example, generalized phrase-processing parameters like mm or ps may not deal as
precisely with phrases that combine a multi-term entity with something else,
as in "New York City restaurants". Since it is part of an analysis pipeline,
the AutoPhrasingTokenFilter can be used in conjunction with the SynonymFilter
to solve the multi-term synonym problem, but that is an architectural solution.
In other words, this TokenFilter was not written to solve the multi-term
synonym problem - that is a side benefit of what it does, given the nature of
Lucene analysis chains. It has other benefits as well, simply by forcing
exact-match semantics on phrases that should be treated as semantic or
linguistic entities. It does have the downside of requiring autophrase lists,
but so does synonym processing.
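The core idea - collapsing a known multi-term phrase into a single token during
analysis - can be sketched roughly as follows. This is a toy illustration in
Python, not the patch's actual Java implementation; the greedy longest-match
strategy, the phrase list, and the "_" separator are assumptions for the sake
of the example:

```python
def autophrase(tokens, phrases, sep="_"):
    """Greedily replace known multi-word phrases with single joined tokens.

    tokens:  list of already-tokenized words
    phrases: known multi-word entities, e.g. loaded from an autophrase list
    """
    phrase_words = [p.split() for p in phrases]
    out, i = [], 0
    while i < len(tokens):
        # Find the longest phrase that matches starting at position i.
        best = None
        for words in phrase_words:
            if tokens[i:i + len(words)] == words:
                if best is None or len(words) > len(best):
                    best = words
        if best:
            out.append(sep.join(best))  # emit the phrase as one token
            i += len(best)
        else:
            out.append(tokens[i])       # ordinary token, pass through
            i += 1
    return out

tokens = autophrase("new york city restaurants".split(),
                    ["new york", "new york city"])
print(tokens)  # ['new_york_city', 'restaurants']
```

Because "new york city" becomes the single token new_york_city at both index
and query time, a query for the entity matches exactly, and downstream synonym
rules can then map that single token to its variants.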
> Add an AutoPhrasing TokenFilter
> -------------------------------
>
> Key: SOLR-7136
> URL: https://issues.apache.org/jira/browse/SOLR-7136
> Project: Solr
> Issue Type: New Feature
> Reporter: Ted Sullivan
> Attachments: SOLR-7136.patch
>
>
> Adds an 'autophrasing' token filter which is designed to enable noun phrases
> that represent a single entity to be tokenized in a singular fashion. Adds
> support for ManagedResources and query parser auto-phrasing support given
> LUCENE-2605.
> The rationale for this token filter and its use in solving the long-standing
> multi-term synonym problem in Lucene/Solr has been documented online:
> http://lucidworks.com/blog/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/
> https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/