[
https://issues.apache.org/jira/browse/SOLR-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337873#comment-14337873
]
Ted Sullivan edited comment on SOLR-7136 at 2/26/15 5:14 AM:
-------------------------------------------------------------
[~otis] I am looking at this now - I have not tested/compared these solutions
yet. I will definitely do that. The strategy of
[SOLR-5379|https://issues.apache.org/jira/browse/SOLR-5379] is a clever one -
it forces tokenization of phrases that are in synonyms.txt by detecting
internal whitespace and then either forces PhraseQuery logic or automatic
quoting when building the Lucene Query (using TypeAttributes). In that sense,
the two ideas are similar.
The autophrasing token filter solves a slightly different problem than does
[SOLR-5379|https://issues.apache.org/jira/browse/SOLR-5379] in that it does not
require that a term be listed as a synonym of something else to get the correct
semantic tokenization. Simply reducing false positives due to partial hits on a
phrase can be a large improvement in precision and not everything has an
obvious synonym. (That said, can you have a single term in synonyms.txt that
does not map to anything else? I haven't tried that - maybe because it would
need SOLR-5379 to work.) Therefore, it has value even if
[SOLR-5379|https://issues.apache.org/jira/browse/SOLR-5379] is also committed
(or patched). Another difference is that one of the
[SOLR-5379|https://issues.apache.org/jira/browse/SOLR-5379] patches uses
PhraseQuery whereas the solution of combining autophrasing with synonym mapping
does not. How much of a performance difference this might entail I can't say -
probably not a great deal unless we are talking about very large queries. The
auto quoting parser patches work in a similar fashion to the
AutoPhrasingQParserPlugin as a workaround to
[LUCENE-2605|https://issues.apache.org/jira/browse/LUCENE-2605].
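To make the autophrasing idea concrete, here is a minimal, self-contained sketch of the matching step. It is not the code in the attached patch and it does not use the Lucene TokenStream API; the class name, the autoPhrase helper and the maxWords parameter are hypothetical and exist only for illustration. It assumes the phrase list is held as a plain set of whitespace-joined strings and prefers the longest match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

public class AutoPhraseSketch {
    // Hypothetical helper, not the patch's code: greedily merges the longest
    // run of tokens that matches a known phrase into a single token.
    static List<String> autoPhrase(List<String> tokens, Set<String> phrases, int maxWords) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int matched = 0;
            // Try the longest candidate first so "new york city" wins over "new york".
            for (int len = Math.min(maxWords, tokens.size() - i); len >= 2; len--) {
                String candidate = String.join(" ", tokens.subList(i, i + len));
                if (phrases.contains(candidate)) {
                    matched = len;
                    break;
                }
            }
            if (matched > 0) {
                // Emit the whole phrase as one token.
                out.add(String.join(" ", tokens.subList(i, i + matched)));
                i += matched;
            } else {
                // No phrase starts here; pass the single token through.
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> phrases = Set.of("income tax", "new york city");
        List<String> tokens = Arrays.asList("income", "tax", "in", "new", "york", "city");
        System.out.println(autoPhrase(tokens, phrases, 3));
    }
}
```

Because "income tax" comes out as one token, a document that only mentions "income" or "tax" separately no longer produces a partial hit, and no synonym entry is needed for that to happen.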
I think that the query parser solution in
[SOLR-5379|https://issues.apache.org/jira/browse/SOLR-5379] is better as it
solves the problem in a general way. Getting non-synonymous phrases into it
may require some tweaking so that the TypeAttribute matches up. I wouldn't use
typeAttribute.type().equals("SYNONYM"); maybe typeAttribute.type() should be
"PHRASE". So if both are committed, we should remove the
AutoPhrasingQParserPlugin from this patch because both solve the exact same
problem.
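As a rough illustration of the type check discussed above, the sketch below quotes a token when its type is in a configurable set, rather than comparing against "SYNONYM" alone. It is a simplification under stated assumptions: TypedToken is a stand-in for a token plus its TypeAttribute value, buildQuery is a hypothetical helper, and none of this is taken from the SOLR-5379 patches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

public class TypeQuotingSketch {
    // Stand-in for a token plus its TypeAttribute value; not a Lucene class.
    static final class TypedToken {
        final String text;
        final String type;

        TypedToken(String text, String type) {
            this.text = text;
            this.type = type;
        }
    }

    // Hypothetical helper: quotes a token when its type is one of quotedTypes
    // and its text contains internal whitespace.
    static String buildQuery(List<TypedToken> tokens, Set<String> quotedTypes) {
        List<String> parts = new ArrayList<>();
        for (TypedToken t : tokens) {
            boolean quote = quotedTypes.contains(t.type) && t.text.indexOf(' ') >= 0;
            parts.add(quote ? "\"" + t.text + "\"" : t.text);
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        List<TypedToken> tokens = Arrays.asList(
            new TypedToken("heart attack", "SYNONYM"),
            new TypedToken("income tax", "PHRASE"),
            new TypedToken("rates", "word"));
        // Checking only "SYNONYM" leaves the non-synonymous phrase unquoted.
        System.out.println(buildQuery(tokens, Set.of("SYNONYM")));
        // Accepting "PHRASE" as well quotes both multi-word tokens.
        System.out.println(buildQuery(tokens, Set.of("SYNONYM", "PHRASE")));
    }
}
```

With only "SYNONYM" accepted, "income tax" is emitted unquoted; once "PHRASE" is accepted as well, both multi-word tokens are quoted.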
The autophrasing multi-term synonym solution does have the disadvantage of
requiring coupling between the autophrases.txt and synonyms.txt, which the
other solution does not. But that said, the other solution does not deal with
multi-word terms that do not have synonyms (I suppose that you could create a
dummy synonym but that would be difficult to maintain).
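The coupling between the two files can be sketched as follows. The example assumes, purely for illustration, that the autophrasing step emits a phrase as a single underscore-joined token; the synonym entry then has to be written in that same joined form or the lookup misses. The class name, the expand helper and the sample entries are hypothetical.

```java
import java.util.List;
import java.util.Map;

public class CouplingSketch {
    // Hypothetical helper: returns the token followed by its synonyms,
    // or the token alone when the synonym map has no entry for it.
    static String expand(String token, Map<String, List<String>> synonyms) {
        List<String> alts = synonyms.get(token);
        return alts == null ? token : token + "|" + String.join("|", alts);
    }

    public static void main(String[] args) {
        // Assumed output of the autophrasing step for the input "big apple".
        String token = "big_apple";

        // Synonym entry written in the same joined form: the lookup matches.
        Map<String, List<String>> coupled = Map.of("big_apple", List.of("new_york_city"));
        // Synonym entry written with spaces: the joined token never matches it.
        Map<String, List<String>> uncoupled = Map.of("big apple", List.of("new york city"));

        System.out.println(expand(token, coupled));
        System.out.println(expand(token, uncoupled));
    }
}
```

The second lookup returns the token unchanged, which is the maintenance cost of this approach: every phrase that should have synonyms must appear in both files, in matching forms.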
To answer your question about a 'superset': yes, if you consider the
solutions for multi-term synonym mapping to be equivalent. All in all, I
would say that both solutions are worthwhile and would add useful functionality
to the available Solr toolset. Dealing with multi-word terms is a problem that
many Solr deployments have, and it is one that remains unresolved.
> Add an AutoPhrasing TokenFilter
> -------------------------------
>
> Key: SOLR-7136
> URL: https://issues.apache.org/jira/browse/SOLR-7136
> Project: Solr
> Issue Type: New Feature
> Reporter: Ted Sullivan
> Attachments: SOLR-7136.patch, SOLR-7136.patch
>
>
> Adds an 'autophrasing' token filter which is designed to enable noun phrases
> that represent a single entity to be tokenized in a singular fashion. Adds
> support for ManagedResources and Query parser auto-phrasing support given
> LUCENE-2605.
> The rationale for this Token Filter and its use in solving the long standing
> multi-term synonym problem in Lucene Solr has been documented online.
> http://lucidworks.com/blog/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/
> https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/