[jira] [Comment Edited] (SOLR-7136) Add an AutoPhrasing TokenFilter

Ted Sullivan (JIRA) Wed, 25 Feb 2015 21:16:02 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337873#comment-14337873
 ]


Ted Sullivan edited comment on SOLR-7136 at 2/26/15 5:14 AM:
-------------------------------------------------------------

[~otis] I am looking at this now - I have not tested/compared these solutions 
yet. I will definitely do that. The strategy of 
[SOLR-5379|https://issues.apache.org/jira/browse/SOLR-5379] is a clever one - 
it forces tokenization of phrases that are in synonyms.txt by detecting 
internal whitespace and then either forces PhraseQuery logic or automatic 
quoting when building the Lucene Query (using TypeAttributes). In that sense, 
the two ideas are similar.

The autophrasing token filter solves a slightly different problem than does 
[SOLR-5379|https://issues.apache.org/jira/browse/SOLR-5379] in that it does not 
require that a term be listed as a synonym of something else to get the correct 
semantic tokenization. Simply reducing false positives due to partial hits on a 
phrase can be a large improvement in precision and not everything has an 
obvious synonym. (That said, can you have a single term in synonyms.txt that 
does not map to anything else? I haven't tried that - maybe because it would 
need SOLR-5379 to work.) Therefore, it has value even if 
[SOLR-5379|https://issues.apache.org/jira/browse/SOLR-5379] is also committed 
(or patched). Another difference is that one of the 
[SOLR-5379|https://issues.apache.org/jira/browse/SOLR-5379] patches uses 
PhraseQuery whereas the solution of combining autophrasing with synonym mapping 
does not. How much of a performance difference this might entail I can't say - 
probably not a great deal unless we are talking about very large queries. The 
auto quoting parser patches work in a similar fashion to the 
AutoPhrasingQParserPlugin as a workaround to 
[LUCENE-2605|https://issues.apache.org/jira/browse/LUCENE-2605]. 
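
To make the idea concrete, here is a minimal sketch of what autophrasing does to a token stream. It is plain Java and not the code in the attached patch: the class name AutoPhraseSketch, the autoPhrase method, the maxWords parameter and the use of an underscore to join the words are all assumptions made for illustration. The point is only that a known phrase becomes one token, so a partial hit on "new" or "york" no longer matches it, and no synonym entry is needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this is not the SOLR-7136 filter itself.
final class AutoPhraseSketch {

    // Collapses the longest known phrase starting at each position into one token.
    // The underscore joiner is an assumption chosen for readability.
    static List<String> autoPhrase(List<String> tokens, Set<String> phrases, int maxWords) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int matched = 0;
            // Try the longest candidate first so "new york city" wins over "new york".
            for (int n = Math.min(maxWords, tokens.size() - i); n >= 2; n--) {
                String candidate = String.join(" ", tokens.subList(i, i + n));
                if (phrases.contains(candidate)) {
                    matched = n;
                    break;
                }
            }
            if (matched > 0) {
                // Emit the whole phrase as a single token and skip past it.
                out.add(String.join("_", tokens.subList(i, i + matched)));
                i += matched;
            } else {
                // No phrase starts here; pass the token through unchanged.
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> phrases = new HashSet<>(Arrays.asList("new york", "new york city"));
        // prints [hotels, in, new_york_city]
        System.out.println(autoPhrase(Arrays.asList("hotels", "in", "new", "york", "city"), phrases, 3));
        // prints [new, car, in, york]
        System.out.println(autoPhrase(Arrays.asList("new", "car", "in", "york"), phrases, 3));
    }
}
```

The second call shows the precision gain: "new" and "york" on their own are left as ordinary tokens and would not match a document that was indexed with the single token for the phrase.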

I think that the query parser solution in 
[SOLR-5379|https://issues.apache.org/jira/browse/SOLR-5379] is better as it 
solves the problem in a general way. Getting non-synonymous phrases into it 
may require some tweaking so that the TypeAttribute matches up. I wouldn't use 
typeAttribute.type().equals("SYNONYM"); maybe typeAttribute.type() should be 
"PHRASE". So if both are committed, we should remove the 
AutoPhrasingQParserPlugin from this patch because both solve the exact same 
problem.
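
As a rough sketch of that suggestion, assuming a parser that auto-quotes based on token type: the code below is plain Java with a stand-in TypedToken class in place of Lucene's TypeAttribute, and the names TypeQuotingSketch, TypedToken, PHRASE_TYPE and toQueryString are hypothetical. It is not taken from either patch. It only shows the type check being made against "PHRASE" together with the internal whitespace test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not code from SOLR-5379 or SOLR-7136.
final class TypeQuotingSketch {

    // Stand-in for a token that carries a type value such as "word" or "PHRASE".
    static final class TypedToken {
        final String text;
        final String type;

        TypedToken(String text, String type) {
            this.text = text;
            this.type = type;
        }
    }

    // Assumed type name, following the "PHRASE" suggestion in the comment above.
    static final String PHRASE_TYPE = "PHRASE";

    // Quotes tokens that are typed as phrases and contain internal whitespace;
    // every other token is emitted as a bare term.
    static String toQueryString(List<TypedToken> tokens) {
        List<String> clauses = new ArrayList<>();
        for (TypedToken t : tokens) {
            boolean multiWord = t.text.indexOf(' ') >= 0;
            if (PHRASE_TYPE.equals(t.type) && multiWord) {
                clauses.add("\"" + t.text + "\"");
            } else {
                clauses.add(t.text);
            }
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        List<TypedToken> tokens = Arrays.asList(
                new TypedToken("hotels", "word"),
                new TypedToken("new york city", PHRASE_TYPE));
        // prints hotels "new york city"
        System.out.println(toQueryString(tokens));
    }
}
```

Keying on a dedicated type rather than "SYNONYM" would let phrases that have no synonym entry take the same path.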

The autophrasing multi-term synonym solution does have the disadvantage of 
requiring coupling between the autophrases.txt and synonyms.txt, which the 
other solution does not. But that said, the other solution does not deal with 
multi-word terms that do not have synonyms (I suppose that you could create a 
dummy synonym but that would be difficult to maintain).

To answer your question about a 'superset' - yes if you consider that the 
solutions for multi-term synonym mapping would be equivalent. All in all, I 
would say that both solutions are useful and would add useful functionality to 
the available Solr toolset. Dealing with multi-word terms is a problem that 
many Solr deployments have and it is one that remains unresolved.




was (Author: tedsullivan):
[~otis] I am looking at this now - I have not tested/compared these solutions 
yet. I will definitely do that. The strategy of 
[SOLR-5379|https://issues.apache.org/jira/browse/SOLR-5379] is a clever one - 
it forces tokenization of phrases that are in synonyms.txt by detecting 
internal whitespace and then either forces PhraseQuery logic or automatic 
quoting when building the Lucene Query (using TypeAttributes). In that sense, 
the two ideas are similar.

The autophrasing token filter solves a slightly different problem than does 
[SOLR-5379|https://issues.apache.org/jira/browse/SOLR-5379] in that it does not 
require that a term be listed as a synonym of something else to get the correct 
semantic tokenization. Simply reducing false positives due to partial hits on a 
phrase can be a large improvement in precision and not everything has an 
obvious synonym. (That said, can you have a single term in synonyms.txt that 
does not map to anything else? I haven't tried that!) Therefore, it has value 
even if [SOLR-5379|https://issues.apache.org/jira/browse/SOLR-5379] is also 
committed (or patched). Another difference is that one of the 
[SOLR-5379|https://issues.apache.org/jira/browse/SOLR-5379] patches uses 
PhraseQuery whereas the solution of combining autophrasing with synonym mapping 
does not. How much of a performance difference this might entail I can't say - 
probably not a great deal unless we are talking about very large queries. The 
auto quoting parser patches work in a similar fashion to the 
AutoPhrasingQParserPlugin as a workaround to 
[LUCENE-2605|https://issues.apache.org/jira/browse/LUCENE-2605]. 

I think that the query parser solution in 
[SOLR-5379|https://issues.apache.org/jira/browse/SOLR-5379] is better as it 
solves the problem in a general way. Getting non-synonymous phrases into it 
may require some tweaking so that the TypeAttribute matches up. I wouldn't use 
typeAttribute.type().equals("SYNONYM"); maybe typeAttribute.type() should be 
"PHRASE". So if both are committed, we should remove the 
AutoPhrasingQParserPlugin from this patch because both solve the exact same 
problem.

The autophrasing multi-term synonym solution does have the disadvantage of 
requiring coupling between the autophrases.txt and synonyms.txt, which the 
other solution does not. But that said, the other solution does not deal with 
multi-word terms that do not have synonyms (I suppose that you could create a 
dummy synonym but that would be difficult to maintain).

To answer your question about a 'superset' - yes if you consider that the 
solutions for multi-term synonym mapping would be equivalent. All in all, I 
would say that both solutions are useful and would add useful functionality to 
the available Solr toolset. Dealing with multi-word terms is a problem that 
many Solr deployments have and it is one that remains unresolved.



> Add an AutoPhrasing TokenFilter
> -------------------------------
>
>                 Key: SOLR-7136
>                 URL: https://issues.apache.org/jira/browse/SOLR-7136
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Ted Sullivan
>         Attachments: SOLR-7136.patch, SOLR-7136.patch
>
>
> Adds an 'autophrasing' token filter that is designed to let noun phrases 
> that represent a single entity be tokenized as a single token. Adds 
> support for ManagedResources and for query parser auto-phrasing, given 
> LUCENE-2605.
> The rationale for this token filter and its use in solving the long-standing 
> multi-term synonym problem in Lucene/Solr has been documented online:
> http://lucidworks.com/blog/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/
> https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

