[
https://issues.apache.org/jira/browse/SOLR-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333833#comment-14333833
]
Ted Sullivan edited comment on SOLR-7136 at 2/23/15 9:29 PM:
-------------------------------------------------------------
Yes Ahmet - that is correct: this patch includes a QParserPlugin
(AutophrasingQParserPlugin) as a workaround for LUCENE-2605, which is also
mentioned in SOLR-5379. The query parser solution published by Nolan Lawson
and submitted as SOLR-4381 is a good solution too. Note, however, that the
AutoPhrasing parser first solves the problem of tokenizing phrases that
represent single entities as single tokens - making the Lucene docID lookup
cleaner. Solutions like SOLR-5379 solve this indirectly and may have different
edge cases, because not all phrases are meant to represent single entities. For
example, generalized phrase-processing parameters like mm or ps may not deal as
precisely with phrases that combine a multi-term entity with something else,
as in "New York City restaurants". Since it is part of an analysis pipeline,
the AutoPhrasingTokenFilter can be used in conjunction with the SynonymFilter
to solve the multi-term synonym problem, but that is an architectural solution.
In other words, this TokenFilter was not written to solve the multi-term
synonym problem - that is a side benefit of what it does, given the nature of
Lucene analysis chains. It has other benefits as well, simply by forcing
exact-match semantics on phrases that should be treated as semantic or
linguistic entities. It does have the downside of requiring autophrase lists,
but so does synonym processing.
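The core idea - collapsing a known multi-term phrase into a single token during
analysis - can be sketched roughly as follows. This is a toy illustration in
Python, not the patch's actual Java implementation; the greedy longest-match
strategy, the phrase list, and the "_" separator are assumptions for the sake
of the example:

```python
def autophrase(tokens, phrases, sep="_"):
    """Greedily replace known multi-word phrases with single joined tokens.

    tokens:  list of already-tokenized words
    phrases: known multi-word entities, e.g. loaded from an autophrase list
    """
    phrase_words = [p.split() for p in phrases]
    out, i = [], 0
    while i < len(tokens):
        # Find the longest phrase that matches starting at position i.
        best = None
        for words in phrase_words:
            if tokens[i:i + len(words)] == words:
                if best is None or len(words) > len(best):
                    best = words
        if best:
            out.append(sep.join(best))  # emit the phrase as one token
            i += len(best)
        else:
            out.append(tokens[i])       # ordinary token, pass through
            i += 1
    return out

tokens = autophrase("new york city restaurants".split(),
                    ["new york", "new york city"])
print(tokens)  # ['new_york_city', 'restaurants']
```

Because "new york city" becomes the single token new_york_city at both index
and query time, a query for the entity matches exactly, and downstream synonym
rules can then map that single token to its variants.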
> Add an AutoPhrasing TokenFilter
> -------------------------------
>
> Key: SOLR-7136
> URL: https://issues.apache.org/jira/browse/SOLR-7136
> Project: Solr
> Issue Type: New Feature
> Reporter: Ted Sullivan
> Attachments: SOLR-7136.patch
>
>
> Adds an 'autophrasing' token filter which is designed to enable noun phrases
> that represent a single entity to be tokenized in a singular fashion. Adds
> support for ManagedResources and query parser auto-phrasing support given
> LUCENE-2605.
> The rationale for this token filter and its use in solving the long-standing
> multi-term synonym problem in Lucene/Solr has been documented online:
> http://lucidworks.com/blog/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/
> https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/