[jira] [Comment Edited] (SOLR-7136) Add an AutoPhrasing TokenFilter

Ted Sullivan (JIRA) Tue, 24 Feb 2015 05:00:39 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333833#comment-14333833
 ]


Ted Sullivan edited comment on SOLR-7136 at 2/24/15 12:59 PM:
--------------------------------------------------------------

Yes Ahmet, that is correct: this patch includes a QParserPlugin 
(AutophrasingQParserPlugin) as a workaround for LUCENE-2605, which is also 
mentioned in SOLR-5379. The Query Parser solution published by Nolan Lawson 
and submitted as SOLR-4381 is a good solution too.  In fact, the "Solr guys 
don't give a flying F" comment on HN was in response to the fact that 
SOLR-4381, which was filed over 2 years ago, is still not committed. 

Note, however, that the AutoPhrasing parser first solves the problem of 
tokenizing phrases that represent single entities as single tokens, which makes 
the Lucene docID lookup cleaner.  Solutions like SOLR-5379 solve this 
indirectly and may have different edge cases, because not all phrases are meant 
to represent single entities. For example, generalized phrase processing 
paradigms like mm or ps may not deal as precisely with phrases that combine a 
multi-term entity with something else, like "New York City restaurants". Since 
it is part of an analysis pipeline, the AutophrasingTokenFilter can be used in 
conjunction with the SynonymTokenFilter to solve the multi-term synonym 
problem, but that is an architectural solution. In other words, this 
TokenFilter was not written to solve the multi-term synonym problem; that is a 
side benefit of what it does, given the nature of Lucene analysis chains.
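
To make the idea concrete, here is a minimal sketch in plain Python. It is not 
the code from the patch (which is a Java Lucene TokenFilter). The function 
names, the "_" joiner, and the dictionary-based synonym step are all 
assumptions made for illustration only. It shows a greedy longest-match pass 
that collapses listed phrases into single tokens, followed by a 
single-token synonym lookup.

```python
def autophrase(tokens, phrases, joiner="_"):
    """Merge the longest listed phrase starting at each position into one token.

    Hypothetical helper: `phrases` plays the role of an autophrase list, and
    `joiner` is an assumed separator, not necessarily what the patch emits.
    """
    # Store phrases as tuples of lowercase words for exact comparison.
    phrase_set = {tuple(p.lower().split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=0)

    out = []
    i = 0
    while i < len(tokens):
        # Try the longest window first so "new york city" beats "new york".
        # Windows shorter than two words are never treated as phrases.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            window = tuple(t.lower() for t in tokens[i:i + n])
            if window in phrase_set:
                out.append(joiner.join(window))
                i += n
                break
        else:
            # No phrase starts here: pass the token through, lowercased.
            out.append(tokens[i].lower())
            i += 1
    return out


def apply_synonyms(tokens, synonyms):
    """Replace each token that has an entry in `synonyms` (hypothetical map)."""
    return [synonyms.get(t, t) for t in tokens]


if __name__ == "__main__":
    phrases = ["new york", "new york city", "big apple"]
    synonyms = {"big_apple": "new_york_city"}

    print(autophrase("New York City restaurants".split(), phrases))
    print(apply_synonyms(autophrase("Big Apple restaurants".split(), phrases),
                         synonyms))
```

Because each entity reaches the synonym step as one token, the multi-term 
synonym case reduces to an ordinary one-to-one lookup in this sketch.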

It has other benefits as well, simply by forcing exact-match semantics on 
phrases that should be treated as semantic or linguistic entities. It does have 
the downside of requiring autophrase lists, but then so does synonym processing.


> Add an AutoPhrasing TokenFilter
> -------------------------------
>
>                 Key: SOLR-7136
>                 URL: https://issues.apache.org/jira/browse/SOLR-7136
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Ted Sullivan
>         Attachments: SOLR-7136.patch
>
>
> Adds an 'autophrasing' token filter which is designed to enable noun phrases 
> that represent a single entity to be tokenized as a single token. Adds 
> support for ManagedResources and query parser auto-phrasing support, given 
> LUCENE-2605.
> The rationale for this Token Filter and its use in solving the long-standing 
> multi-term synonym problem in Lucene/Solr has been documented online. 
> http://lucidworks.com/blog/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/
> https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/


